Linux EXT4 FS development
* [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path
@ 2026-05-11  7:23 Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
                   ` (22 more replies)
  0 siblings, 23 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Hi,

This version is a small revision of v3 with no design changes. It fixes
some issues pointed out by Jan and Sashiko, and adds numerous comments
to clarify functionality and key considerations. You can get commits
here:

 https://github.com/zhangyi089/linux/commits/ext4_buffered_iomap_v4/

Original Cover-letter:
===

This series adds the iomap buffered I/O path support for regular files,
based on the latest upstream kernel. It implements the core iomap APIs
on ext4 and introduces the 'buffered_iomap' mount option to enable the
iomap buffered I/O path. It supports default features, default mount
options and the bigalloc feature. However, it does not support online
defragmentation, inline data, fsverity, fscrypt, non-extent inodes, or
data=journal mode; it falls back to the buffer_head I/O path
automatically if these features and options are used.
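
The fallback rule above can be sketched as a simple predicate. This is a
userspace illustration only: struct inode_features and
can_use_buffered_iomap() are hypothetical names, not the helpers added by
this series, and online defragmentation is left out because it is an
operation rather than a static inode property.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode feature summary; not an ext4 structure. */
struct inode_features {
	bool regular_file;
	bool extent_mapped;	/* false for non-extent inodes */
	bool inline_data;
	bool verity;
	bool encrypted;
	bool journal_data;	/* data=journal mode */
};

/*
 * Return true if the inode could take the iomap buffered I/O path,
 * false if it has to fall back to the buffer_head path.
 */
bool can_use_buffered_iomap(const struct inode_features *f,
			    bool mount_opt_enabled)
{
	assert(f);

	if (!mount_opt_enabled || !f->regular_file)
		return false;
	if (!f->extent_mapped || f->inline_data)
		return false;
	if (f->verity || f->encrypted || f->journal_data)
		return false;
	return true;
}
```

A regular, extent-mapped inode with none of the unsupported features set
passes the check only when the mount option is enabled; setting any one
of the unsupported features makes it fall back.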

This iomap buffered I/O path is not enabled by default because the
preceding features are not supported. Users can explicitly enable or
disable it via 'buffered_iomap' and 'nobuffered_iomap' mount options.

Key notes
=========

1. Lock ordering difference

   The lock ordering of folio lock and transaction start in the iomap
   path is the opposite of that in the buffer_head path.

2. data=ordered mode is not used

   Two main reasons:
   a) The lock ordering of folio lock and transaction start for
      data=ordered mode is opposite to the iomap path, which would cause
      a deadlock.
   b) The iomap writeback path does not support partial folio submission
      (required by data=ordered mode when block size < folio size, and
      it is currently handled by ext4_bio_write_folio()), which would
      also cause a deadlock.

   To replace data=ordered mode functionality:

   - For append write: Always allocate unwritten extents (dioread_nolock
     behavior) to prevent stale data exposure.

   - For post-EOF partial block zeroing: Issue zeroing I/O immediately
     and wait for its completion, either asynchronously or
     synchronously, before updating i_disksize. On ordered I/O
     completion, set i_disksize to i_size to avoid lost updates in the
     truncate up and append fallocate cases (suggested by Jan).

   - For online defragmentation: Not supported yet, needs further
     consideration.

3. Always enable dioread_nolock

   Two main reasons:
   a) Since data=ordered mode cannot be used, allocating written blocks
      directly would expose stale data.
   b) To optimize writeback, we should allocate blocks based on writeback
      length rather than per-folio mapping. Direct written allocation
      would over-allocate blocks.

   dioread_nolock has been the default mount option for many years, and
   Jan pointed out that we may no longer need to disable it, so this
   mount option can be gradually removed in the future.

Series structure
================

 - Patch 01-03: Simplify truncate operations and prepare for conversion.
 - Patch 04-16: Implement core iomap buffered read/write, writeback,
                mmap, and partial block zeroing paths.
 - Patch 17-21: Handle ordered I/O for zeroing post-EOF partial block.
 - Patch 22-23: Enable iomap buffered I/O path.

Testing and Performance
=======================

Tested with xfstests-bld using -g auto, fast_commit, and 64k
configurations. No new regressions were observed.

For the special case of zeroing a post-EOF partial block, I added a new
test, generic/790, to cover this scenario.

  https://lore.kernel.org/fstests/20260428085750.1072612-1-yi.zhang@huaweicloud.com/

Performance tested with fio on a 150 GB memory-backed virtual machine
(not much difference compared to v2 and v3, so the numbers are unchanged):

 Buffered write (MiB/s)
 ===

  bs       write cache    uncached write
           bh     iomap   bh      iomap
  1k       423    403     36.3    57
  4k       1067   1093    58.4    61
  64k      4321   6488    869     1206
  1M       4640   7378    3158    4818

 Buffered read (MiB/s)
 ===

  bs       read hole        read pre-cache     read ondisk data
           bh     iomap     bh     iomap       bh      iomap
  1k       635    643       661    653         605     602
  4k       1987   2075      2128   2159        1761    1716
  64k      6068   6267      9472   9545        4475    4451
  1M       5471   6072      8657   9191        4405    4467

Large I/O (64k and 1M) write throughput improved by roughly 40% to 60%.
Read performance showed no significant difference.
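
For reference, the relative gain for a single table cell is
(iomap - bh) / bh. A minimal C sketch of that arithmetic follows;
gain_percent() is a hypothetical helper used only for illustration and
is not part of the series.

```c
#include <assert.h>

/*
 * Relative throughput gain of the iomap path over the buffer_head
 * path, in percent. Both arguments are in MiB/s.
 */
double gain_percent(double bh_mibs, double iomap_mibs)
{
	assert(bh_mibs > 0.0);

	return (iomap_mibs - bh_mibs) / bh_mibs * 100.0;
}
```

With the table values, gain_percent(4321, 6488) for the 64k write-cache
case is about 50.2, and gain_percent(3158, 4818) for the 1M uncached
case is about 52.6.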

Changes since v3:
 - Rebased on the latest upstream kernel.
 - Improve commit messages for patches 07-23 to clarify functionality
   and key considerations.
 - Move the patches that enable iomap to the end of this series.
 - Patch 02: Move ext4_set_inode_size() declarations from ext4.h into
   inode.c, move truncate_pagecache() and ext4_truncate() to
   ext4_truncate_down() as Jan suggested.
 - Patch 08: Add check for non-extent inodes in the non-delalloc write
   path, and clarify the reason why we don't need to truncate blocks on
   short writes. (Pointed out by sashiko)
 - Patch 09: Fix the issue where DATA_ERR_ABORT fails to work in
   overwrite scenarios. Replace iomap_finish_ioends() with
   iomap_finish_ioend() during end_io to prevent might_sleep() being
   called in interrupt context. (Pointed out by sashiko)
 - Patch 11: Fix underflow of the nr_blks variable. (Pointed out by
   sashiko)
 - Patch 17: Factor out ext4_iomap_submit_zero_block() helper to handle
   ordered mode after zeroing a post-EOF partial block in the iomap
   path, also add comments.
 - Patch 18: Fix off-by-one in ext4_iomap_wb_ordered_wait() and clarify
   why a single i_ordered_len tracker suffices. (Pointed out by sashiko)
 - Patch 19: Fix an issue where the correct file size may be lost due to
   a missing memory barrier. (Pointed out by sashiko)
 - Patch 20: Change the logic for waiting on ordered I/Os in the insert
   range and collapse range from asynchronous to synchronous.
 - Patch 21: Allow per-inode journal mode changes but disallow per-inode
   extent type changes, and add comments on the restrictions on using
   iomap.

Changes since v2:
 - Rebased on the latest upstream kernel (7.1-rc1).
 - Added patches 01-03 to simplify truncate operations.
 - Added patch 13 to fix incorrect did_zero parameter in
   iomap_zero_range().
 - Added patches 19-22 to handle ordered I/O for zeroing post-EOF
   partial block.
 - Minor code and comment optimizations.

Changes since v1:
 - Rebase this series on linux-next 20260122.
 - Refactor partial block zero range, stop passing handle to
   ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
   partial block zeroing operation outside an active journal transaction
   to prevent potential deadlocks because of the lock ordering of folio
   and transaction start.
 - Clarify the lock ordering of folio lock and transaction start, update
   the comments accordingly.
 - Fix some issues related to fast commit and polluted post-EOF folios.
 - Some minor code and comment optimizations.

v3:     https://lore.kernel.org/linux-ext4/20260422021042.4157510-1-yi.zhang@huaweicloud.com/
v2:     https://lore.kernel.org/linux-ext4/20260203062523.3869120-1-yi.zhang@huawei.com/
v1:     https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Comments and suggestions are welcome!

Thanks,
Yi.

Zhang Yi (23):
  ext4: simplify size updating in ext4_setattr()
  ext4: factor out ext4_truncate_[up|down]()
  ext4: simplify error handling in ext4_setattr()
  ext4: add iomap address space operations for buffered I/O
  ext4: implement buffered read path using iomap
  ext4: pass out extent seq counter when mapping da blocks
  ext4: do not use data=ordered mode for inodes using buffered iomap
    path
  ext4: implement buffered write path using iomap
  ext4: implement writeback path using iomap
  ext4: implement mmap path using iomap
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: fix incorrect did_zero setting in iomap_zero_iter()
  ext4: implement partial block zero range path using iomap
  ext4: add block mapping tracepoints for iomap buffered I/O path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: submit zeroed post-EOF data immediately in the iomap buffered
    I/O path
  ext4: wait for ordered I/O in the iomap buffered I/O path
  ext4: update i_disksize to i_size on ordered I/O completion
  ext4: wait for ordered I/O to complete during insert and collapse
    range
  ext4: add tracepoints for ordered I/O in the iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: introduce a mount option for iomap buffered I/O path

 fs/ext4/ext4.h              |   57 +-
 fs/ext4/ext4_jbd2.c         |    8 +-
 fs/ext4/ext4_jbd2.h         |    7 +-
 fs/ext4/extents.c           |   18 +
 fs/ext4/file.c              |   20 +-
 fs/ext4/ialloc.c            |    1 +
 fs/ext4/inode.c             | 1040 ++++++++++++++++++++++++++++++-----
 fs/ext4/migrate.c           |    2 +
 fs/ext4/move_extent.c       |   11 +
 fs/ext4/page-io.c           |  210 +++++++
 fs/ext4/super.c             |   55 +-
 fs/iomap/buffered-io.c      |   22 +-
 fs/iomap/ioend.c            |    3 +-
 include/linux/iomap.h       |    1 +
 include/trace/events/ext4.h |  142 +++++
 15 files changed, 1446 insertions(+), 151 deletions(-)

-- 
2.52.0



* [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr()
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19  5:24   ` Ojaswin Mujoo
  2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

The logic for updating the file size in ext4_setattr() is currently
somewhat messy. By directly entering the error-handling path after
failing to add an orphan inode, the unnecessary recovery process
involving old_disksize and the file size can be avoided.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..0751dc55e94f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5953,7 +5953,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	if (attr->ia_valid & ATTR_SIZE) {
 		handle_t *handle;
 		loff_t oldsize = inode->i_size;
-		loff_t old_disksize;
 		int shrink = (attr->ia_size < inode->i_size);
 
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
@@ -6037,6 +6036,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 			if (ext4_handle_valid(handle) && shrink) {
 				error = ext4_orphan_add(handle, inode);
 				orphan = 1;
+				if (error)
+					goto out_handle;
 			}
 
 			if (shrink)
@@ -6052,23 +6053,18 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
 					inode->i_sb->s_blocksize_bits);
 
-			down_write(&EXT4_I(inode)->i_data_sem);
-			old_disksize = EXT4_I(inode)->i_disksize;
-			EXT4_I(inode)->i_disksize = attr->ia_size;
-
 			/*
 			 * We have to update i_size under i_data_sem together
 			 * with i_disksize to avoid races with writeback code
-			 * running ext4_wb_update_i_disksize().
+			 * updating disksize in mpage_map_and_submit_extent().
 			 */
-			if (!error)
-				i_size_write(inode, attr->ia_size);
-			else
-				EXT4_I(inode)->i_disksize = old_disksize;
+			down_write(&EXT4_I(inode)->i_data_sem);
+			i_size_write(inode, attr->ia_size);
+			EXT4_I(inode)->i_disksize = attr->ia_size;
 			up_write(&EXT4_I(inode)->i_data_sem);
-			rc = ext4_mark_inode_dirty(handle, inode);
-			if (!error)
-				error = rc;
+
+			error = ext4_mark_inode_dirty(handle, inode);
+out_handle:
 			ext4_journal_stop(handle);
 			if (error)
 				goto out_mmap_sem;
-- 
2.52.0



* [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]()
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19  6:05   ` Ojaswin Mujoo
  2026-06-16  9:31   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
                   ` (20 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Refactor ext4_setattr() by introducing two helper functions,
ext4_truncate_up() and ext4_truncate_down(), to handle size changes. The
current ATTR_SIZE processing consolidates checks for both shrinking and
non-shrinking cases, leading to cluttered code. Separating the
truncation paths improves readability.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 199 +++++++++++++++++++++++++++---------------------
 1 file changed, 112 insertions(+), 87 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0751dc55e94f..35e958f89bd5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5855,6 +5855,112 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
 	}
 }
 
+/*
+ * Set i_size and i_disksize to 'newsize'.
+ *
+ * Both i_rwsem and i_data_sem are required here to avoid races between
+ * generic append writeback and concurrent truncate that also modify
+ * i_size and i_disksize.
+ */
+static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
+{
+	WARN_ON_ONCE(S_ISREG(inode->i_mode) && !inode_is_locked(inode));
+
+	down_write(&EXT4_I(inode)->i_data_sem);
+	i_size_write(inode, newsize);
+	EXT4_I(inode)->i_disksize = newsize;
+	up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+static int ext4_truncate_up(struct inode *inode, loff_t oldsize, loff_t newsize)
+{
+	ext4_lblk_t old_lblk, new_lblk;
+	handle_t *handle;
+	int ret;
+
+	if (!IS_ALIGNED(oldsize | newsize, i_blocksize(inode))) {
+		ret = ext4_inode_attach_jinode(inode);
+		if (ret)
+			return ret;
+	}
+
+	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+	if (!IS_ALIGNED(oldsize, i_blocksize(inode))) {
+		ret = ext4_block_zero_eof(inode, oldsize, LLONG_MAX);
+		if (ret)
+			return ret;
+	}
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	old_lblk = oldsize > 0 ? (oldsize - 1) >> inode->i_blkbits : 0;
+	new_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
+	ext4_fc_track_range(handle, inode, old_lblk, new_lblk);
+
+	ext4_set_inode_size(inode, newsize);
+
+	ret = ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+	if (ret)
+		return ret;
+	/*
+	 * isize extend must be called outside an active handle due to
+	 * the lock ordering of transaction start and folio lock in the
+	 * iomap buffered I/O path (folio lock -> transaction start).
+	 */
+	pagecache_isize_extended(inode, oldsize, newsize);
+	return 0;
+}
+
+static int ext4_truncate_down(struct inode *inode, loff_t oldsize,
+			      loff_t newsize, int *orphan)
+{
+	ext4_lblk_t start_lblk;
+	handle_t *handle;
+	int ret;
+
+	/* Do not change i_size. */
+	if (newsize == oldsize)
+		goto truncate;
+
+	/* Shrink. */
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (ext4_handle_valid(handle)) {
+		ret = ext4_orphan_add(handle, inode);
+		*orphan = 1;
+		if (ret) {
+			ext4_journal_stop(handle);
+			return ret;
+		}
+	}
+
+	start_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
+	ext4_fc_track_range(handle, inode, start_lblk, EXT_MAX_BLOCKS - 1);
+
+	ext4_set_inode_size(inode, newsize);
+
+	ret = ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+	if (ret)
+		return ret;
+
+	if (ext4_should_journal_data(inode))
+		ext4_wait_for_tail_page_commit(inode);
+truncate:
+	/*
+	 * Truncate pagecache after we've waited for commit in data=journal
+	 * mode to make pages freeable.  Call ext4_truncate() even if
+	 * i_size didn't change to truncate possible preallocated blocks.
+	 */
+	truncate_pagecache(inode, newsize);
+	return ext4_truncate(inode);
+}
+
 /*
  * ext4_setattr()
  *
@@ -5951,7 +6057,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
-		handle_t *handle;
 		loff_t oldsize = inode->i_size;
 		int shrink = (attr->ia_size < inode->i_size);
 
@@ -6003,94 +6108,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 			goto err_out;
 		}
 
-		if (attr->ia_size != inode->i_size) {
-			/* attach jbd2 jinode for EOF folio tail zeroing */
-			if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
-			    oldsize & (inode->i_sb->s_blocksize - 1)) {
-				error = ext4_inode_attach_jinode(inode);
-				if (error)
-					goto out_mmap_sem;
-			}
-
-			/*
-			 * Update c/mtime and tail zero the EOF folio on
-			 * truncate up. ext4_truncate() handles the shrink case
-			 * below.
-			 */
-			if (!shrink) {
-				inode_set_mtime_to_ts(inode,
-						      inode_set_ctime_current(inode));
-				if (oldsize & (inode->i_sb->s_blocksize - 1)) {
-					error = ext4_block_zero_eof(inode,
-							oldsize, LLONG_MAX);
-					if (error)
-						goto out_mmap_sem;
-				}
-			}
-
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
-			if (IS_ERR(handle)) {
-				error = PTR_ERR(handle);
-				goto out_mmap_sem;
-			}
-			if (ext4_handle_valid(handle) && shrink) {
-				error = ext4_orphan_add(handle, inode);
-				orphan = 1;
-				if (error)
-					goto out_handle;
-			}
-
-			if (shrink)
-				ext4_fc_track_range(handle, inode,
-					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
-					inode->i_sb->s_blocksize_bits,
-					EXT_MAX_BLOCKS - 1);
-			else
-				ext4_fc_track_range(
-					handle, inode,
-					(oldsize > 0 ? oldsize - 1 : oldsize) >>
-					inode->i_sb->s_blocksize_bits,
-					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
-					inode->i_sb->s_blocksize_bits);
-
-			/*
-			 * We have to update i_size under i_data_sem together
-			 * with i_disksize to avoid races with writeback code
-			 * updating disksize in mpage_map_and_submit_extent().
-			 */
-			down_write(&EXT4_I(inode)->i_data_sem);
-			i_size_write(inode, attr->ia_size);
-			EXT4_I(inode)->i_disksize = attr->ia_size;
-			up_write(&EXT4_I(inode)->i_data_sem);
-
-			error = ext4_mark_inode_dirty(handle, inode);
-out_handle:
-			ext4_journal_stop(handle);
-			if (error)
-				goto out_mmap_sem;
-			if (!shrink) {
-				pagecache_isize_extended(inode, oldsize,
-							 inode->i_size);
-			} else if (ext4_should_journal_data(inode)) {
-				ext4_wait_for_tail_page_commit(inode);
-			}
+		if (attr->ia_size > oldsize)
+			error = ext4_truncate_up(inode, oldsize, attr->ia_size);
+		else {
+			/* Shrink or do not change i_size. */
+			error = ext4_truncate_down(inode, oldsize,
+						   attr->ia_size, &orphan);
 		}
 
-		/*
-		 * Truncate pagecache after we've waited for commit
-		 * in data=journal mode to make pages freeable.
-		 */
-		truncate_pagecache(inode, inode->i_size);
-		/*
-		 * Call ext4_truncate() even if i_size didn't change to
-		 * truncate possible preallocated blocks.
-		 */
-		if (attr->ia_size <= oldsize) {
-			rc = ext4_truncate(inode);
-			if (rc)
-				error = rc;
-		}
-out_mmap_sem:
 		filemap_invalidate_unlock(inode->i_mapping);
 	}
 
-- 
2.52.0

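The block-index and alignment arithmetic used by ext4_truncate_up() and
ext4_truncate_down() above can be checked in userspace. The sketch below
is an illustration under assumptions: last_lblk() and has_partial_tail()
are hypothetical names, and plain integer types stand in for loff_t and
ext4_lblk_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Logical block that holds the last byte of a file of 'size' bytes.
 * An empty file maps to block 0, matching the "size > 0 ? ... : 0"
 * expression in the patch.
 */
uint32_t last_lblk(int64_t size, unsigned int blkbits)
{
	assert(size >= 0 && blkbits < 32);

	return size > 0 ? (uint32_t)((size - 1) >> blkbits) : 0;
}

/*
 * Nonzero if 'size' is not a multiple of the block size, i.e. the
 * block at EOF is only partially used and may need tail zeroing.
 */
int has_partial_tail(int64_t size, unsigned int blkbits)
{
	assert(size >= 0 && blkbits < 32);

	return (size & ((INT64_C(1) << blkbits) - 1)) != 0;
}
```

With 4k blocks (blkbits = 12), a 4096-byte file ends in block 0 and has
no partial tail, while a 4097-byte file ends in block 1 and does.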


* [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr()
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19  6:08   ` Ojaswin Mujoo
  2026-06-16  9:36   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
                   ` (19 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Refactor the error handling in ext4_setattr() for better clarity:

 - Return directly on ext4_break_layouts() failure.
 - Propagate ext4_truncate() errors using the existing error variable
   and jump to the common 'err_out' label.
 - Propagate posix_acl_chmod() errors also through the error variable,
   as it theoretically does not return a non-fatal error.

With these changes, every error path either returns immediately or jumps
to err_out. Consequently, the "if (!error)" condition guarding
setattr_copy() and mark_inode_dirty() becomes unreachable for error
cases. Remove this redundant check along with the now-unused rc
variable.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 35e958f89bd5..b1ef706987c3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5989,7 +5989,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		 struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
-	int error, rc = 0;
+	int error;
 	int orphan = 0;
 	const unsigned int ia_valid = attr->ia_valid;
 	bool inc_ivers = true;
@@ -6102,10 +6102,10 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 
 		filemap_invalidate_lock(inode->i_mapping);
 
-		rc = ext4_break_layouts(inode);
-		if (rc) {
+		error = ext4_break_layouts(inode);
+		if (error) {
 			filemap_invalidate_unlock(inode->i_mapping);
-			goto err_out;
+			return error;
 		}
 
 		if (attr->ia_size > oldsize)
@@ -6117,15 +6117,19 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		}
 
 		filemap_invalidate_unlock(inode->i_mapping);
+		if (error)
+			goto err_out;
 	}
 
-	if (!error) {
-		if (inc_ivers)
-			inode_inc_iversion(inode);
-		setattr_copy(idmap, inode, attr);
-		mark_inode_dirty(inode);
-	}
+	if (inc_ivers)
+		inode_inc_iversion(inode);
+	setattr_copy(idmap, inode, attr);
+	mark_inode_dirty(inode);
 
+	if (ia_valid & ATTR_MODE)
+		error = posix_acl_chmod(idmap, dentry, inode->i_mode);
+
+err_out:
 	/*
 	 * If the call to ext4_truncate failed to get a transaction handle at
 	 * all, we need to clean up the in-core orphan list manually.
@@ -6133,14 +6137,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	if (orphan && inode->i_nlink)
 		ext4_orphan_del(NULL, inode);
 
-	if (!error && (ia_valid & ATTR_MODE))
-		rc = posix_acl_chmod(idmap, dentry, inode->i_mode);
-
-err_out:
-	if  (error)
+	if (error)
 		ext4_std_error(inode->i_sb, error);
-	if (!error)
-		error = rc;
 	return error;
 }
 
-- 
2.52.0



* [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (2 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19  9:28   ` Ojaswin Mujoo
  2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Introduce initial support for iomap in the buffered I/O path for regular
files on ext4.

  - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
    inode uses iomap instead of buffer_head for buffered I/O
  - Add helper ext4_inode_buffered_iomap() to check the flag
  - Add new address space operations ext4_iomap_aops with callbacks that
    will use generic iomap implementations
  - Add ext4_iomap_aops to ext4_set_aops() when the flag is set

The following callbacks (read_folio(), readahead(), and writepages())
are provided as placeholders and will be implemented in later patches.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  |  7 +++++++
 fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..1e27d73d7427 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1972,6 +1972,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_BUFFERED_IOMAP,	/* Inode uses iomap for buffered IO */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
@@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
 		!list_empty(&EXT4_I(inode)->i_orphan);
 }
 
+/* Whether the inode passes through the iomap infrastructure for buffered I/O */
+static inline bool ext4_inode_buffered_iomap(struct inode *inode)
+{
+	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+}
+
 /*
  * Codes for operating systems
  */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b1ef706987c3..178ac2be37b7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+{
+	return 0;
+}
+
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+
+}
+
+static int ext4_iomap_writepages(struct address_space *mapping,
+				 struct writeback_control *wbc)
+{
+	return 0;
+}
+
 /*
  * For data=journal mode, folio should be marked dirty only when it was
  * writeably mapped. When that happens, it was already attached to the
@@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
 	.swap_activate		= ext4_iomap_swap_activate,
 };
 
+static const struct address_space_operations ext4_iomap_aops = {
+	.read_folio		= ext4_iomap_read_folio,
+	.readahead		= ext4_iomap_readahead,
+	.writepages		= ext4_iomap_writepages,
+	.dirty_folio		= iomap_dirty_folio,
+	.bmap			= ext4_bmap,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.release_folio		= iomap_release_folio,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
+	.error_remove_folio	= generic_error_remove_folio,
+	.swap_activate		= ext4_iomap_swap_activate,
+};
+
 static const struct address_space_operations ext4_dax_aops = {
 	.writepages		= ext4_dax_writepages,
 	.dirty_folio		= noop_dirty_folio,
@@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
 	}
 	if (IS_DAX(inode))
 		inode->i_mapping->a_ops = &ext4_dax_aops;
+	else if (ext4_inode_buffered_iomap(inode))
+		inode->i_mapping->a_ops = &ext4_iomap_aops;
 	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
-- 
2.52.0



* [PATCH v4 05/23] ext4: implement buffered read path using iomap
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (3 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19 10:00   ` Ojaswin Mujoo
  2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Implement the iomap read path for ext4 by introducing a new
ext4_iomap_buffered_read_ops instance. This provides the read_folio()
and readahead() callbacks for ext4_iomap_aops. The implementation
introduces:

 - ext4_iomap_map_blocks(): Helper function to query extent mappings for
   a given read range using ext4_map_blocks() and convert the mapping
   information to iomap type
 - ext4_iomap_buffered_read_begin(): The iomap_begin callback that maps
   blocks, validates filesystem state, and populates the iomap. It
   returns -ERANGE for inline data which is not yet supported.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 178ac2be37b7..6c4d9137b279 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3908,14 +3908,57 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
+static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
+		loff_t length, struct ext4_map_blocks *map)
+{
+	u8 blkbits = inode->i_blkbits;
+
+	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+		return -EINVAL;
+
+	/* Calculate the first and last logical blocks respectively. */
+	map->m_lblk = offset >> blkbits;
+	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
+
+	return ext4_map_blocks(NULL, inode, map, 0);
+}
+
+static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
+		loff_t length, unsigned int flags, struct iomap *iomap,
+		struct iomap *srcmap)
+{
+	struct ext4_map_blocks map;
+	int ret;
+
+	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+		return -EIO;
+
+	/* Inline data support is not yet available. */
+	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+		return -ERANGE;
+
+	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
+	if (ret < 0)
+		return ret;
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+	return 0;
+}
+
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+	.iomap_begin = ext4_iomap_buffered_read_begin,
+};
+
 static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
 {
+	iomap_bio_read_folio(folio, &ext4_iomap_buffered_read_ops);
 	return 0;
 }
 
 static void ext4_iomap_readahead(struct readahead_control *rac)
 {
-
+	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
 }
 
 static int ext4_iomap_writepages(struct address_space *mapping,
-- 
2.52.0



* [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (4 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19 10:02   ` Ojaswin Mujoo
  2026-06-16 10:04   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
                   ` (16 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

The iomap buffered write path does not hold any locks between querying
inode extent mapping information and performing buffered writes. It
relies on the sequence counter saved in the inode to detect stale
mappings.

Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
blocks") added the m_seq field to ext4_map_blocks to pass out extent
sequence numbers, but it missed two callsites within
ext4_da_map_blocks(). These callsites are on the delayed allocation
path, which is also used in the iomap buffered write path. Pass out the
sequence counter to ensure stale mappings can be detected.
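
The stale-mapping detection described above can be modeled with a small
userspace sketch. All names here (toy_inode, toy_mapping, toy_map() and
so on) are hypothetical; the only idea taken from the patch is that a
mapping records the extent sequence counter at lookup time and is
considered valid only while the counter is unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_inode {
	uint64_t es_seq;	/* bumped on every extent tree change */
};

struct toy_mapping {
	uint64_t lblk;
	uint64_t len;
	uint64_t cookie;	/* es_seq sampled when the mapping was made */
};

/* Look up a mapping and remember the sequence counter seen at that time. */
static void toy_map(const struct toy_inode *inode, uint64_t lblk,
		    uint64_t len, struct toy_mapping *m)
{
	assert(len > 0);
	m->lblk = lblk;
	m->len = len;
	m->cookie = inode->es_seq;
}

/* Any change to the extent tree invalidates previously sampled cookies. */
static void toy_extent_change(struct toy_inode *inode)
{
	inode->es_seq++;
}

/* A mapping is usable only if no extent change happened since lookup. */
static bool toy_mapping_valid(const struct toy_inode *inode,
			      const struct toy_mapping *m)
{
	return m->cookie == inode->es_seq;
}
```

If a lookup path passes NULL instead of the counter, the cookie is never
sampled and this kind of comparison cannot be relied on, which is the
gap the two callsites below close.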

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6c4d9137b279..39577a6b65b9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
 	ext4_check_map_extents_env(inode);
 
 	/* Lookup extent status tree firstly */
-	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
 		map->m_len = min_t(unsigned int, map->m_len,
 				   es.es_len - (map->m_lblk - es.es_lblk));
 
@@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
 	 * is held in write mode, before inserting a new da entry in
 	 * the extent status tree.
 	 */
-	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
 		map->m_len = min_t(unsigned int, map->m_len,
 				   es.es_len - (map->m_lblk - es.es_lblk));
 
-- 
2.52.0



* [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (5 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-19 10:41   ` Ojaswin Mujoo
  2026-06-16 10:01   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
                   ` (15 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

The data=ordered mode introduces two fundamental conflicts with the
iomap buffered write path, leading to potential deadlocks.

1) Lock ordering conflict
   In the iomap writeback path, each folio is processed sequentially:
   the folio lock is acquired first, followed by starting a transaction
   to create block mappings. In data=ordered mode, writeback triggered
   by the journal commit process may attempt to acquire a folio lock
   that is already held by iomap. Meanwhile, iomap, under that same
   folio lock, may start a new transaction and wait for the currently
   committing transaction to finish, resulting in a deadlock.

2) Partial folio submission not supported
   When block size is smaller than folio size, a folio may contain both
   mapped and unmapped blocks. In data=ordered mode, if the journal
   waits for such a folio to be written back while the regular writeback
   process has already started committing it (with the writeback flag
   set), mapping the remaining unmapped blocks can deadlock. This is
   because the writeback flag is cleared only after the entire folio is
   processed and committed.

To support data=ordered mode, the iomap core would need two invasive
changes:
 - Acquire the transaction handle before locking any folio for
   writeback.
 - Support partial folio submission.

Both changes are complicated and risk performance regressions.
Therefore, we must avoid using data=ordered mode when converting to the
iomap path.

Currently, data=ordered mode is used in three scenarios:
 - Append write
 - Post-EOF partial block truncate-up followed by append write
 - Online defragmentation

We can address the first two without data=ordered mode:
 - For append write: always allocate unwritten blocks (i.e. always
   enable dioread_nolock), preserving the behavior of current
   extent-type inodes.
 - For post-EOF truncate-up + append write: postpone updating i_disksize
   until after the zeroed partial block has been written back.

Online defragmentation does not yet support iomap; this can be resolved
separately in the future.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4_jbd2.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..26999f173870 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
 
 static inline int ext4_should_order_data(struct inode *inode)
 {
-	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
+	/*
+	 * inodes using the iomap buffered I/O path do not use the
+	 * data=ordered mode.
+	 */
+	return !ext4_inode_buffered_iomap(inode) &&
+		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
 }
 
 static inline int ext4_should_writeback_data(struct inode *inode)
-- 
2.52.0



* [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (6 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-26 17:10   ` Ojaswin Mujoo
                     ` (2 more replies)
  2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
                   ` (14 subsequent siblings)
  22 siblings, 3 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Introduce two new iomap_ops instances for ext4 buffered writes:

 - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
   ext4_da_map_blocks() to map delalloc extents.
 - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
   ext4_iomap_get_blocks() to directly allocate blocks.

Also add ext4_iomap_valid() for the iomap infrastructure to check extent
validity.

Key changes and considerations:

 - Unwritten extents for new blocks (dioread_nolock always on)
   Because the iomap path does not rely on data=ordered mode to prevent
   stale data exposure in the non-delayed allocation path, new blocks
   are always allocated as unwritten extents.

 - Short write and write failure handling
   a. Delalloc path: On short write or failure, the stale delalloc range
      must be dropped and its space reservation released. Otherwise, a
      clean folio may cover leftover delalloc extents, causing
      inaccurate space reservation accounting.
   b. Non-delalloc path: No cleanup of allocated blocks is needed on
      short write.

 - Lock ordering reversal
   The folio lock and transaction start ordering is reversed compared to
   the buffer_head buffered write path. To handle this, the journal
   handle must be stopped in iomap_begin() callbacks. The lock ordering
   documentation in super.c has been updated accordingly.
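
The range that the delalloc path has to punch after a short write can be
sketched in userspace C. This is a simplified model, not the kernel
code: delalloc_punch_range() and struct punch_range are hypothetical
names, the rounding follows my understanding of
iomap_last_written_block(), and the checks on the iomap type and
IOMAP_F_NEW are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct punch_range {
	uint64_t start;	/* first byte to punch, block aligned */
	uint64_t end;	/* one past the last byte to punch, block aligned */
};

static uint64_t round_down_u64(uint64_t v, uint64_t bs)
{
	return v / bs * bs;
}

static uint64_t round_up_u64(uint64_t v, uint64_t bs)
{
	return (v + bs - 1) / bs * bs;
}

/*
 * Compute the block-aligned range left unwritten by a write of
 * 'written' bytes into [offset, offset + length). Returns false when
 * the whole reserved range was written and nothing needs punching.
 */
static bool delalloc_punch_range(uint64_t offset, uint64_t length,
				 uint64_t written, uint64_t blocksize,
				 struct punch_range *r)
{
	uint64_t start, end;

	assert(blocksize > 0);
	/* Blocks that received any data are kept, so round the start up. */
	start = written ? round_up_u64(offset + written, blocksize)
			: round_down_u64(offset, blocksize);
	end = round_up_u64(offset + length, blocksize);
	if (start >= end)
		return false;

	r->start = start;
	r->end = end;
	return true;
}
```

For example, with 4 KiB blocks, a 12288-byte write at offset 0 that only
copies 5000 bytes leaves [8192, 12288) to be punched, while a fully
completed write leaves nothing.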

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h  |   4 ++
 fs/ext4/file.c  |  20 +++++-
 fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/super.c |  10 ++-
 4 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1e27d73d7427..4832e7f7db82 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
 int do_journal_get_write_access(handle_t *handle, struct inode *inode,
 				struct buffer_head *bh);
 void ext4_set_inode_mapping_order(struct inode *inode);
+int ext4_nonda_switch(struct super_block *sb);
 #define FALL_BACK_TO_NONDELALLOC 1
 #define CONVERT_INLINE_DATA	 2
 
@@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 
 extern const struct iomap_ops ext4_iomap_ops;
 extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
+extern const struct iomap_write_ops ext4_iomap_write_ops;
 
 static inline int ext4_buffer_uptodate(struct buffer_head *bh)
 {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..7f9bfbbc4a4e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return count;
 }
 
+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+					 struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	const struct iomap_ops *iomap_ops;
+
+	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+		iomap_ops = &ext4_iomap_buffered_da_write_ops;
+	else
+		iomap_ops = &ext4_iomap_buffered_write_ops;
+
+	return iomap_file_buffered_write(iocb, from, iomap_ops,
+					 &ext4_iomap_write_ops, NULL);
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 					struct iov_iter *from)
 {
@@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (ret <= 0)
 		goto out;
 
-	ret = generic_perform_write(iocb, from);
+	if (ext4_inode_buffered_iomap(inode))
+		ret = ext4_iomap_buffered_write(iocb, from);
+	else
+		ret = generic_perform_write(iocb, from);
 
 out:
 	inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 39577a6b65b9..1ae7d3f4a1c8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	return ret;
 }
 
-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
 {
 	s64 free_clusters, dirty_clusters;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
 }
 
+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+const struct iomap_write_ops ext4_iomap_write_ops = {
+	.iomap_valid = ext4_iomap_valid,
+};
+
 static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 			   struct ext4_map_blocks *map, loff_t offset,
 			   loff_t length, unsigned int flags)
@@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		iomap->flags |= IOMAP_F_MERGED;
 
+	iomap->validity_cookie = map->m_seq;
+
 	/*
 	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
 	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
+/* Map blocks */
+typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
+
 static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
-		loff_t length, struct ext4_map_blocks *map)
+		loff_t length, ext4_get_blocks_t get_blocks,
+		struct ext4_map_blocks *map)
 {
 	u8 blkbits = inode->i_blkbits;
 
@@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
 	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
 
+	if (get_blocks)
+		return get_blocks(inode, map);
+
 	return ext4_map_blocks(NULL, inode, map, 0);
 }
 
@@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
 	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
 		return -ERANGE;
 
-	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
+	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
 	if (ret < 0)
 		return ret;
 
@@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
 	return 0;
 }
 
+static int ext4_iomap_get_blocks(struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	loff_t i_size = i_size_read(inode);
+	handle_t *handle;
+	int ret;
+
+	/*
+	 * Check whether the blocks have already been allocated. If so,
+	 * return the mapping information directly and avoid starting
+	 * a new journal transaction.
+	 */
+	if ((map->m_lblk + map->m_len) <=
+	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
+		ret = ext4_map_blocks(NULL, inode, map, 0);
+		if (ret < 0)
+			return ret;
+		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
+				    EXT4_MAP_DELAYED))
+			return 0;
+	}
+
+	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+			ext4_chunk_trans_blocks(inode, map->m_len));
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ret = ext4_map_blocks(handle, inode, map,
+			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+	/*
+	 * Stop handle here following the lock ordering of the folio lock
+	 * and the transaction start.
+	 */
+	ext4_journal_stop(handle);
+
+	return ret;
+}
+
+static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
+{
+	int ret, retries = 0;
+	struct ext4_map_blocks map;
+	ext4_get_blocks_t *get_blocks;
+
+	ret = ext4_emergency_state(inode->i_sb);
+	if (unlikely(ret))
+		return ret;
+
+	/* Inline data and non-extent are not supported. */
+	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+		return -ERANGE;
+	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return -EINVAL;
+	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
+		return -EINVAL;
+
+	if (delalloc)
+		get_blocks = ext4_da_map_blocks;
+	else
+		get_blocks = ext4_iomap_get_blocks;
+retry:
+	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+	if (ret < 0)
+		return ret;
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+	return 0;
+}
+
+static int ext4_iomap_buffered_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+						  iomap, srcmap, false);
+}
+
+static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+						  iomap, srcmap, true);
+}
+
+/*
+ * On write failure, drop the stale delayed allocation range and release
+ * its reserved space for both start and end blocks. Otherwise, we may
+ * leave a range of delayed extents covered by a clean folio, which can
+ * result in inaccurate space reservation accounting.
+ */
+static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+				     loff_t length, struct iomap *iomap)
+{
+	down_write(&EXT4_I(inode)->i_data_sem);
+	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
+	up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
+					    loff_t length, ssize_t written,
+					    unsigned int flags,
+					    struct iomap *iomap)
+{
+	loff_t start_byte, end_byte;
+
+	/* If we didn't reserve the blocks, we're not allowed to punch them. */
+	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+		return 0;
+
+	/* Nothing to do if we've written the entire delalloc extent */
+	start_byte = iomap_last_written_block(inode, offset, written);
+	end_byte = round_up(offset + length, i_blocksize(inode));
+	if (start_byte >= end_byte)
+		return 0;
+
+	filemap_invalidate_lock(inode->i_mapping);
+	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
+				     iomap, ext4_iomap_punch_delalloc);
+	filemap_invalidate_unlock(inode->i_mapping);
+	return 0;
+}
+
+/*
+ * Since we always allocate unwritten extents, there is no need for
+ * iomap_end to clean up allocated blocks on a short write.
+ */
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_write_begin,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_da_write_begin,
+	.iomap_end = ext4_iomap_buffered_da_write_end,
+};
+
 const struct iomap_ops ext4_iomap_buffered_read_ops = {
 	.iomap_begin = ext4_iomap_buffered_read_begin,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..9bc294b769db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
  *   -> page lock -> i_data_sem (rw)
  *
  * buffered write path:
- * sb_start_write -> i_mutex -> mmap_lock
- * sb_start_write -> i_mutex -> transaction start -> page lock ->
- *   i_data_sem (rw)
+ * sb_start_write -> i_rwsem (w) -> mmap_lock
+ * - buffer_head path:
+ *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
+ *     i_data_sem (rw)
+ * - iomap path:
+ *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
+ *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
  *
  * truncate:
  * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
-- 
2.52.0



* [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (7 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 12:49   ` Ojaswin Mujoo
  2026-06-16 11:47   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Add the iomap writeback path for ext4 buffered I/O. This introduces:

 - ext4_iomap_writepages(): the main writeback entry point.
 - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
   block mapping and I/O submission.
 - A new end I/O worker for converting unwritten extents, updating file
   size, and handling DATA_ERR_ABORT after I/O completion.

Core implementation details:

 - ->writeback_range() callback
   Calls ext4_iomap_map_writeback_range() to query the longest range of
   existing mapped extents. For performance, when a block range is not
   yet allocated, it allocates based on the writeback length and delalloc
   extent length, rather than allocating for a single folio at a time.
   The folio is then added to an iomap_ioend instance.

 - ->writeback_submit() callback
   Registers ext4_iomap_end_bio() as the end bio callback. This callback
   schedules a worker to handle:
   - Unwritten extent conversion.
   - i_disksize update after data is written back.
   - Journal abort on writeback I/O failure.

Key changes and considerations:

- Append write and unwritten extents
  Because the iomap path does not rely on data=ordered mode to prevent
  stale data exposure during append writeback, new blocks are always
  allocated as unwritten extents (i.e. dioread_nolock is always
  enabled), and the i_disksize update is postponed until I/O
  completion. Additionally, the deadlock that the reserve handle was
  meant to resolve no longer occurs. Therefore, the end I/O worker can
  start a normal journal handle instead of a reserve handle when
  converting unwritten extents.

- Lock ordering
  The ->writeback_range() callback runs under the folio lock, requiring
  the journal handle to be started under that same lock. This reverses
  the order compared to the buffer_head writeback path. The lock ordering
  documentation in super.c has been updated accordingly.
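
The postponed i_disksize update can be modeled as a pure function. This
is a sketch under simplifying assumptions: toy_new_disksize() is a
hypothetical name, and the locking (i_data_sem) and inode dirtying done
by the real end I/O worker are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the on-disk size after a writeback I/O ending at 'io_end' has
 * completed. The result never exceeds the in-memory size 'i_size',
 * which may have been reduced by a concurrent truncate, and it never
 * shrinks the current on-disk size.
 */
static uint64_t toy_new_disksize(uint64_t disksize, uint64_t i_size,
				 uint64_t io_end)
{
	uint64_t candidate = io_end < i_size ? io_end : i_size;
	uint64_t result = candidate > disksize ? candidate : disksize;

	/* The on-disk size only grows on this path. */
	assert(result >= disksize);
	return result;
}
```

For example, with an on-disk size of 4096 and an I/O ending at 8192, the
new on-disk size is 8192 when i_size is 10000, 6000 when i_size is 6000,
and stays at 4096 when a racing truncate has already cut i_size to 2000.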

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h        |   4 +
 fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
 fs/ext4/super.c       |   7 +-
 fs/iomap/ioend.c      |   3 +-
 include/linux/iomap.h |   1 +
 6 files changed, 346 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4832e7f7db82..078feda47e36 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1173,6 +1173,8 @@ struct ext4_inode_info {
 	 */
 	struct list_head i_rsv_conversion_list;
 	struct work_struct i_rsv_conversion_work;
+	struct list_head i_iomap_ioend_list;
+	struct work_struct i_iomap_ioend_work;
 
 	/*
 	 * Transactions that contain inode's metadata needed to complete
@@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
 		size_t len);
 extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
 extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
+extern void ext4_iomap_end_io(struct work_struct *work);
+extern void ext4_iomap_end_bio(struct bio *bio);
 
 /* mmp.c */
 extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ae7d3f4a1c8..a80195bd6f20 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -44,6 +44,7 @@
 #include <linux/iversion.h>
 
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
 #include "xattr.h"
 #include "acl.h"
 #include "truncate.h"
@@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
 	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
 }
 
+static int ext4_iomap_map_one_extent(struct inode *inode,
+				     struct ext4_map_blocks *map)
+{
+	struct extent_status es;
+	handle_t *handle = NULL;
+	int credits, map_flags;
+	int retval;
+
+	credits = ext4_chunk_trans_blocks(inode, map->m_len);
+	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	map->m_flags = 0;
+	/*
+	 * It is necessary to look up extent and map blocks under i_data_sem
+	 * in write mode, otherwise, the delalloc extent may become stale
+	 * during concurrent truncate operations.
+	 */
+	ext4_fc_track_inode(handle, inode);
+	down_write(&EXT4_I(inode)->i_data_sem);
+	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
+		retval = es.es_len - (map->m_lblk - es.es_lblk);
+		map->m_len = min_t(unsigned int, retval, map->m_len);
+
+		if (ext4_es_is_delayed(&es)) {
+			map->m_flags |= EXT4_MAP_DELAYED;
+			trace_ext4_da_write_pages_extent(inode, map);
+			/*
+			 * Call ext4_map_create_blocks() to allocate any
+			 * delayed allocation blocks. It is possible that
+			 * we're going to need more metadata blocks, however
+			 * we must not fail because we're in writeback and
+			 * there is nothing we can do so it might result in
+			 * data loss. So use reserved blocks to allocate
+			 * metadata if possible.
+			 */
+			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
+				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
+				    EXT4_EX_NOCACHE;
+
+			retval = ext4_map_create_blocks(handle, inode, map,
+							map_flags);
+			if (retval > 0)
+				ext4_fc_track_range(handle, inode, map->m_lblk,
+						map->m_lblk + map->m_len - 1);
+			goto out;
+		} else if (unlikely(ext4_es_is_hole(&es)))
+			goto out;
+
+		/* Found written or unwritten extent. */
+		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
+		map->m_flags = ext4_es_is_written(&es) ?
+			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
+		goto out;
+	}
+
+	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
+out:
+	up_write(&EXT4_I(inode)->i_data_sem);
+	ext4_journal_stop(handle);
+	return retval < 0 ? retval : 0;
+}
+
+static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
+					  loff_t offset, unsigned int dirty_len)
+{
+	struct inode *inode = wpc->inode;
+	struct super_block *sb = inode->i_sb;
+	struct journal_s *journal = EXT4_SB(sb)->s_journal;
+	struct ext4_map_blocks map;
+	unsigned int blkbits = inode->i_blkbits;
+	unsigned int index = offset >> blkbits;
+	unsigned int blk_end, blk_len;
+	int ret;
+
+	ret = ext4_emergency_state(sb);
+	if (unlikely(ret))
+		return ret;
+
+	/* Check validity of the cached writeback mapping. */
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length &&
+	    ext4_iomap_valid(inode, &wpc->iomap))
+		return 0;
+
+	blk_len = dirty_len >> blkbits;
+	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
+				      (UINT_MAX - 1));
+	if (blk_end > index + blk_len)
+		blk_len = blk_end - index + 1;
+
+retry:
+	map.m_lblk = index;
+	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
+	ret = ext4_map_blocks(NULL, inode, &map,
+			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * The map is not a delalloc extent, so it must be either a hole
+	 * or an extent that has already been allocated.
+	 */
+	if (!(map.m_flags & EXT4_MAP_DELAYED))
+		goto out;
+
+	/* Map one delalloc extent. */
+	ret = ext4_iomap_map_one_extent(inode, &map);
+	if (ret < 0) {
+		if (ext4_emergency_state(sb))
+			return ret;
+
+		/*
+		 * Retry transient ENOSPC errors. If
+		 * ext4_count_free_clusters() is non-zero, a commit
+		 * should free up blocks.
+		 */
+		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
+			jbd2_journal_force_commit_nested(journal);
+			goto retry;
+		}
+
+		ext4_msg(sb, KERN_CRIT,
+			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
+			 inode->i_ino, (unsigned long long)map.m_lblk,
+			 (unsigned int)map.m_len, -ret);
+		ext4_msg(sb, KERN_CRIT,
+			 "This should not happen!! Data will be lost\n");
+		if (ret == -ENOSPC)
+			ext4_print_free_blocks(inode);
+		return ret;
+	}
+out:
+	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
+	return 0;
+}
+
+static void ext4_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+	struct inode *inode = folio->mapping->host;
+	loff_t length = folio_pos(folio) + folio_size(folio) - pos;
+
+	ext4_iomap_punch_delalloc(inode, pos, length, NULL);
+}
+
+static ssize_t ext4_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
+					  struct folio *folio, u64 offset,
+					  unsigned int len, u64 end_pos)
+{
+	ssize_t ret;
+
+	ret = ext4_iomap_map_writeback_range(wpc, offset, len);
+	if (!ret)
+		ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len);
+	if (ret < 0)
+		ext4_iomap_discard_folio(folio, offset);
+	return ret;
+}
+
+static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
+				       int error)
+{
+	struct iomap_ioend *ioend = wpc->wb_ctx;
+	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+
+	/*
+	 * After I/O completion, a worker needs to be scheduled when:
+	 * 1) Unwritten extents require conversion.
+	 * 2) The file size needs to be extended.
+	 * 3) The journal needs to be aborted due to an I/O error.
+	 */
+	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
+	    (ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize)) ||
+	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
+		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+
+	return iomap_ioend_writeback_submit(wpc, error);
+}
+
+static const struct iomap_writeback_ops ext4_writeback_ops = {
+	.writeback_range = ext4_iomap_writeback_range,
+	.writeback_submit = ext4_iomap_writeback_submit,
+};
+
 static int ext4_iomap_writepages(struct address_space *mapping,
 				 struct writeback_control *wbc)
 {
-	return 0;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = inode->i_sb;
+	long nr = wbc->nr_to_write;
+	int alloc_ctx, ret;
+	struct iomap_writepage_ctx wpc = {
+		.inode = inode,
+		.wbc = wbc,
+		.ops = &ext4_writeback_ops,
+	};
+
+	ret = ext4_emergency_state(sb);
+	if (unlikely(ret))
+		return ret;
+
+	alloc_ctx = ext4_writepages_down_read(sb);
+	trace_ext4_writepages(inode, wbc);
+	ret = iomap_writepages(&wpc);
+	trace_ext4_writepages_result(inode, wbc, ret, nr - wbc->nr_to_write);
+	ext4_writepages_up_read(sb, alloc_ctx);
+
+	return ret;
 }
 
 /*
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index dc82e7b57e75..3050c887329f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -22,6 +22,7 @@
 #include <linux/bio.h>
 #include <linux/workqueue.h>
 #include <linux/kernel.h>
+#include <linux/iomap.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/sched/mm.h>
@@ -611,3 +612,128 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
 
 	return 0;
 }
+
+static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
+					 loff_t end)
+{
+	loff_t new_disksize = end;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int ret;
+
+	if (new_disksize <= READ_ONCE(ei->i_disksize))
+		return 0;
+
+	/*
+	 * Update on-disk size after IO is completed. Races with truncate
+	 * are avoided by checking i_size under i_data_sem.
+	 */
+	down_write(&ei->i_data_sem);
+	new_disksize = min(new_disksize, i_size_read(inode));
+	if (new_disksize > ei->i_disksize)
+		ei->i_disksize = new_disksize;
+	up_write(&ei->i_data_sem);
+	ret = ext4_mark_inode_dirty(handle, inode);
+	if (ret)
+		EXT4_ERROR_INODE_ERR(inode, -ret, "Failed to mark inode dirty");
+
+	return ret;
+}
+
+static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
+{
+	struct inode *inode = ioend->io_inode;
+	struct super_block *sb = inode->i_sb;
+	loff_t pos = ioend->io_offset;
+	size_t size = ioend->io_size;
+	handle_t *handle;
+	int credits;
+	int ret, err;
+
+	ret = blk_status_to_errno(ioend->io_bio.bi_status);
+	if (unlikely(ret)) {
+		if (test_opt(sb, DATA_ERR_ABORT) && !ext4_emergency_state(sb))
+			jbd2_journal_abort(EXT4_SB(sb)->s_journal, ret);
+		goto out;
+	}
+
+	/* We may need to convert one extent and dirty the inode. */
+	credits = ext4_chunk_trans_blocks(inode,
+			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
+	handle = ext4_journal_start(inode, EXT4_HT_EXT_CONVERT, credits);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out_err;
+	}
+
+	if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN) {
+		ret = ext4_convert_unwritten_extents(handle, inode, pos, size);
+		if (ret)
+			goto out_journal;
+	}
+
+	ret = ext4_iomap_wb_update_disksize(handle, inode, pos + size);
+out_journal:
+	err = ext4_journal_stop(handle);
+	if (!ret)
+		ret = err;
+out_err:
+	if (ret < 0 && !ext4_emergency_state(sb)) {
+		ext4_msg(sb, KERN_EMERG,
+			 "failed to convert unwritten extents to written extents or update inode size -- potential data loss! (inode %llu, error %d)",
+			 inode->i_ino, ret);
+	}
+out:
+	iomap_finish_ioends(ioend, ret);
+}
+
+/*
+ * Work on buffered iomap completed IO, to convert unwritten extents to
+ * mapped extents
+ */
+void ext4_iomap_end_io(struct work_struct *work)
+{
+	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+						  i_iomap_ioend_work);
+	struct iomap_ioend *ioend;
+	struct list_head ioend_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+	list_replace_init(&ei->i_iomap_ioend_list, &ioend_list);
+	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+	iomap_sort_ioends(&ioend_list);
+	while (!list_empty(&ioend_list)) {
+		ioend = list_entry(ioend_list.next, struct iomap_ioend, io_list);
+		list_del_init(&ioend->io_list);
+		iomap_ioend_try_merge(ioend, &ioend_list);
+		ext4_iomap_finish_ioend(ioend);
+	}
+}
+
+void ext4_iomap_end_bio(struct bio *bio)
+{
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+	struct inode *inode = ioend->io_inode;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	unsigned long flags;
+
+	/* Needs to convert unwritten extents or update the i_disksize. */
+	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
+	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
+		goto defer;
+
+	/* Needs to abort the journal on data_err=abort.  */
+	if (unlikely(ioend->io_bio.bi_status))
+		goto defer;
+
+	iomap_finish_ioend(ioend, 0);
+	return;
+defer:
+	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+	if (list_empty(&ei->i_iomap_ioend_list))
+		queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
+	list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
+	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9bc294b769db..51d87db53543 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -123,7 +123,10 @@ static const struct fs_parameter_spec ext4_param_specs[];
  * sb_start_write -> i_mutex -> transaction start -> i_data_sem (rw)
  *
  * writepages:
- * transaction start -> page lock(s) -> i_data_sem (rw)
+ * - buffer_head path:
+ *   transaction start -> folio lock(s) -> i_data_sem (rw)
+ * - iomap path:
+ *   folio lock -> transaction start -> i_data_sem (rw)
  */
 
 static const struct fs_context_operations ext4_context_ops = {
@@ -1428,10 +1431,12 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 #endif
 	ei->jinode = NULL;
 	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
+	INIT_LIST_HEAD(&ei->i_iomap_ioend_list);
 	spin_lock_init(&ei->i_completed_io_lock);
 	ei->i_sync_tid = 0;
 	ei->i_datasync_tid = 0;
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+	INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index acf3cf98b23a..89bbd3027b81 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -305,7 +305,7 @@ ssize_t iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct folio *folio,
 }
 EXPORT_SYMBOL_GPL(iomap_add_to_ioend);
 
-static u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 {
 	if (ioend->io_parent) {
 		struct bio *bio = &ioend->io_bio;
@@ -333,6 +333,7 @@ static u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 		return iomap_finish_ioend_buffered_read(ioend);
 	return iomap_finish_ioend_buffered_write(ioend);
 }
+EXPORT_SYMBOL_GPL(iomap_finish_ioend);
 
 /*
  * Ioend completion routine for merged bios. This can only be called from task
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2c5685adf3a9..7974ed441300 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -479,6 +479,7 @@ struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
 		loff_t file_offset, u16 ioend_flags);
 struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
 		unsigned int max_len, bool is_append);
+u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error);
 void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
 void iomap_ioend_try_merge(struct iomap_ioend *ioend,
 		struct list_head *more_ioends);
-- 
2.52.0


* [PATCH v4 10/23] ext4: implement mmap path using iomap
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (8 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 12:56   ` Ojaswin Mujoo
  2026-06-16 11:56   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
                   ` (12 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path
for ext4. The heavy lifting is delegated to iomap_page_mkwrite(), which
only requires ext4_iomap_buffered_write_ops and
ext4_iomap_buffered_da_write_ops to allocate and map blocks.

Note that the lock ordering between folio lock and transaction start in
this path is reversed compared to the buffer_head buffered write path.
The lock ordering documentation in super.c has been updated accordingly.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
 fs/ext4/super.c |  8 ++++++--
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a80195bd6f20..c6fe42d012fc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4020,7 +4020,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
 		return -ERANGE;
 	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
 		return -EINVAL;
-	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
+	if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
 		return -EINVAL;
 
 	if (delalloc)
@@ -4080,6 +4080,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
 	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
 		return 0;
 
+	/*
+	 * iomap_page_mkwrite() will never fail in a way that requires delalloc
+	 * extents that it allocated to be revoked.  Hence never try to release
+	 * them here.
+	 */
+	if (flags & IOMAP_FAULT)
+		return 0;
+
 	/* Nothing to do if we've written the entire delalloc extent */
 	start_byte = iomap_last_written_block(inode, offset, written);
 	end_byte = round_up(offset + length, i_blocksize(inode));
@@ -7191,6 +7199,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
 	return ret;
 }
 
+static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	const struct iomap_ops *iomap_ops;
+
+	/*
+	 * ext4_nonda_switch() may write back this folio, so it must be
+	 * called before the folio is locked.
+	 */
+	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+		iomap_ops = &ext4_iomap_buffered_da_write_ops;
+	else
+		iomap_ops = &ext4_iomap_buffered_write_ops;
+
+	return iomap_page_mkwrite(vmf, iomap_ops, NULL);
+}
+
 vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -7213,6 +7238,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 
 	filemap_invalidate_lock_shared(mapping);
 
+	if (ext4_inode_buffered_iomap(inode)) {
+		ret = ext4_iomap_page_mkwrite(vmf);
+		goto out;
+	}
+
 	err = ext4_convert_inline_data(inode);
 	if (err)
 		goto out_ret;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 51d87db53543..62bfe05a64bc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
  * Lock ordering
  *
  * page fault path:
- * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
- *   -> page lock -> i_data_sem (rw)
+ * - buffer_head path:
+ *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
+ *     transaction start -> folio lock -> i_data_sem (rw)
+ * - iomap path:
+ *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
+ *     folio lock -> transaction start -> i_data_sem (rw)
  *
  * buffered write path:
  * sb_start_write -> i_rwsem (w) -> mmap_lock
-- 
2.52.0


* [PATCH v4 11/23] iomap: correct the range of a partial dirty clear
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (9 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:46   ` Christoph Hellwig
  2026-05-11  7:23 ` [PATCH v4 12/23] iomap: support invalidating partial folios Zhang Yi
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

The block range calculation in ifs_clear_range_dirty() is incorrect when
partially clearing a range in a folio. We cannot clear the dirty bit of
the first block or the last block if the start or end offset is not
blocksize-aligned. This has not yet caused any issues since we always
clear a whole folio in iomap_writeback_folio().

Fix this by rounding up the first block to blocksize alignment and
calculating the last block by rounding down (using truncation). Correct
the nr_blks calculation accordingly.
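
For illustration only (not part of the patch), the arithmetic can be
sketched in userspace C. The names clear_range_blocks() and
struct blk_range are hypothetical, and a power-of-two block size is
assumed. Only blocks fully covered by [off, off + len) are selected:

```c
#include <stddef.h>

/* Hypothetical helper type: half-open block range [first, last). */
struct blk_range {
	unsigned int first;
	unsigned int last;
};

/*
 * Compute the blocks fully covered by the byte range [off, off + len).
 * blkbits is log2 of the block size. The result is empty
 * (first >= last) when no block is fully covered.
 */
static struct blk_range clear_range_blocks(size_t off, size_t len,
					   unsigned int blkbits)
{
	size_t blksize = (size_t)1 << blkbits;
	struct blk_range r;

	/* Round the start up: a partially covered first block stays dirty. */
	r.first = (unsigned int)((off + blksize - 1) >> blkbits);
	/* Truncate the end: a partially covered last block stays dirty. */
	r.last = (unsigned int)((off + len) >> blkbits);
	return r;
}
```

With 1024-byte blocks, off = 512 and len = 2048 select only block 1,
whereas the old calculation would have cleared blocks 0 through 2.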

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
This is modified from:
 https://lore.kernel.org/linux-fsdevel/20240812121159.3775074-2-yi.zhang@huaweicloud.com/
Changes:
 - Use round_up() instead of DIV_ROUND_UP() to prevent wasted integer
   division.

 fs/iomap/buffered-io.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d7b648421a70..64351a448a8b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -176,13 +176,17 @@ static void ifs_clear_range_dirty(struct folio *folio,
 {
 	struct inode *inode = folio->mapping->host;
 	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
-	unsigned int first_blk = (off >> inode->i_blkbits);
-	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
-	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned int first_blk = round_up(off, i_blocksize(inode)) >>
+				 inode->i_blkbits;
+	unsigned int last_blk = (off + len) >> inode->i_blkbits;
 	unsigned long flags;
 
+	if (first_blk >= last_blk)
+		return;
+
 	spin_lock_irqsave(&ifs->state_lock, flags);
-	bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks);
+	bitmap_clear(ifs->state, first_blk + blks_per_folio,
+		     last_blk - first_blk);
 	spin_unlock_irqrestore(&ifs->state_lock, flags);
 }
 
-- 
2.52.0


* [PATCH v4 12/23] iomap: support invalidating partial folios
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (10 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter() Zhang Yi
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Currently, iomap_invalidate_folio() can only invalidate an entire
folio. If we truncate a partial folio on a filesystem where the block
size is smaller than the folio size, the dirty bits for the truncated
or punched blocks are left behind. Writeback will then attempt to map
the invalid hole range. Fortunately, this has not caused any real
problems so far because the ->writeback_range() function corrects the
length.

However, the implementation of FALLOC_FL_ZERO_RANGE in ext4 depends on
the support for invalidating partial folios. When ext4 partially zeroes
out a dirty and unwritten folio, it does not perform a flush first like
XFS. Therefore, if the dirty bits of the corresponding area cannot be
cleared, the zeroed area after writeback remains in the written state
rather than reverting to the unwritten state. Fix this by supporting
invalidation of partial folios.
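
For illustration only (not part of the patch), the full versus partial
decision can be modelled in userspace C. struct toy_folio and the
toy_*() helpers are hypothetical stand-ins for the folio and its
per-block dirty bitmap, and the sketch assumes that only blocks fully
inside the invalidated range lose their dirty bit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a folio's per-block state (max 64 blocks). */
struct toy_folio {
	size_t size;		/* folio size in bytes */
	unsigned int blkbits;	/* log2 of the block size */
	uint64_t dirty;		/* bit n set => block n is dirty */
};

/* Clear dirty bits only for blocks fully inside [off, off + len). */
static void toy_clear_range_dirty(struct toy_folio *f, size_t off, size_t len)
{
	size_t blksize = (size_t)1 << f->blkbits;
	unsigned int first = (unsigned int)((off + blksize - 1) >> f->blkbits);
	unsigned int last = (unsigned int)((off + len) >> f->blkbits);

	for (unsigned int i = first; i < last; i++)
		f->dirty &= ~(UINT64_C(1) << i);
}

/* Returns true if the whole folio was invalidated, false if only a part. */
static bool toy_invalidate_folio(struct toy_folio *f, size_t off, size_t len)
{
	if (off == 0 && len == f->size) {
		f->dirty = 0;	/* whole folio: drop all dirty state */
		return true;
	}
	/* Partial folio: drop dirty state only for the covered blocks. */
	toy_clear_range_dirty(f, off, len);
	return false;
}
```

In this model, invalidating the second half of a fully dirty 16K folio
with 4K blocks leaves only blocks 0 and 1 dirty, so a later writeback
pass would not try to map the punched blocks.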

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
This is cherry-picked from:
 https://lore.kernel.org/linux-fsdevel/20240812121159.3775074-3-yi.zhang@huaweicloud.com/
No code changes, only update the commit message to explain why Ext4
needs this.

 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 64351a448a8b..876c2f507f58 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -761,6 +761,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
 		ifs_free(folio);
+	} else {
+		iomap_clear_range_dirty(folio, offset, len);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
-- 
2.52.0


* [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter()
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (11 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 12/23] iomap: support invalidating partial folios Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

The did_zero output parameter was unconditionally set after the loop,
which is incorrect. It should only be set when the zeroing operation
actually completes, not when IOMAP_F_STALE is set or when
IOMAP_F_FOLIO_BATCH is set but !folio causes the loop to break early,
or when iomap_iter_advance() returns an error.

This causes did_zero to be incorrectly set when zeroing a clean
unwritten extent because the loop exits early without actually zeroing
any data.

Fix it by using a local variable to track whether any folio was actually
zeroed, and only set did_zero after the loop if zeroing happened.
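
For illustration only (not part of the patch), the flag-tracking
pattern can be modelled in userspace C. struct toy_chunk and
toy_zero_iter() are hypothetical; a chunk with needs_zero == false
stands in for any case where the loop exits early without zeroing:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chunk: needs_zero is false for, e.g., a clean unwritten range. */
struct toy_chunk {
	bool needs_zero;
	int error;	/* nonzero simulates a failure while advancing */
};

static int toy_zero_iter(const struct toy_chunk *chunks, size_t n,
			 bool *did_zero)
{
	bool zeroed = false;

	for (size_t i = 0; i < n; i++) {
		/* Nothing to zero here: stop early without claiming work. */
		if (!chunks[i].needs_zero)
			break;
		zeroed = true;	/* data was actually zeroed */
		/* The error path returns directly and leaves *did_zero alone. */
		if (chunks[i].error)
			return chunks[i].error;
	}

	/* Report zeroing only if at least one chunk was really zeroed. */
	if (did_zero && zeroed)
		*did_zero = true;
	return 0;
}
```

A caller that passes a single clean chunk gets a return value of 0 with
*did_zero untouched, which is the behaviour the fix aims for.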

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
This is cherry-picked from:
 https://lore.kernel.org/linux-fsdevel/20260310082250.3535486-1-yi.zhang@huaweicloud.com/
No changes.

 fs/iomap/buffered-io.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 876c2f507f58..27ab33edbdee 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1542,6 +1542,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		const struct iomap_write_ops *write_ops)
 {
 	u64 bytes = iomap_length(iter);
+	bool zeroed = false;
 	int status;
 
 	do {
@@ -1560,6 +1561,8 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		/* a NULL folio means we're done with a folio batch */
 		if (!folio) {
 			status = iomap_iter_advance_full(iter);
+			if (status)
+				return status;
 			break;
 		}
 
@@ -1570,6 +1573,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 				bytes);
 
 		folio_zero_range(folio, offset, bytes);
+		zeroed = true;
 		folio_mark_accessed(folio);
 
 		ret = iomap_write_end(iter, bytes, bytes, folio);
@@ -1579,10 +1583,10 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 
 		status = iomap_iter_advance(iter, bytes);
 		if (status)
-			break;
+			return status;
 	} while ((bytes = iomap_length(iter)) > 0);
 
-	if (did_zero)
+	if (did_zero && zeroed)
 		*did_zero = true;
 	return status;
 }
-- 
2.52.0


* [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (12 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter() Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 13:13   ` Ojaswin Mujoo
  2026-06-16 12:28   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
                   ` (8 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
ext4_iomap_block_zero_range() to implement block zeroing via the iomap
infrastructure for ext4.

ext4_iomap_block_zero_range() calls iomap_zero_range() with
ext4_iomap_zero_begin() as the callback. The callback locates and zeros
out either a mapped partial block or a dirty, unwritten partial block.
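
The zeroed range is expected to stay within a single block. For
illustration only (not part of the patch), that check can be written as
a userspace helper; range_within_one_block() is a hypothetical name,
and length is assumed to be greater than zero:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the byte range [from, from + length) lies within one block.
 * blkbits is log2 of the block size; length must be > 0.
 */
static bool range_within_one_block(int64_t from, int64_t length,
				   unsigned int blkbits)
{
	/* The first and last byte must fall into the same block index. */
	return (from >> blkbits) == ((from + length - 1) >> blkbits);
}
```

With 4K blocks, a 200-byte range starting at offset 4000 crosses from
block 0 into block 1 and is rejected, while a full aligned block passes.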

Important constraints:

Zeroing out under an active journal handle can cause deadlock, because
the order of acquiring the folio lock and starting a handle is
inconsistent with the iomap writeback path.

Therefore, ext4_iomap_block_zero_range():
- Must NOT be called under an active handle.
- Cannot rely on data=ordered mode to ensure zeroed data persistence
  before updating i_disksize (for the cases of post-EOF append write,
  post-EOF fallocate, and truncate up). Subsequent patches address
  this by submitting the zeroed data for I/O without waiting for
  completion, and by updating i_disksize to i_size only after the
  zeroed data has been written back.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6fe42d012fc..e0dae2501292 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
 	return 0;
 }
 
+static int ext4_iomap_zero_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+	struct ext4_map_blocks map;
+	u8 blkbits = inode->i_blkbits;
+	unsigned int iomap_flags = 0;
+	int ret;
+
+	ret = ext4_emergency_state(inode->i_sb);
+	if (unlikely(ret))
+		return ret;
+
+	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
+		return -EINVAL;
+
+	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Look up dirty folios for unwritten mappings within EOF. Providing
+	 * this bypasses the flush iomap uses to trigger extent conversion
+	 * when unwritten mappings have dirty pagecache in need of zeroing.
+	 */
+	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+		loff_t start = ((loff_t)map.m_lblk) << blkbits;
+		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
+
+		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
+		if ((start >> blkbits) < map.m_lblk + map.m_len)
+			map.m_len = (start >> blkbits) - map.m_lblk;
+	}
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+	iomap->flags |= iomap_flags;
+
+	return 0;
+}
+
+static const struct iomap_ops ext4_iomap_zero_ops = {
+	.iomap_begin = ext4_iomap_zero_begin,
+};
+
 /*
  * Since we always allocate unwritten extents, there is no need for
  * iomap_end to clean up allocated blocks on a short write.
@@ -4616,6 +4661,47 @@ static int ext4_block_journalled_zero_range(struct inode *inode, loff_t from,
 	return err;
 }
 
+static int ext4_block_iomap_zero_range(struct inode *inode, loff_t from,
+				       loff_t length, bool *did_zero,
+				       bool *zero_written)
+{
+	int ret;
+
+	/*
+	 * Zeroing out under an active handle can cause deadlock since
+	 * the order of acquiring the folio lock and starting a handle is
+	 * inconsistent with the iomap writeback procedure.
+	 */
+	if (WARN_ON_ONCE(ext4_handle_valid(journal_current_handle())))
+		return -EINVAL;
+
+	/* The zeroing scope should not extend across a block. */
+	if (WARN_ON_ONCE((from >> inode->i_blkbits) !=
+			 ((from + length - 1) >> inode->i_blkbits)))
+		return -EINVAL;
+
+	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS) &&
+	    !(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
+		WARN_ON_ONCE(!inode_is_locked(inode) &&
+			!rwsem_is_locked(&inode->i_mapping->invalidate_lock));
+
+	ret = iomap_zero_range(inode, from, length, did_zero,
+			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
+			       NULL);
+	if (ret)
+		return ret;
+
+	/*
+	 * TODO: The iomap does not distinguish between different types of
+	 * zeroing and always sets zero_written if a zeroing operation is
+	 * performed, which may result in unnecessary order operations.
+	 */
+	if (did_zero && zero_written)
+		*zero_written = *did_zero;
+
+	return 0;
+}
+
 /*
  * Zeros out a mapping of length 'length' starting from file offset
  * 'from'.  The range to be zero'd must be contained with in one block.
@@ -4642,6 +4728,9 @@ static int ext4_block_zero_range(struct inode *inode,
 	} else if (ext4_should_journal_data(inode)) {
 		return ext4_block_journalled_zero_range(inode, from, length,
 							did_zero);
+	} else if (ext4_inode_buffered_iomap(inode)) {
+		return ext4_block_iomap_zero_range(inode, from, length,
+						   did_zero, zero_written);
 	}
 	return ext4_block_do_zero_range(inode, from, length, did_zero,
 					zero_written);
@@ -4682,6 +4771,9 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 	 * truncating up or performing an append write, because there might be
 	 * exposing stale on-disk data which may caused by concurrent post-EOF
 	 * mmap write during folio writeback.
+	 *
+	 * TODO: In the iomap path, handle this by updating i_disksize to
+	 * i_size after the zeroed data has been written back.
 	 */
 	if (ext4_should_order_data(inode) &&
 	    did_zero && zero_written && !IS_DAX(inode)) {
-- 
2.52.0


* [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (13 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 13:14   ` Ojaswin Mujoo
  2026-06-16 12:29   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
                   ` (7 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Add tracepoints for iomap buffered read, write, partial block zeroing,
and writeback operations to help debug the iomap buffered I/O path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c             |  6 +++++
 include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e0dae2501292..239d387ffaf2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3961,6 +3961,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
 	if (ret < 0)
 		return ret;
 
+	trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
+					     flags);
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 	return 0;
 }
@@ -4034,6 +4036,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
 	if (ret < 0)
 		return ret;
 
+	trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
+					      flags);
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 	return 0;
 }
@@ -4136,6 +4140,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
 			map.m_len = (start >> blkbits) - map.m_lblk;
 	}
 
+	trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 	iomap->flags |= iomap_flags;
 
@@ -4308,6 +4313,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
 		return ret;
 	}
 out:
+	trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
 	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
 	return 0;
 }
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f493642cf121..ebafa06cd191 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
 		  __entry->ret)
 );
 
+DECLARE_EVENT_CLASS(ext4_set_iomap_class,
+	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
+		 loff_t offset, loff_t length, unsigned int flags),
+	TP_ARGS(inode, map, offset, length, flags),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, ino)
+		__field(ext4_lblk_t, m_lblk)
+		__field(unsigned int, m_len)
+		__field(unsigned int, m_flags)
+		__field(u64, m_seq)
+		__field(loff_t, offset)
+		__field(loff_t, length)
+		__field(unsigned int, iomap_flags)
+	),
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->m_lblk		= map->m_lblk;
+		__entry->m_len		= map->m_len;
+		__entry->m_flags	= map->m_flags;
+		__entry->m_seq		= map->m_seq;
+		__entry->offset		= offset;
+		__entry->length		= length;
+		__entry->iomap_flags	= flags;
+
+	),
+	TP_printk("dev %d:%d ino %llu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino, __entry->m_lblk, __entry->m_len,
+		  show_mflags(__entry->m_flags), __entry->m_seq,
+		  __entry->offset, __entry->length, __entry->iomap_flags)
+)
+
+#define DEFINE_SET_IOMAP_EVENT(name) \
+DEFINE_EVENT(ext4_set_iomap_class, name, \
+	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
+		 loff_t offset, loff_t length, unsigned int flags), \
+	TP_ARGS(inode, map, offset, length, flags))
+
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.52.0


* [PATCH v4 16/23] ext4: disable online defrag when inode using iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (14 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 13:14   ` Ojaswin Mujoo
  2026-06-16 12:30   ` Jan Kara
  2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
                   ` (6 subsequent siblings)
  22 siblings, 2 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Online defragmentation does not currently support inodes using the
iomap buffered I/O path. The existing implementation relies on
buffer_head for sub-folio block management and data=ordered mode for
data consistency, both of which are incompatible with the iomap path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/move_extent.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 3329b7ad5dbd..f707a1096544 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
 		return -EOPNOTSUPP;
 	}
 
+	/*
+	 * TODO: support online defrag for inodes that use the iomap
+	 * buffered I/O path.
+	 */
+	if (ext4_inode_buffered_iomap(orig_inode) ||
+	    ext4_inode_buffered_iomap(donor_inode)) {
+		ext4_msg(sb, KERN_ERR,
+			 "Online defrag not supported for inode with iomap buffered IO path");
+		return -EOPNOTSUPP;
+	}
+
 	if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
 		ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %llu, donor %llu]\n",
 			   orig_inode->i_ino, donor_inode->i_ino);
-- 
2.52.0


* [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (15 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 13:41   ` Ojaswin Mujoo
  2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

In the generic buffer_head I/O path, we rely on data=ordered mode to
ensure that the zeroed EOF block data is written before i_disksize is
updated, thus preventing stale data from being exposed.

However, the iomap buffered I/O path cannot use this mechanism. Instead,
we issue the I/O immediately after performing the zero operation, without
waiting synchronously, to avoid a performance penalty. This reduces the
risk of exposing stale data, but it does not guarantee that the zeroed
data is flushed to disk before the i_disksize metadata update is written.
Subsequent patches will wait for this I/O to complete before updating
i_disksize.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 239d387ffaf2..e013aeb03d7b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
 					zero_written);
 }
 
+static int ext4_iomap_submit_zero_block(struct inode *inode,
+					loff_t from, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	bool do_submit = false;
+
+	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
+	if (IS_ERR(folio))
+		/* Already written back and reclaimed? */
+		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
+
+	folio_wait_writeback(folio);
+	WARN_ON_ONCE(folio_test_writeback(folio));
+
+	if (likely(folio_test_dirty(folio)))
+		do_submit = true;
+	folio_unlock(folio);
+	folio_put(folio);
+
+	/* Submit zeroed block. */
+	if (do_submit)
+		return filemap_fdatawrite_range(mapping, from, end - 1);
+	return 0;
+}
+
 /*
  * Zero out a mapping from file offset 'from' up to the end of the block
  * which corresponds to 'from' or to the given 'end' inside this block.
@@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
 		return 0;
 
-	if (length > blocksize - offset)
+	if (length > blocksize - offset) {
 		length = blocksize - offset;
+		end = from + length;
+	}
 
 	err = ext4_block_zero_range(inode, from, length,
 				    &did_zero, &zero_written);
@@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 	 * TODO: In the iomap path, handle this by updating i_disksize to
 	 * i_size after the zeroed data has been written back.
 	 */
-	if (ext4_should_order_data(inode) &&
-	    did_zero && zero_written && !IS_DAX(inode)) {
-		handle_t *handle;
+	if (did_zero && zero_written && !IS_DAX(inode)) {
+		if (ext4_should_order_data(inode)) {
+			handle_t *handle;
 
-		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
-		if (IS_ERR(handle))
-			return PTR_ERR(handle);
+			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+			if (IS_ERR(handle))
+				return PTR_ERR(handle);
 
-		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
-		ext4_journal_stop(handle);
-		if (err)
-			return err;
+			err = ext4_jbd2_inode_add_write(handle, inode, from,
+							length);
+			ext4_journal_stop(handle);
+			if (err)
+				return err;
+		/*
+		 * inodes using the iomap buffered I/O path do not use the
+		 * data=ordered mode. We submit zeroed range directly here.
+		 * Do not wait for I/O completion for performance.
+		 *
+		 * TODO: Any operation that extends i_disksize (including
+		 * append write end io past the zeroed boundary, truncate up,
+		 * and append fallocate) must wait for the relevant I/O to
+		 * complete before updating i_disksize.
+		 */
+		} else if (ext4_inode_buffered_iomap(inode)) {
+			err = ext4_iomap_submit_zero_block(inode, from, end);
+			if (err)
+				return err;
+		}
 	}
 
 	return 0;
-- 
2.52.0
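
The range clamp added to ext4_block_zero_eof() in the hunk above can be
modelled in userspace as follows. This is only a sketch under stated
assumptions: struct zero_span and clamp_to_block() are hypothetical names,
and the derivation of offset and length (from & (blocksize - 1) and
end - from) is assumed, since those lines are not visible in the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; these names are not kernel API. */
struct zero_span {
	int64_t from;	/* first byte to zero */
	int64_t end;	/* one past the last byte to zero */
};

/*
 * Clamp [from, end) so that it never crosses the end of the block that
 * contains 'from'. 'blocksize' must be a power of two and end > from.
 */
static struct zero_span clamp_to_block(int64_t from, int64_t end,
				       int64_t blocksize)
{
	int64_t offset = from & (blocksize - 1);	/* assumed derivation */
	int64_t length = end - from;			/* assumed derivation */
	struct zero_span span;

	if (length > blocksize - offset)
		length = blocksize - offset;

	span.from = from;
	span.end = from + length;	/* mirrors "end = from + length" */
	return span;
}
```

With a 4096-byte block, clamp_to_block(5000, 20000, 4096) yields the span
[5000, 8192). The patch then passes end - 1 to filemap_fdatawrite_range(),
whose end argument is inclusive, so only the block holding the old EOF is
submitted.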



* [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (16 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-27 15:58   ` Ojaswin Mujoo
  2026-05-11  7:23 ` [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion Zhang Yi
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

For append writes, wait for ordered I/O to complete before updating
i_disksize. This ensures that zeroed data is flushed to disk before the
metadata update, preventing stale data from being exposed during
unaligned post-EOF append writes.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
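The condition under which an append writeback has to wait, as described in
the commit message above and implemented in ext4_iomap_wb_ordered_wait()
below, can be sketched as a pure function. This is a userspace model, not
the kernel code: must_wait_for_ordered_io() and its flat parameter list
are hypothetical, and the locking and memory barriers of the real function
are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the decision made before i_disksize is updated.
 * Returns true when the caller would have to sleep until the ordered
 * zeroing I/O has completed.
 */
static bool must_wait_for_ordered_io(int64_t disksize, int64_t pos,
				     unsigned int blkbits,
				     uint32_t ordered_lblk,
				     uint32_t ordered_len)
{
	int64_t blocksize = (int64_t)1 << blkbits;
	int64_t eof_block_end;

	/* Block-aligned on-disk size: no partial EOF block to expose. */
	if ((disksize & (blocksize - 1)) == 0)
		return false;

	/* The write starts inside or before the old EOF block. */
	eof_block_end = (disksize + blocksize - 1) & ~(blocksize - 1);
	if (pos < eof_block_end)
		return false;

	/* No ordered zeroing I/O is in flight. */
	if (ordered_len == 0)
		return false;

	/* Wait only if the write lies at or beyond the tracked range. */
	return (uint64_t)(pos >> blkbits) >=
	       (uint64_t)ordered_lblk + ordered_len;
}
```

For example, with 4096-byte blocks, an on-disk size of 5000 and block 1
tracked as ordered, a writeback starting at offset 8192 must wait, while
one starting at offset 6000 does not.
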
 fs/ext4/ext4.h    | 11 +++++++
 fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
 fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
 fs/ext4/super.c   | 23 ++++++++++----
 4 files changed, 161 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 078feda47e36..9ce2128eea3e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1195,6 +1195,15 @@ struct ext4_inode_info {
 #ifdef CONFIG_FS_ENCRYPTION
 	struct fscrypt_inode_info *i_crypt_info;
 #endif
+
+	/*
+	 * Track ordered zeroed data during post-EOF append writes, fallocate,
+	 * and truncate-up operations. These parameters are used only in the
+	 * iomap buffered I/O path.
+	 */
+	ext4_lblk_t i_ordered_lblk;
+	ext4_lblk_t i_ordered_len;
+	wait_queue_head_t i_ordered_wq;
 };
 
 /*
@@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
 			     __u64 len, __u64 *moved_len);
 
 /* page-io.c */
+#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
+
 extern int __init ext4_init_pageio(void);
 extern void ext4_exit_pageio(void);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e013aeb03d7b..11fb369efeb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_ioend *ioend = wpc->wb_ctx;
 	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+	ext4_lblk_t start, end, order_lblk, order_len;
 
 	/*
 	 * After I/O completion, a worker needs to be scheduled when:
@@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
 	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
 		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
 
+	/*
+	 * Mark the I/O as ordered. Ordered I/O requires separate endio
+	 * handling and must not be merged with regular I/O operations.
+	 */
+	order_len = READ_ONCE(ei->i_ordered_len);
+	if (order_len) {
+		/*
+		 * Pair with smp_store_release() in ext4_block_zero_eof().
+		 * Ensure we see the updated i_ordered_lblk that was written
+		 * before the release store to i_ordered_len.
+		 */
+		smp_rmb();
+		order_lblk = READ_ONCE(ei->i_ordered_lblk);
+		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
+		end = EXT4_B_TO_LBLK(ioend->io_inode,
+				     ioend->io_offset + ioend->io_size);
+
+		if (start <= order_lblk && end >= order_lblk + order_len) {
+			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
+			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
+		}
+	}
+
 	return iomap_ioend_writeback_submit(wpc, error);
 }
 
@@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
 					loff_t from, loff_t end)
 {
 	struct address_space *mapping = inode->i_mapping;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct folio *folio;
 	bool do_submit = false;
+	int ret;
 
 	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
 	if (IS_ERR(folio))
@@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
 	folio_wait_writeback(folio);
 	WARN_ON_ONCE(folio_test_writeback(folio));
 
-	if (likely(folio_test_dirty(folio)))
+	/*
+	 * Mark the ordered range. It will be cleared upon I/O completion
+	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
+	 * (including append write end io past the zeroed boundary,
+	 * truncate up and append fallocate) must wait for this I/O to
+	 * complete before updating i_disksize.
+	 *
+	 * When multiple overlapping unaligned EOF writes are in flight, we
+	 * only need to track and wait for the first one. Subsequent writes
+	 * will zero the gap in memory and ensure that the zeroed data is
+	 * written out along with the valid data in the same block before
+	 * i_disksize is updated.
+	 */
+	if (likely(folio_test_dirty(folio) &&
+		   READ_ONCE(ei->i_ordered_len) == 0)) {
+		WRITE_ONCE(ei->i_ordered_lblk,
+			   from >> inode->i_blkbits);
+		/*
+		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
+		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
+		 * i_ordered_lblk is visible when i_ordered_len becomes
+		 * non-zero.
+		 */
+		smp_store_release(&ei->i_ordered_len, 1);
 		do_submit = true;
+	}
 	folio_unlock(folio);
 	folio_put(folio);
 
 	/* Submit zeroed block. */
-	if (do_submit)
-		return filemap_fdatawrite_range(mapping, from, end - 1);
+	if (do_submit) {
+		ret = filemap_fdatawrite_range(mapping, from, end - 1);
+		if (ret) {
+			/*
+			 * Pairs with wait_event() in
+			 * ext4_iomap_wb_ordered_wait(). Ensure
+			 * i_ordered_len = 0 is visible before waking up
+			 * waiters.
+			 */
+			smp_store_release(&ei->i_ordered_len, 0);
+			wake_up_all(&ei->i_ordered_wq);
+			return ret;
+		}
+	}
 	return 0;
 }
 
@@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 		 * data=ordered mode. We submit zeroed range directly here.
 		 * Do not wait for I/O completion for performance.
 		 *
-		 * TODO: Any operation that extends i_disksize (including
-		 * append write end io past the zeroed boundary, truncate up,
-		 * and append fallocate) must wait for the relevant I/O to
-		 * complete before updating i_disksize.
+		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
+		 * for I/O completion before updating i_disksize if the write
+		 * extends beyond the zeroed boundary.
+		 *
+		 * TODO: Any other operation that extends i_disksize
+		 * (including truncate up and append fallocate) must wait for
+		 * the relevant I/O to complete before updating i_disksize.
 		 */
 		} else if (ext4_inode_buffered_iomap(inode)) {
 			err = ext4_iomap_submit_zero_block(inode, from, end);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 3050c887329f..ad05ebb49bf6 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
 	return 0;
 }
 
+/*
+ * If the old disk size is not block size aligned and the current
+ * writeback range is entirely beyond the old EOF block, we should
+ * wait for the zeroed data written in ext4_block_zero_eof() to be
+ * written out, otherwise, it may expose stale data in that block.
+ */
+static void ext4_iomap_wb_ordered_wait(struct inode *inode,
+				       loff_t pos, loff_t end)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int blocksize = i_blocksize(inode);
+	loff_t disksize = READ_ONCE(ei->i_disksize);
+	ext4_lblk_t order_lblk, order_len;
+
+	/*
+	 * Waiting for ordered I/O is unnecessary when:
+	 * - The on-disk size is block-aligned (no stale data exists).
+	 * - The write start is within the block of the old EOF
+	 *   (overwriting, or appending to a block that already contains
+	 *   valid data).
+	 */
+	if (!(disksize & (blocksize - 1)) ||
+	    pos < round_up(disksize, blocksize))
+		return;
+
+	order_len = READ_ONCE(ei->i_ordered_len);
+	if (!order_len)
+		return;
+
+	/*
+	 * Pair with smp_store_release() in ext4_iomap_end_bio() and
+	 * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
+	 * that was written before the release store to i_ordered_len.
+	 */
+	smp_rmb();
+	order_lblk = READ_ONCE(ei->i_ordered_lblk);
+	if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
+		wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
+}
+
 static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
 					 loff_t end)
 {
@@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
 		goto out;
 	}
 
+	/* Wait for the ordered zeroed data to be written out. */
+	ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
+
 	/* We may need to convert one extent and dirty the inode. */
 	credits = ext4_chunk_trans_blocks(inode,
 			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
@@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
 	struct inode *inode = ioend->io_inode;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	unsigned long io_mode = (unsigned long)ioend->io_private;
 	unsigned long flags;
 
+	/*
+	 * This is an ordered I/O, clear the ordered range set in
+	 * ext4_block_zero_eof() and wake up all waiters that will update
+	 * the inode i_disksize.
+	 */
+	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
+		/*
+		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
+		 * Ensure i_ordered_len = 0 is visible before waking up
+		 * waiters.
+		 */
+		smp_store_release(&ei->i_ordered_len, 0);
+		wake_up_all(&ei->i_ordered_wq);
+		goto defer;
+	}
+
 	/* Needs to convert unwritten extents or update the i_disksize. */
 	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
 	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 62bfe05a64bc..9c0a00e716f3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+	ei->i_ordered_lblk = 0;
+	ei->i_ordered_len = 0;
+	init_waitqueue_head(&ei->i_ordered_wq);
 	return &ei->vfs_inode;
 }
 
@@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
 		dump_stack();
 	}
 
-	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
-	    WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
-		ext4_msg(inode->i_sb, KERN_ERR,
-			 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
-			 inode->i_ino, EXT4_I(inode),
-			 EXT4_I(inode)->i_reserved_data_blocks);
+	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
+		if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
+			ext4_msg(inode->i_sb, KERN_ERR,
+				 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
+				 inode->i_ino, EXT4_I(inode),
+				 EXT4_I(inode)->i_reserved_data_blocks);
+
+		if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
+			ext4_msg(inode->i_sb, KERN_ERR,
+				 "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
+				 inode->i_ino, EXT4_I(inode),
+				 EXT4_I(inode)->i_ordered_lblk,
+				 EXT4_I(inode)->i_ordered_len);
+	}
 }
 
 static void ext4_shutdown(struct super_block *sb)
-- 
2.52.0
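
The publication protocol used for i_ordered_lblk and i_ordered_len in the
patch above (write the block number first, then release-store the length;
readers check the length first, then read the block number) can be
approximated in userspace with C11 atomics. This is a sketch under
assumptions: struct ordered_range and the three helpers are hypothetical,
an acquire load stands in for the kernel's READ_ONCE() plus smp_rmb()
pairing, and the folio lock that serializes publishers in the patch is not
modelled, so the check-then-store in ordered_publish() is not atomic here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields added to ext4_inode_info. */
struct ordered_range {
	_Atomic uint32_t lblk;
	_Atomic uint32_t len;	/* non-zero means the range is valid */
};

/* Publisher: only the first in-flight zeroing I/O is tracked. */
static bool ordered_publish(struct ordered_range *r, uint32_t lblk,
			    uint32_t len)
{
	if (atomic_load_explicit(&r->len, memory_order_relaxed) != 0)
		return false;

	atomic_store_explicit(&r->lblk, lblk, memory_order_relaxed);
	/* Release: the lblk store is visible before len becomes non-zero. */
	atomic_store_explicit(&r->len, len, memory_order_release);
	return true;
}

/* Reader: a non-zero len guarantees that a matching lblk is observed. */
static bool ordered_snapshot(struct ordered_range *r, uint32_t *lblk,
			     uint32_t *len)
{
	uint32_t n = atomic_load_explicit(&r->len, memory_order_acquire);

	if (n == 0)
		return false;

	*lblk = atomic_load_explicit(&r->lblk, memory_order_relaxed);
	*len = n;
	return true;
}

/* Completion: clear the range; the real code then wakes all waiters. */
static void ordered_clear(struct ordered_range *r)
{
	atomic_store_explicit(&r->len, 0, memory_order_release);
}
```

The ordering matters because a reader that sees a non-zero length without
the matching block number could compare its writeback range against a
stale block and skip the wait.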



* [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (17 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range Zhang Yi
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Currently, i_disksize is updated after ordered data writeback to prevent
exposing stale data in the post-EOF block. However, operations like
append fallocate, zero range and truncate update i_disksize directly. If
the new i_disksize exceeds the original value, metadata may be written
back before the zeroed data is persisted. To avoid this, we defer
i_disksize updates when i_ordered_len is non-zero, only applying them
after ordered I/O completes.

However, this deferral introduces a new problem: on ordered I/O
completion, i_disksize is updated only to the end of that specific I/O,
discarding any later updates (e.g., from fallocate) and causing
filesystem inconsistency. A potential fix would involve scanning for
dirty or writeback folios beyond the current position, then updating
i_disksize to the start of the first such folio or to i_size. However,
folio scanning is expensive and concurrency with operations like
fallocate makes this approach prohibitively complex.

Instead, when ordered zero I/O completes, update i_disksize directly to
i_size. After a crash, this may expose zeroes in place of dirty data in
that range that had not yet reached disk, but it will never expose stale
data. This limitation is restricted to unaligned append writes and is
deemed acceptable.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
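The deferral policy described in the commit message above can be modelled
with a small state machine. This is a userspace sketch, not the kernel
implementation: struct inode_sizes, set_inode_size() and writeback_done()
are hypothetical names, locking (i_rwsem, i_data_sem) is omitted, and
clearing ordered_len is folded into the completion helper, whereas the
series clears it in ext4_iomap_end_bio() before the deferred worker runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the three fields involved. */
struct inode_sizes {
	int64_t i_size;		/* in-memory file size */
	int64_t i_disksize;	/* size to be recorded on disk */
	uint32_t ordered_len;	/* non-zero: zeroing I/O in flight */
};

/*
 * Truncate or fallocate path: i_size always moves, but growing
 * i_disksize is deferred while an ordered I/O is in flight.
 */
static void set_inode_size(struct inode_sizes *s, int64_t newsize)
{
	s->i_size = newsize;
	if (s->ordered_len == 0 || newsize < s->i_disksize)
		s->i_disksize = newsize;
}

/*
 * Writeback completion: a regular ioend advances i_disksize to the end
 * of the I/O (capped at i_size); an ordered ioend advances it to i_size
 * so that deferred updates are not lost.
 */
static void writeback_done(struct inode_sizes *s, int64_t io_end,
			   bool is_ordered)
{
	int64_t new_disksize;

	if (is_ordered) {
		s->ordered_len = 0;
		new_disksize = s->i_size;
	} else {
		new_disksize = io_end < s->i_size ? io_end : s->i_size;
	}

	if (new_disksize > s->i_disksize)
		s->i_disksize = new_disksize;
}
```

Starting from i_size == i_disksize == 5000 with an ordered I/O in flight,
set_inode_size(&s, 20000) leaves i_disksize at 5000, and the later
writeback_done(&s, 8192, true) moves it to 20000 rather than to 8192.
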
 fs/ext4/ext4.h    | 29 +++++++++++++++++++++++++----
 fs/ext4/inode.c   | 30 ++++++++++++++++++++----------
 fs/ext4/page-io.c | 25 ++++++++++++++++++++-----
 3 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9ce2128eea3e..0a3bb44f1e6e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3493,13 +3493,21 @@ do {								\
 #define EXT4_FREECLUSTERS_WATERMARK 0
 #endif
 
-/* Update i_disksize. Requires i_rwsem to avoid races with truncate */
+/*
+ * Update i_disksize. Requires i_rwsem to avoid races with truncate.
+ *
+ * In the iomap buffered I/O path, a non-zero i_ordered_len indicates that
+ * an ordered I/O (zeroing the EOF partial block) is still in progress.
+ * In that case, i_disksize will be updated after the ordered data has
+ * been written out.
+ */
 static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
 {
 	WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
 		     !inode_is_locked(inode));
 	down_write(&EXT4_I(inode)->i_data_sem);
-	if (newsize > EXT4_I(inode)->i_disksize)
+	if (newsize > EXT4_I(inode)->i_disksize &&
+	    READ_ONCE(EXT4_I(inode)->i_ordered_len) == 0)
 		WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize);
 	up_write(&EXT4_I(inode)->i_data_sem);
 }
@@ -3514,8 +3522,21 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
 		changed = 1;
 	}
 	if (newsize > EXT4_I(inode)->i_disksize) {
-		ext4_update_i_disksize(inode, newsize);
-		changed |= 2;
+		/*
+		 * Pairs with smp_store_release() in ext4_iomap_end_bio()
+		 * that clears i_ordered_len.  The smp_mb() ensures the
+		 * i_size store above is globally visible before we read
+		 * i_ordered_len.  This way, if we skip the i_disksize
+		 * update because i_ordered_len is still non-zero, the
+		 * ordered-I/O completion path (which reads i_size under
+		 * i_data_sem) is guaranteed to see the new i_size and will
+		 * update i_disksize correctly.
+		 */
+		smp_mb();
+		if (READ_ONCE(EXT4_I(inode)->i_ordered_len) == 0) {
+			ext4_update_i_disksize(inode, newsize);
+			changed |= 2;
+		}
 	}
 	return changed;
 }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 11fb369efeb1..1e208b3fad34 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4868,9 +4868,6 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 	 * truncating up or performing an append write, because there might be
 	 * exposing stale on-disk data which may caused by concurrent post-EOF
 	 * mmap write during folio writeback.
-	 *
-	 * TODO: In the iomap path, handle this by updating i_disksize to
-	 * i_size after the zeroed data has been written back.
 	 */
 	if (did_zero && zero_written && !IS_DAX(inode)) {
 		if (ext4_should_order_data(inode)) {
@@ -4894,9 +4891,15 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 		 * for I/O completion before updating i_disksize if the write
 		 * extends beyond the zeroed boundary.
 		 *
-		 * TODO: Any other operation that extends i_disksize
-		 * (including truncate up and append fallocate) must wait for
-		 * the relevant I/O to complete before updating i_disksize.
+		 * When zeroed I/O is in progress, operations that extend
+		 * i_disksize are handled as follows:
+		 *
+		 *  - Truncate up, append fallocate and zero_range:
+		 *    Defer the update. The file size will be updated to
+		 *    i_size by the end_io handler once the ongoing I/O
+		 *    completes.
+		 *
+		 *  - TODO: handle insert range and collapse range.
 		 */
 		} else if (ext4_inode_buffered_iomap(inode)) {
 			err = ext4_iomap_submit_zero_block(inode, from, end);
@@ -6512,11 +6515,16 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
 }
 
 /*
- * Set i_size and i_disksize to 'newsize'.
+ * Set i_size and i_disksize to 'newsize'.  In the iomap buffered I/O path,
+ * if i_ordered_len is non-zero and newsize exceeds the current i_disksize,
+ * the actual i_disksize update is deferred until after the ordered data is
+ * written out.  In that case, i_disksize will be set to i_size upon I/O
+ * completion.
  *
  * Both i_rwsem and i_data_sem are required here to avoid races between
- * generic append writeback and concurrent truncate that also modify
- * i_size and i_disksize.
+ * generic append writeback (or ordered I/O writeback) and concurrent
+ * operations (e.g., fallocate, truncate) that also modify i_size and
+ * i_disksize.
  */
 static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
 {
@@ -6524,7 +6532,9 @@ static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
 
 	down_write(&EXT4_I(inode)->i_data_sem);
 	i_size_write(inode, newsize);
-	EXT4_I(inode)->i_disksize = newsize;
+	if (READ_ONCE(EXT4_I(inode)->i_ordered_len) == 0 ||
+	    newsize < EXT4_I(inode)->i_disksize)
+		WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize);
 	up_write(&EXT4_I(inode)->i_data_sem);
 }
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index ad05ebb49bf6..2ad9f900c9f3 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -654,13 +654,13 @@ static void ext4_iomap_wb_ordered_wait(struct inode *inode,
 }
 
 static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
-					 loff_t end)
+					 loff_t end, bool is_ordered)
 {
-	loff_t new_disksize = end;
+	loff_t new_disksize, i_size;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	int ret;
 
-	if (new_disksize <= READ_ONCE(ei->i_disksize))
+	if (end <= READ_ONCE(ei->i_disksize) && !is_ordered)
 		return 0;
 
 	/*
@@ -668,7 +668,20 @@ static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
 	 * are avoided by checking i_size under i_data_sem.
 	 */
 	down_write(&ei->i_data_sem);
-	new_disksize = min(new_disksize, i_size_read(inode));
+	i_size = i_size_read(inode);
+
+	/*
+	 * Update i_disksize to i_size when completing an ordered I/O that
+	 * zeroes the old EOF partial block.  This is safe because we never
+	 * directly allocate written blocks during buffered writes.
+	 *
+	 * This ensures i_disksize is correctly advanced during truncate-up
+	 * or append fallocate on a block-unaligned file, preventing it
+	 * from remaining stale.  A downside is that zeroed data may be
+	 * exposed after crash recovery if the dirty data in this range is
+	 * not yet on disk, but stale data will never be exposed.
+	 */
+	new_disksize = is_ordered ? i_size : min(end, i_size);
 	if (new_disksize > ei->i_disksize)
 		ei->i_disksize = new_disksize;
 	up_write(&ei->i_data_sem);
@@ -685,6 +698,7 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
 	struct super_block *sb = inode->i_sb;
 	loff_t pos = ioend->io_offset;
 	size_t size = ioend->io_size;
+	unsigned long io_mode = (unsigned long)ioend->io_private;
 	handle_t *handle;
 	int credits;
 	int ret, err;
@@ -714,7 +728,8 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
 			goto out_journal;
 	}
 
-	ret = ext4_iomap_wb_update_disksize(handle, inode, pos + size);
+	ret = ext4_iomap_wb_update_disksize(handle, inode, pos + size,
+			io_mode == EXT4_IOMAP_IOEND_ORDER_IO);
 out_journal:
 	err = ext4_journal_stop(handle);
 	if (!ret)
-- 
2.52.0



* [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (18 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path Zhang Yi
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Currently, i_disksize is updated after ordered data writeback to prevent
exposing stale data in the post-EOF block, and operations like append
fallocate, zero range, and truncate defer the i_disksize update until
ordered I/O completes.

However, insert range and collapse range still directly update
i_disksize. This is safe because they have already called
filemap_write_and_wait_range() to flush data up to LLONG_MAX, ensuring
that ordered I/O has completed if any dirty data was present.

One exception is when the ordered I/O is caused by a previous truncate
up. In this case, there is no dirty data to flush. Therefore, add an
explicit wait for I/O completion to handle this case. This will not have
a significant impact on performance.

Finally, also add a WARN_ON_ONCE check before updating i_disksize to
detect any unexpected cases that could still expose stale data.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 18 ++++++++++++++++++
 fs/ext4/inode.c   |  4 +++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 125f628e738a..85c74c37f0ca 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5565,6 +5565,14 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
 	if (ret)
 		return ret;
 
+	/*
+	 * Wait for ordered I/O to complete. Updating i_disksize beyond
+	 * the current i_disksize here risks exposing stale data.
+	 */
+	if (ext4_inode_buffered_iomap(inode))
+		wait_event(EXT4_I(inode)->i_ordered_wq,
+			   READ_ONCE(EXT4_I(inode)->i_ordered_len) == 0);
+
 	truncate_pagecache(inode, start);
 
 	credits = ext4_chunk_trans_extent(inode, 0);
@@ -5597,6 +5605,7 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
 		goto out_handle;
 	}
 
+	WARN_ON_ONCE(READ_ONCE(EXT4_I(inode)->i_ordered_len) != 0);
 	new_size = inode->i_size - len;
 	i_size_write(inode, new_size);
 	EXT4_I(inode)->i_disksize = new_size;
@@ -5661,6 +5670,14 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
 	if (ret)
 		return ret;
 
+	/*
+	 * Wait for ordered I/O to complete. Updating i_disksize beyond
+	 * the current i_disksize here risks exposing stale data.
+	 */
+	if (ext4_inode_buffered_iomap(inode))
+		wait_event(EXT4_I(inode)->i_ordered_wq,
+			   READ_ONCE(EXT4_I(inode)->i_ordered_len) == 0);
+
 	truncate_pagecache(inode, start);
 
 	credits = ext4_chunk_trans_extent(inode, 0);
@@ -5671,6 +5688,7 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
 	ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
 
 	/* Expand file to avoid data loss if there is error while shifting */
+	WARN_ON_ONCE(READ_ONCE(EXT4_I(inode)->i_ordered_len) != 0);
 	inode->i_size += len;
 	EXT4_I(inode)->i_disksize += len;
 	ret = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23efb44f0c27..e47b504e85c9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4899,7 +4899,9 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 		 *    i_size by the end_io handler once the ongoing I/O
 		 *    completes.
 		 *
-		 *  - TODO: handle insert range and collapse range.
+		 *  - Insert range and collapse range operations:
+		 *    Wait synchronously for the relevant I/O to complete
+		 *    before updating i_disksize.
 		 */
 		} else if (ext4_inode_buffered_iomap(inode)) {
 			err = ext4_iomap_submit_zero_block(inode, from, end);
-- 
2.52.0



* [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (19 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

To facilitate the tracing of ordered I/Os in the iomap buffered I/O
path, add tracepoints to track the ordered I/O flow:

 - ext4_iomap_ordered_submit: trace when ordered I/O is being submitted;
 - ext4_iomap_ordered_complete: trace when ordered I/O completes;
 - ext4_iomap_disksize_update: trace when i_disksize is updated, either
   when an appending I/O or an ordered I/O completes;
 - ext4_block_zero_eof: trace zeroing of the EOF partial block.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
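For readers interpreting ext4_iomap_ordered_submit events: the tracepoint
sits inside the coverage check in ext4_iomap_writeback_submit(), so it
fires only when the ioend fully covers the tracked ordered range. A
userspace sketch of that check follows, assuming EXT4_B_TO_LBLK() rounds
the byte offset up to a block boundary; ioend_covers_ordered_range() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: does the byte range [io_offset, io_offset + io_size)
 * cover the logical block range [ordered_lblk, ordered_lblk + ordered_len)?
 */
static bool ioend_covers_ordered_range(int64_t io_offset, int64_t io_size,
				       unsigned int blkbits,
				       uint32_t ordered_lblk,
				       uint32_t ordered_len)
{
	int64_t blocksize = (int64_t)1 << blkbits;
	/* First block touched by the I/O (rounded down). */
	uint64_t start = (uint64_t)(io_offset >> blkbits);
	/* One past the last block touched (rounded up, as assumed). */
	uint64_t end = (uint64_t)((io_offset + io_size + blocksize - 1)
				  >> blkbits);

	return start <= ordered_lblk &&
	       end >= (uint64_t)ordered_lblk + ordered_len;
}
```

With 4096-byte blocks and block 1 tracked, an ioend at offset 5000 of
size 3192 covers the range, so the event would be emitted; an ioend
starting at offset 8192 does not, and no event is expected for it.
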
 fs/ext4/inode.c             |  4 ++
 fs/ext4/page-io.c           |  9 ++++
 include/trace/events/ext4.h | 97 +++++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e47b504e85c9..cf83b4e619e0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4376,6 +4376,9 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
 				     ioend->io_offset + ioend->io_size);
 
 		if (start <= order_lblk && end >= order_lblk + order_len) {
+			trace_ext4_iomap_ordered_submit(ioend->io_inode,
+					ioend->io_offset, ioend->io_size,
+					order_lblk, order_len);
 			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
 			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
 			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
@@ -4910,6 +4913,7 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 		}
 	}
 
+	trace_ext4_block_zero_eof(inode, from, length, did_zero, zero_written);
 	return 0;
 }
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 2ad9f900c9f3..b5b32dc388be 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -31,6 +31,8 @@
 #include "xattr.h"
 #include "acl.h"
 
+#include <trace/events/ext4.h>
+
 static struct kmem_cache *io_end_cachep;
 static struct kmem_cache *io_end_vec_cachep;
 
@@ -682,6 +684,9 @@ static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
 	 * not yet on disk, but stale data will never be exposed.
 	 */
 	new_disksize = is_ordered ? i_size : min(end, i_size);
+	trace_ext4_iomap_disksize_update(inode, end, i_size, ei->i_disksize,
+					 new_disksize, is_ordered);
+
 	if (new_disksize > ei->i_disksize)
 		ei->i_disksize = new_disksize;
 	up_write(&ei->i_data_sem);
@@ -784,6 +789,10 @@ void ext4_iomap_end_bio(struct bio *bio)
 	 * the inode i_disksize.
 	 */
 	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
+		trace_ext4_iomap_ordered_complete(inode, ioend->io_offset,
+				ioend->io_size, READ_ONCE(ei->i_ordered_lblk),
+				READ_ONCE(ei->i_ordered_len));
+
 		/*
 		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
 		 * Ensure i_ordered_len = 0 is visible before waking up
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ebafa06cd191..423aec6d09d1 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -3141,6 +3141,103 @@ DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
 DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
 DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
 
+/* Ordered I/O tracepoints for iomap buffered I/O path */
+DECLARE_EVENT_CLASS(ext4_iomap_ordered_io,
+	TP_PROTO(struct inode *inode, loff_t io_offset, size_t io_size,
+		 ext4_lblk_t i_ordered_lblk, unsigned int i_ordered_len),
+	TP_ARGS(inode, io_offset, io_size, i_ordered_lblk, i_ordered_len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, ino)
+		__field(loff_t, io_offset)
+		__field(size_t, io_size)
+		__field(ext4_lblk_t, i_ordered_lblk)
+		__field(unsigned int, i_ordered_len)
+	),
+	TP_fast_assign(
+		__entry->dev = inode->i_sb->s_dev;
+		__entry->ino = inode->i_ino;
+		__entry->io_offset = io_offset;
+		__entry->io_size = io_size;
+		__entry->i_ordered_lblk = i_ordered_lblk;
+		__entry->i_ordered_len = i_ordered_len;
+	),
+	TP_printk("dev %d:%d ino %llu io_offset %lld io_size %zu i_ordered_lblk %u i_ordered_len %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino, __entry->io_offset, __entry->io_size,
+		  __entry->i_ordered_lblk, __entry->i_ordered_len)
+);
+
+DEFINE_EVENT(ext4_iomap_ordered_io, ext4_iomap_ordered_submit,
+	TP_PROTO(struct inode *inode, loff_t io_offset, size_t io_size,
+		 ext4_lblk_t i_ordered_lblk, unsigned int i_ordered_len),
+	TP_ARGS(inode, io_offset, io_size, i_ordered_lblk, i_ordered_len)
+);
+
+DEFINE_EVENT(ext4_iomap_ordered_io, ext4_iomap_ordered_complete,
+	TP_PROTO(struct inode *inode, loff_t io_offset, size_t io_size,
+		 ext4_lblk_t i_ordered_lblk, unsigned int i_ordered_len),
+	TP_ARGS(inode, io_offset, io_size, i_ordered_lblk, i_ordered_len)
+);
+
+
+/* i_disksize update tracepoint */
+TRACE_EVENT(ext4_iomap_disksize_update,
+	TP_PROTO(struct inode *inode, loff_t end, loff_t i_size,
+		 loff_t i_disksize, loff_t new_disksize, bool is_ordered),
+	TP_ARGS(inode, end, i_size, i_disksize, new_disksize, is_ordered),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, ino)
+		__field(loff_t, end)
+		__field(loff_t, i_size)
+		__field(loff_t, i_disksize)
+		__field(loff_t, new_disksize)
+		__field(bool, is_ordered)
+	),
+	TP_fast_assign(
+		__entry->dev = inode->i_sb->s_dev;
+		__entry->ino = inode->i_ino;
+		__entry->end = end;
+		__entry->i_size = i_size;
+		__entry->i_disksize = i_disksize;
+		__entry->new_disksize = new_disksize;
+		__entry->is_ordered = is_ordered;
+	),
+	TP_printk("dev %d:%d ino %llu end %lld i_size %lld i_disksize %lld new_disksize %lld is_ordered %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino, __entry->end, __entry->i_size,
+		  __entry->i_disksize, __entry->new_disksize,
+		  __entry->is_ordered)
+);
+
+/* Block zero EOF tracepoint */
+TRACE_EVENT(ext4_block_zero_eof,
+	TP_PROTO(struct inode *inode, loff_t from, loff_t length,
+		 bool did_zero, bool zero_written),
+	TP_ARGS(inode, from, length, did_zero, zero_written),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, ino)
+		__field(loff_t, from)
+		__field(loff_t, length)
+		__field(bool, did_zero)
+		__field(bool, zero_written)
+	),
+	TP_fast_assign(
+		__entry->dev = inode->i_sb->s_dev;
+		__entry->ino = inode->i_ino;
+		__entry->from = from;
+		__entry->length = length;
+		__entry->did_zero = did_zero;
+		__entry->zero_written = zero_written;
+	),
+	TP_printk("dev %d:%d ino %llu zero EOF from %lld length %lld did_zero %d zero_written %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino, __entry->from, __entry->length,
+		  __entry->did_zero, __entry->zero_written)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.52.0



* [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (20 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  2026-05-11  7:23 ` [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Introduce ext4_enable_buffered_iomap() to determine whether a regular
file inode should use the iomap buffered I/O path. The path supports the
default filesystem features, the default mount options, and the bigalloc
feature. Inline data, fsverity, fscrypt, non-extent (indirect-block)
inodes, and data=journal mode are not supported yet.

The decision is made at inode initialization time in __ext4_new_inode()
and __ext4_iget() by setting the EXT4_STATE_BUFFERED_IOMAP state flag.
If any of these unsupported features is in use, the inode silently falls
back to the traditional buffer_head path. Switching the buffered I/O
path on an active inode is not supported, with the exception of changing
a per-inode journal flag.

For features like encryption, verity, and inline data that can be
dynamically enabled at the superblock level, checking the global feature
flag avoids the complexity of toggling the path on individual inodes.

Additionally:

 - Extend ext4_inode_journal_mode() to force ordered mode for inodes
   using the iomap path under a data=journal mount. For the global data
   journal mode (EXT4_MOUNT_JOURNAL_DATA), dynamic enablement is
   deferred until the next inode re-initialization. For the per-inode
   data journal mode (EXT4_INODE_JOURNAL_DATA), dynamic changes take
   effect immediately, as it is safe to switch address_space operations
   and drop all page cache under i_rwsem and filemap_invalidate_lock.

 - Add WARN_ON_ONCE() guards in _ext4_get_block() and
   ext4_do_writepages() to catch inodes on the iomap path that
   accidentally enter the legacy buffer_head path, and the converse
   guards in ext4_iomap_map_blocks() and ext4_iomap_writepages().

 - Reject extent-to-indirect migration via ext4_ind_migrate() for inodes
   on the iomap path.
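
The eligibility decision described above can be sketched as a small
user-space model. It mirrors the order of checks in
ext4_enable_buffered_iomap() as added by this patch, but the struct
layouts and model_* names are hypothetical stand-ins for the superblock
feature flags and inode flags, not ext4 types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock-level state that is checked. */
struct model_sb {
	bool inline_data;        /* inline_data feature enabled */
	bool verity;             /* verity feature enabled */
	bool encrypt;            /* encrypt feature enabled */
	bool data_journal_mount; /* mounted with data=journal */
};

/* Hypothetical stand-in for the per-inode state that is checked. */
struct model_inode {
	bool is_regular;         /* S_ISREG() */
	bool is_ea_inode;        /* EXT4_INODE_EA_INODE */
	bool journal_data_flag;  /* EXT4_INODE_JOURNAL_DATA */
	bool uses_extents;       /* EXT4_INODE_EXTENTS */
};

/* Returns true when the inode would get EXT4_STATE_BUFFERED_IOMAP. */
bool model_wants_buffered_iomap(const struct model_sb *sb,
				const struct model_inode *inode)
{
	assert(sb != NULL && inode != NULL);

	if (!inode->is_regular || inode->is_ea_inode)
		return false;
	/* Globally enabled features keep every inode off the iomap path. */
	if (sb->inline_data || sb->verity || sb->encrypt)
		return false;
	if (sb->data_journal_mount || inode->journal_data_flag)
		return false;
	/* Indirect-block (non-extent) inodes stay on buffer_head. */
	return inode->uses_extents;
}
```

Note that the feature checks are deliberately coarse: one encrypted
file anywhere on the filesystem is enough to keep all regular files on
the buffer_head path, which is the trade-off the commit message makes
to avoid per-inode toggling.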

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h      |  1 +
 fs/ext4/ext4_jbd2.c |  8 +++--
 fs/ext4/ialloc.c    |  1 +
 fs/ext4/inode.c     | 77 +++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/migrate.c   |  2 ++
 5 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a3bb44f1e6e..afba952abd28 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3068,6 +3068,7 @@ int ext4_walk_page_buffers(handle_t *handle,
 int do_journal_get_write_access(handle_t *handle, struct inode *inode,
 				struct buffer_head *bh);
 void ext4_set_inode_mapping_order(struct inode *inode);
+void ext4_enable_buffered_iomap(struct inode *inode);
 int ext4_nonda_switch(struct super_block *sb);
 #define FALL_BACK_TO_NONDELALLOC 1
 #define CONVERT_INLINE_DATA	 2
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 9a8c225f2753..4534cf6f5e76 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -17,8 +17,12 @@ int ext4_inode_journal_mode(struct inode *inode)
 	    test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
 	    (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
 	    !test_opt(inode->i_sb, DELALLOC))) {
-		/* We do not support data journalling for encrypted data */
-		if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
+		/*
+		 * We do not support data journalling for encrypted data
+		 * and buffered IOMAP path.
+		 */
+		if (S_ISREG(inode->i_mode) &&
+		    (IS_ENCRYPTED(inode) || ext4_inode_buffered_iomap(inode)))
 			return EXT4_INODE_ORDERED_DATA_MODE;  /* ordered */
 		return EXT4_INODE_JOURNAL_DATA_MODE;	/* journal data */
 	}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 3fd8f0099852..ea64b9e9e382 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1340,6 +1340,7 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
 		}
 	}
 
+	ext4_enable_buffered_iomap(inode);
 	ext4_set_inode_mapping_order(inode);
 
 	ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cf83b4e619e0..0407e7b54dcd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -918,6 +918,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 
 	if (ext4_has_inline_data(inode))
 		return -ERANGE;
+	/* inode using the iomap buffered I/O path should not go here. */
+	if (WARN_ON_ONCE(ext4_inode_buffered_iomap(inode)))
+		return -EINVAL;
 
 	map.m_lblk = iblock;
 	map.m_len = bh->b_size >> inode->i_blkbits;
@@ -2797,6 +2800,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		goto out_writepages;
 
+	/* inode using the iomap buffered I/O path should not go here. */
+	if (WARN_ON_ONCE(ext4_inode_buffered_iomap(inode))) {
+		ret = -EINVAL;
+		goto out_writepages;
+	}
+
 	/*
 	 * If the filesystem has aborted, it is read-only, so return
 	 * right away instead of dumping stack traces later on that
@@ -3929,6 +3938,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
 {
 	u8 blkbits = inode->i_blkbits;
 
+	/* inode using the buffer_head buffered I/O path should not go here. */
+	if (WARN_ON_ONCE(!ext4_inode_buffered_iomap(inode)))
+		return -EINVAL;
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
 
@@ -4406,6 +4418,10 @@ static int ext4_iomap_writepages(struct address_space *mapping,
 		.ops = &ext4_writeback_ops,
 	};
 
+	/* inode using the buffer_head buffered I/O path should not go here. */
+	if (WARN_ON_ONCE(!ext4_inode_buffered_iomap(inode)))
+		return -EINVAL;
+
 	ret = ext4_emergency_state(sb);
 	if (unlikely(ret))
 		return ret;
@@ -5864,6 +5880,59 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
 	return -EFSCORRUPTED;
 }
 
+/*
+ * Determine whether an inode should use the iomap buffered I/O path.
+ * EXT4_STATE_BUFFERED_IOMAP is generally set at inode initialization
+ * time. Online switching of the buffered I/O path on an active inode is
+ * NOT supported, with the exception of changing a per-inode journal
+ * flag.
+ *
+ * For features like inline data, fsverity, and encryption that can be
+ * dynamically enabled or disabled, we check the superblock-level
+ * feature flags. If any of these is globally enabled, no inode is
+ * allowed into the iomap buffered I/O path. This avoids the complexity
+ * of dynamic toggling.
+ *
+ * For the global data journal mode (EXT4_MOUNT_JOURNAL_DATA), dynamic
+ * change through remount is deferred. It will only become available
+ * after the inode is re-initialized (i.e., after the last reference
+ * drops and the inode is re-read from disk with the journal flag
+ * cleared).
+ *
+ * For the per-inode data journal mode (EXT4_INODE_JOURNAL_DATA),
+ * dynamic changes take effect immediately. This is safe because
+ * address_space operations can be switched and all page cache can be
+ * dropped under i_rwsem and filemap_invalidate_lock.
+ *
+ * For extent-to-indirect block migration (via EXT4_IOC_SETFLAGS
+ * clearing EXT4_EXTENTS_FL), this operation is directly rejected for
+ * inodes using the iomap path.
+ */
+void ext4_enable_buffered_iomap(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (!S_ISREG(inode->i_mode))
+		return;
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
+		return;
+
+	/* Unsupported Features */
+	if (ext4_has_feature_inline_data(sb))
+		return;
+	if (ext4_has_feature_verity(sb))
+		return;
+	if (ext4_has_feature_encrypt(sb))
+		return;
+	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
+	    ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
+		return;
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return;
+
+	ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+}
+
 void ext4_set_inode_mapping_order(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
@@ -6149,6 +6218,8 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	if (ret)
 		goto bad_inode;
 
+	ext4_enable_buffered_iomap(inode);
+
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
@@ -7326,9 +7397,10 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	 * the inode's in-core data-journaling state flag now.
 	 */
 
-	if (val)
+	if (val) {
 		ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
-	else {
+		ext4_clear_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+	} else {
 		err = jbd2_journal_flush(journal, 0);
 		if (err < 0) {
 			jbd2_journal_unlock_updates(journal);
@@ -7337,6 +7409,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 			return err;
 		}
 		ext4_clear_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
+		ext4_enable_buffered_iomap(inode);
 	}
 	ext4_set_aops(inode);
 	ext4_set_inode_mapping_order(inode);
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 477d43d7e294..3b49ecf427ae 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -620,6 +620,8 @@ int ext4_ind_migrate(struct inode *inode)
 
 	if (ext4_has_feature_bigalloc(inode->i_sb))
 		return -EOPNOTSUPP;
+	if (ext4_inode_buffered_iomap(inode))
+		return -EOPNOTSUPP;
 
 	/*
 	 * In order to get correct extent info, force all delayed allocation
-- 
2.52.0



* [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path
  2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
                   ` (21 preceding siblings ...)
  2026-05-11  7:23 ` [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
@ 2026-05-11  7:23 ` Zhang Yi
  22 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  7:23 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Since the iomap buffered I/O path does not yet support all existing ext4
features, it cannot be enabled by default. Introduce the
'buffered_iomap' and 'nobuffered_iomap' mount options to explicitly
enable or disable the iomap buffered I/O path for regular files.

Toggling this option via remount is allowed, but the change of I/O path
is deferred. For each inode, the new setting takes effect only after the
inode is re-initialized (i.e., after the last reference is dropped and
the inode is re-read from disk).
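
The deferred behaviour can be illustrated with a user-space model. The
bit value is the EXT4_MOUNT2_BUFFERED_IOMAP value from this patch;
everything else (struct layouts, model_* names) is a hypothetical
simplification that ignores the other eligibility checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit value as EXT4_MOUNT2_BUFFERED_IOMAP in this patch. */
#define MODEL_MOUNT2_BUFFERED_IOMAP 0x00000200u

struct model_sb {
	uint32_t mount_opt2;     /* models the second mount option word */
};

struct model_inode {
	bool buffered_iomap;     /* models EXT4_STATE_BUFFERED_IOMAP */
};

/* Models a remount with buffered_iomap (set) or nobuffered_iomap (clear). */
void model_remount(struct model_sb *sb, bool enable)
{
	assert(sb != NULL);
	if (enable)
		sb->mount_opt2 |= MODEL_MOUNT2_BUFFERED_IOMAP;
	else
		sb->mount_opt2 &= ~MODEL_MOUNT2_BUFFERED_IOMAP;
}

/*
 * Models inode initialization: the mount option is sampled once here,
 * so a later remount does not change an already initialized inode.
 */
void model_inode_init(const struct model_sb *sb, struct model_inode *inode)
{
	assert(sb != NULL && inode != NULL);
	inode->buffered_iomap =
		(sb->mount_opt2 & MODEL_MOUNT2_BUFFERED_IOMAP) != 0;
}
```

In this model, an inode initialized before a remount keeps its old path
until model_inode_init() runs on it again, which corresponds to the
inode being evicted and re-read from disk.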

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h  | 1 +
 fs/ext4/inode.c | 6 ++++++
 fs/ext4/super.c | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index afba952abd28..33da8c1915a7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1290,6 +1290,7 @@ struct ext4_inode_info {
 						    * scanning in mballoc
 						    */
 #define EXT4_MOUNT2_ABORT		0x00000100 /* Abort filesystem */
+#define EXT4_MOUNT2_BUFFERED_IOMAP	0x00000200 /* Use iomap for buffered I/O */
 
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0407e7b54dcd..1432ef29748b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5907,11 +5907,17 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
  * For extent-to-indirect block migration (via EXT4_IOC_SETFLAGS
  * clearing EXT4_EXTENTS_FL), this operation is directly rejected for
  * inodes using the iomap path.
+ *
+ * When remounting to toggle the buffered_iomap mount option, the change
+ * of I/O path is deferred as well; it takes effect after the inode is
+ * re-initialized.
  */
 void ext4_enable_buffered_iomap(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
+	if (!test_opt2(sb, BUFFERED_IOMAP))
+		return;
 	if (!S_ISREG(inode->i_mode))
 		return;
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9c0a00e716f3..2fc07739c9e8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1733,6 +1733,7 @@ enum {
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
+	Opt_buffered_iomap, Opt_nobuffered_iomap,
 	Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
 #ifdef CONFIG_EXT4_DEBUG
 	Opt_fc_debug_max_replay, Opt_fc_debug_force
@@ -1871,6 +1872,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	fsparam_flag	("no_prefetch_block_bitmaps",
 						Opt_no_prefetch_block_bitmaps),
 	fsparam_s32	("mb_optimize_scan",	Opt_mb_optimize_scan),
+	fsparam_flag	("buffered_iomap",	Opt_buffered_iomap),
+	fsparam_flag	("nobuffered_iomap",	Opt_nobuffered_iomap),
 	fsparam_string	("check",		Opt_removed),	/* mount option from ext2/3 */
 	fsparam_flag	("nocheck",		Opt_removed),	/* mount option from ext2/3 */
 	fsparam_flag	("reservation",		Opt_removed),	/* mount option from ext2/3 */
@@ -1964,6 +1967,10 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_buffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
+	{Opt_nobuffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
-- 
2.52.0



* Re: [PATCH v4 11/23] iomap: correct the range of a partial dirty clear
  2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
@ 2026-05-11  7:46   ` Christoph Hellwig
  2026-05-11  8:57     ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Hellwig @ 2026-05-11  7:46 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

Please send the iomap patches out separately, including to all the
relevant lists from the iomap MAINTAINERS entry.



* Re: [PATCH v4 11/23] iomap: correct the range of a partial dirty clear
  2026-05-11  7:46   ` Christoph Hellwig
@ 2026-05-11  8:57     ` Zhang Yi
  0 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-05-11  8:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, yi.zhang,
	yizhang089, yangerkun, yukuai

On 5/11/2026 3:46 PM, Christoph Hellwig wrote:
> Please send the iomap patches out separately, including to all the
> relevant lists from the iomap MAINTAINERS entry.
> 

OK, sure, will do.

Best Regards,
Yi



* Re: [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr()
  2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
@ 2026-05-19  5:24   ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19  5:24 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:21PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The logic for updating the file size in ext4_setattr() is currently
> somewhat messy. By directly entering the error-handling path after
> failing to add an orphan inode, the unnecessary recovery process
> involving old_disksize and the file size can be avoided.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks good, feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> ---
>  fs/ext4/inode.c | 22 +++++++++-------------
>  1 file changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..0751dc55e94f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5953,7 +5953,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (attr->ia_valid & ATTR_SIZE) {
>  		handle_t *handle;
>  		loff_t oldsize = inode->i_size;
> -		loff_t old_disksize;
>  		int shrink = (attr->ia_size < inode->i_size);
>  
>  		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> @@ -6037,6 +6036,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  			if (ext4_handle_valid(handle) && shrink) {
>  				error = ext4_orphan_add(handle, inode);
>  				orphan = 1;
> +				if (error)
> +					goto out_handle;
>  			}
>  
>  			if (shrink)
> @@ -6052,23 +6053,18 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
>  					inode->i_sb->s_blocksize_bits);
>  
> -			down_write(&EXT4_I(inode)->i_data_sem);
> -			old_disksize = EXT4_I(inode)->i_disksize;
> -			EXT4_I(inode)->i_disksize = attr->ia_size;
> -
>  			/*
>  			 * We have to update i_size under i_data_sem together
>  			 * with i_disksize to avoid races with writeback code
> -			 * running ext4_wb_update_i_disksize().
> +			 * updating disksize in mpage_map_and_submit_extent().
>  			 */
> -			if (!error)
> -				i_size_write(inode, attr->ia_size);
> -			else
> -				EXT4_I(inode)->i_disksize = old_disksize;
> +			down_write(&EXT4_I(inode)->i_data_sem);
> +			i_size_write(inode, attr->ia_size);
> +			EXT4_I(inode)->i_disksize = attr->ia_size;
>  			up_write(&EXT4_I(inode)->i_data_sem);
> -			rc = ext4_mark_inode_dirty(handle, inode);
> -			if (!error)
> -				error = rc;
> +
> +			error = ext4_mark_inode_dirty(handle, inode);
> +out_handle:
>  			ext4_journal_stop(handle);
>  			if (error)
>  				goto out_mmap_sem;
> -- 
> 2.52.0
> 


* Re: [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]()
  2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
@ 2026-05-19  6:05   ` Ojaswin Mujoo
  2026-06-16  9:31   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19  6:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:22PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor ext4_setattr() by introducing two helper functions,
> ext4_truncate_up() and ext4_truncate_down(), to handle size changes. The
> current ATTR_SIZE processing consolidates checks for both shrinking and
> non-shrinking cases, leading to cluttered code. Separating the
> truncation paths improves readability.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>


Looks good to me Zhang:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>


> ---
>  fs/ext4/inode.c | 199 +++++++++++++++++++++++++++---------------------
>  1 file changed, 112 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0751dc55e94f..35e958f89bd5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5855,6 +5855,112 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
>  	}
>  }
>  
> +/*
> + * Set i_size and i_disksize to 'newsize'.
> + *
> + * Both i_rwsem and i_data_sem are required here to avoid races between
> + * generic append writeback and concurrent truncate that also modify
> + * i_size and i_disksize.
> + */
> +static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
> +{
> +	WARN_ON_ONCE(S_ISREG(inode->i_mode) && !inode_is_locked(inode));
> +
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	i_size_write(inode, newsize);
> +	EXT4_I(inode)->i_disksize = newsize;
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_truncate_up(struct inode *inode, loff_t oldsize, loff_t newsize)
> +{
> +	ext4_lblk_t old_lblk, new_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	if (!IS_ALIGNED(oldsize | newsize, i_blocksize(inode))) {
> +		ret = ext4_inode_attach_jinode(inode);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	if (!IS_ALIGNED(oldsize, i_blocksize(inode))) {
> +		ret = ext4_block_zero_eof(inode, oldsize, LLONG_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	old_lblk = oldsize > 0 ? (oldsize - 1) >> inode->i_blkbits : 0;
> +	new_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, old_lblk, new_lblk);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * isize extend must be called outside an active handle due to
> +	 * the lock ordering of transaction start and folio lock in the
> +	 * iomap buffered I/O path (folio lock -> transaction start).
> +	 */
> +	pagecache_isize_extended(inode, oldsize, newsize);
> +	return 0;
> +}
> +
> +static int ext4_truncate_down(struct inode *inode, loff_t oldsize,
> +			      loff_t newsize, int *orphan)
> +{
> +	ext4_lblk_t start_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	/* Do not change i_size. */
> +	if (newsize == oldsize)
> +		goto truncate;
> +
> +	/* Shrink. */
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	if (ext4_handle_valid(handle)) {
> +		ret = ext4_orphan_add(handle, inode);
> +		*orphan = 1;
> +		if (ret) {
> +			ext4_journal_stop(handle);
> +			return ret;
> +		}
> +	}
> +
> +	start_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, start_lblk, EXT_MAX_BLOCKS - 1);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +
> +	if (ext4_should_journal_data(inode))
> +		ext4_wait_for_tail_page_commit(inode);
> +truncate:
> +	/*
> +	 * Truncate pagecache after we've waited for commit in data=journal
> +	 * mode to make pages freeable.  Call ext4_truncate() even if
> +	 * i_size didn't change to truncate possible preallocated blocks.
> +	 */
> +	truncate_pagecache(inode, newsize);
> +	return ext4_truncate(inode);
> +}
> +
>  /*
>   * ext4_setattr()
>   *
> @@ -5951,7 +6057,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	}
>  
>  	if (attr->ia_valid & ATTR_SIZE) {
> -		handle_t *handle;
>  		loff_t oldsize = inode->i_size;
>  		int shrink = (attr->ia_size < inode->i_size);
>  
> @@ -6003,94 +6108,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  			goto err_out;
>  		}
>  
> -		if (attr->ia_size != inode->i_size) {
> -			/* attach jbd2 jinode for EOF folio tail zeroing */
> -			if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
> -			    oldsize & (inode->i_sb->s_blocksize - 1)) {
> -				error = ext4_inode_attach_jinode(inode);
> -				if (error)
> -					goto out_mmap_sem;
> -			}
> -
> -			/*
> -			 * Update c/mtime and tail zero the EOF folio on
> -			 * truncate up. ext4_truncate() handles the shrink case
> -			 * below.
> -			 */
> -			if (!shrink) {
> -				inode_set_mtime_to_ts(inode,
> -						      inode_set_ctime_current(inode));
> -				if (oldsize & (inode->i_sb->s_blocksize - 1)) {
> -					error = ext4_block_zero_eof(inode,
> -							oldsize, LLONG_MAX);
> -					if (error)
> -						goto out_mmap_sem;
> -				}
> -			}
> -
> -			handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> -			if (IS_ERR(handle)) {
> -				error = PTR_ERR(handle);
> -				goto out_mmap_sem;
> -			}
> -			if (ext4_handle_valid(handle) && shrink) {
> -				error = ext4_orphan_add(handle, inode);
> -				orphan = 1;
> -				if (error)
> -					goto out_handle;
> -			}
> -
> -			if (shrink)
> -				ext4_fc_track_range(handle, inode,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits,
> -					EXT_MAX_BLOCKS - 1);
> -			else
> -				ext4_fc_track_range(
> -					handle, inode,
> -					(oldsize > 0 ? oldsize - 1 : oldsize) >>
> -					inode->i_sb->s_blocksize_bits,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits);
> -
> -			/*
> -			 * We have to update i_size under i_data_sem together
> -			 * with i_disksize to avoid races with writeback code
> -			 * updating disksize in mpage_map_and_submit_extent().
> -			 */
> -			down_write(&EXT4_I(inode)->i_data_sem);
> -			i_size_write(inode, attr->ia_size);
> -			EXT4_I(inode)->i_disksize = attr->ia_size;
> -			up_write(&EXT4_I(inode)->i_data_sem);
> -
> -			error = ext4_mark_inode_dirty(handle, inode);
> -out_handle:
> -			ext4_journal_stop(handle);
> -			if (error)
> -				goto out_mmap_sem;
> -			if (!shrink) {
> -				pagecache_isize_extended(inode, oldsize,
> -							 inode->i_size);
> -			} else if (ext4_should_journal_data(inode)) {
> -				ext4_wait_for_tail_page_commit(inode);
> -			}
> +		if (attr->ia_size > oldsize)
> +			error = ext4_truncate_up(inode, oldsize, attr->ia_size);
> +		else {
> +			/* Shrink or do not change i_size. */
> +			error = ext4_truncate_down(inode, oldsize,
> +						   attr->ia_size, &orphan);
>  		}
>  
> -		/*
> -		 * Truncate pagecache after we've waited for commit
> -		 * in data=journal mode to make pages freeable.
> -		 */
> -		truncate_pagecache(inode, inode->i_size);
> -		/*
> -		 * Call ext4_truncate() even if i_size didn't change to
> -		 * truncate possible preallocated blocks.
> -		 */
> -		if (attr->ia_size <= oldsize) {
> -			rc = ext4_truncate(inode);
> -			if (rc)
> -				error = rc;
> -		}
> -out_mmap_sem:
>  		filemap_invalidate_unlock(inode->i_mapping);
>  	}
>  
> -- 
> 2.52.0
> 


* Re: [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr()
  2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
@ 2026-05-19  6:08   ` Ojaswin Mujoo
  2026-06-16  9:36   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19  6:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:23PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor the error handling in ext4_setattr() for better clarity:
> 
>  - Return directly on ext4_break_layouts() failure.
>  - Propagate ext4_truncate() errors using the existing error variable
>    and jump to the common 'err_out' label.
>  - Propagate posix_acl_chmod() errors also through the error variable,
>    as it theoretically does not return a non-fatal error.
> 
> With these changes, every error path either returns immediately or jumps
> to err_out. Consequently, the "if (!error)" condition guarding
> setattr_copy() and mark_inode_dirty() becomes unreachable for error
> cases. Remove this redundant check, along with the now-unused rc
> variable.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>


Looks good to me:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 35e958f89bd5..b1ef706987c3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5989,7 +5989,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		 struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> -	int error, rc = 0;
> +	int error;
>  	int orphan = 0;
>  	const unsigned int ia_valid = attr->ia_valid;
>  	bool inc_ivers = true;
> @@ -6102,10 +6102,10 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  
>  		filemap_invalidate_lock(inode->i_mapping);
>  
> -		rc = ext4_break_layouts(inode);
> -		if (rc) {
> +		error = ext4_break_layouts(inode);
> +		if (error) {
>  			filemap_invalidate_unlock(inode->i_mapping);
> -			goto err_out;
> +			return error;
>  		}
>  
>  		if (attr->ia_size > oldsize)
> @@ -6117,15 +6117,19 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		}
>  
>  		filemap_invalidate_unlock(inode->i_mapping);
> +		if (error)
> +			goto err_out;
>  	}
>  
> -	if (!error) {
> -		if (inc_ivers)
> -			inode_inc_iversion(inode);
> -		setattr_copy(idmap, inode, attr);
> -		mark_inode_dirty(inode);
> -	}
> +	if (inc_ivers)
> +		inode_inc_iversion(inode);
> +	setattr_copy(idmap, inode, attr);
> +	mark_inode_dirty(inode);
>  
> +	if (ia_valid & ATTR_MODE)
> +		error = posix_acl_chmod(idmap, dentry, inode->i_mode);
> +
> +err_out:
>  	/*
>  	 * If the call to ext4_truncate failed to get a transaction handle at
>  	 * all, we need to clean up the in-core orphan list manually.
> @@ -6133,14 +6137,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (orphan && inode->i_nlink)
>  		ext4_orphan_del(NULL, inode);
>  
> -	if (!error && (ia_valid & ATTR_MODE))
> -		rc = posix_acl_chmod(idmap, dentry, inode->i_mode);
> -
> -err_out:
> -	if  (error)
> +	if (error)
>  		ext4_std_error(inode->i_sb, error);
> -	if (!error)
> -		error = rc;
>  	return error;
>  }
>  
> -- 
> 2.52.0
> 

* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
@ 2026-05-19  9:28   ` Ojaswin Mujoo
  2026-05-19 12:35     ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19  9:28 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce initial support for iomap in the buffered I/O path for regular
> files on ext4.
> 
>   - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
>     inode uses iomap instead of buffer_head for buffered I/O
>   - Add helper ext4_inode_buffered_iomap() to check the flag
>   - Add new address space operations ext4_iomap_aops with callbacks that
>     will use generic iomap implementations
>   - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
> 
> The following callbacks(read_folio(), readahead(), writepages()) are
> provided as placeholders and will be implemented in later patches.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Hi Zhang, looks good to me. Just a question below:
> ---
>  fs/ext4/ext4.h  |  7 +++++++
>  fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..1e27d73d7427 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1972,6 +1972,7 @@ enum {
>  	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>  	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
>  	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
>  		!list_empty(&EXT4_I(inode)->i_orphan);
>  }
>  
> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> +{
> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> +}
> +
>  /*
>   * Codes for operating systems
>   */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b1ef706987c3..178ac2be37b7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> +{
> +	return 0;
> +}
> +
> +static void ext4_iomap_readahead(struct readahead_control *rac)
> +{
> +
> +}
> +
> +static int ext4_iomap_writepages(struct address_space *mapping,
> +				 struct writeback_control *wbc)
> +{
> +	return 0;
> +}
> +
>  /*
>   * For data=journal mode, folio should be marked dirty only when it was
>   * writeably mapped. When that happens, it was already attached to the
> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
>  	.swap_activate		= ext4_iomap_swap_activate,
>  };
>  
> +static const struct address_space_operations ext4_iomap_aops = {
> +	.read_folio		= ext4_iomap_read_folio,
> +	.readahead		= ext4_iomap_readahead,
> +	.writepages		= ext4_iomap_writepages,
> +	.dirty_folio		= iomap_dirty_folio,
> +	.bmap			= ext4_bmap,
> +	.invalidate_folio	= iomap_invalidate_folio,
> +	.release_folio		= iomap_release_folio,
> +	.migrate_folio		= filemap_migrate_folio,
> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
> +	.error_remove_folio	= generic_error_remove_folio,
> +	.swap_activate		= ext4_iomap_swap_activate,
> +};

So one question: for ->release_folio() we are using
iomap_release_folio() instead of ext4_release_folio() here, which doesn't
make the jbd2_journal_try_to_free_buffers() call. IIUC that call
tries to clean up already checkpointed buffers.

I wanted to check whether ->release_folio() can be called for folios with
ext4 metadata buffers (from my limited understanding of
shrink_folio_list() -> filemap_release_folio(), it seems it can). And if
it can be called, is it okay to skip the
jbd2_journal_try_to_free_buffers() call?

Regards,
ojaswin

> +
>  static const struct address_space_operations ext4_dax_aops = {
>  	.writepages		= ext4_dax_writepages,
>  	.dirty_folio		= noop_dirty_folio,
> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
>  	}
>  	if (IS_DAX(inode))
>  		inode->i_mapping->a_ops = &ext4_dax_aops;
> +	else if (ext4_inode_buffered_iomap(inode))
> +		inode->i_mapping->a_ops = &ext4_iomap_aops;
>  	else if (test_opt(inode->i_sb, DELALLOC))
>  		inode->i_mapping->a_ops = &ext4_da_aops;
>  	else
> -- 
> 2.52.0
> 

* Re: [PATCH v4 05/23] ext4: implement buffered read path using iomap
  2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
@ 2026-05-19 10:00   ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19 10:00 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:25PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Implement the iomap read path for ext4 by introducing a new
> ext4_iomap_buffered_read_ops instance. This provides the read_folio()
> and readahead() callbacks for ext4_iomap_aops. The implementation
> introduces:
> 
>  - ext4_iomap_map_blocks(): Helper function to query extent mappings for
>    a given read range using ext4_map_blocks() and convert the mapping
>    information to iomap type
>  - ext4_iomap_buffered_read_begin(): The iomap_begin callback that maps
>    blocks, validates filesystem state, and populates the iomap. It
>    returns -ERANGE for inline data which is not yet supported.
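
As a worked example of the block-range arithmetic in
ext4_iomap_map_blocks(), here is a minimal user-space sketch. The helper
name byte_range_to_blocks(), the struct blk_range type and the
MAX_LOGICAL_BLOCK value are assumptions made for illustration and are not
part of the patch; only the shift-and-clamp arithmetic follows the quoted
code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stands in for EXT4_MAX_LOGICAL_BLOCK. */
#define MAX_LOGICAL_BLOCK 0xFFFFFFFEu

/* Hypothetical result type: first logical block and block count. */
struct blk_range {
	uint32_t lblk;
	uint32_t len;
};

/*
 * Convert a byte range into a logical block range.
 * Returns 0 on success, -1 if the range starts beyond the last
 * addressable block. The caller must pass length > 0.
 */
static int byte_range_to_blocks(int64_t offset, int64_t length,
				unsigned int blkbits, struct blk_range *r)
{
	int64_t first, last;

	assert(offset >= 0 && length > 0);

	first = offset >> blkbits;
	if (first > (int64_t)MAX_LOGICAL_BLOCK)
		return -1;

	/* Last block touched by the range, clamped to the maximum. */
	last = (offset + length - 1) >> blkbits;
	if (last > (int64_t)MAX_LOGICAL_BLOCK)
		last = (int64_t)MAX_LOGICAL_BLOCK;

	r->lblk = (uint32_t)first;
	r->len = (uint32_t)(last - first + 1);
	return 0;
}
```

With 4 KiB blocks (blkbits = 12), offset 5000 and length 10000 cover bytes
5000..14999, i.e. logical blocks 1 through 3, so lblk = 1 and len = 3.
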
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks good, feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 178ac2be37b7..6c4d9137b279 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3908,14 +3908,57 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> +static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> +		loff_t length, struct ext4_map_blocks *map)
> +{
> +	u8 blkbits = inode->i_blkbits;
> +
> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> +		return -EINVAL;
> +
> +	/* Calculate the first and last logical blocks respectively. */
> +	map->m_lblk = offset >> blkbits;
> +	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> +			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> +
> +	return ext4_map_blocks(NULL, inode, map, 0);
> +}
> +
> +static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> +		loff_t length, unsigned int flags, struct iomap *iomap,
> +		struct iomap *srcmap)
> +{
> +	struct ext4_map_blocks map;
> +	int ret;
> +
> +	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
> +		return -EIO;
> +
> +	/* Inline data support is not yet available. */
> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> +		return -ERANGE;
> +
> +	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> +	if (ret < 0)
> +		return ret;
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_buffered_read_ops = {
> +	.iomap_begin = ext4_iomap_buffered_read_begin,
> +};
> +
>  static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
>  {
> +	iomap_bio_read_folio(folio, &ext4_iomap_buffered_read_ops);
>  	return 0;
>  }
>  
>  static void ext4_iomap_readahead(struct readahead_control *rac)
>  {
> -
> +	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>  }
>  
>  static int ext4_iomap_writepages(struct address_space *mapping,
> -- 
> 2.52.0
> 

* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
  2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
@ 2026-05-19 10:02   ` Ojaswin Mujoo
  2026-06-16 10:04   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19 10:02 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:26PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The iomap buffered write path does not hold any locks between querying
> inode extent mapping information and performing buffered writes. It
> relies on the sequence counter saved in the inode to detect stale
> mappings.
> 
> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> blocks") added the m_seq field to ext4_map_blocks to pass out extent
> sequence numbers, but it missed two callsites within
> ext4_da_map_blocks(). These callsites are on the delayed allocation
> path, which is also used in the iomap buffered write path. Pass out the
> sequence counter to ensure stale mappings can be detected.
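
The stale-mapping check described above can be sketched in user space as
follows. This is only a toy model: struct toy_inode, struct toy_map and
the toy_*() helpers are hypothetical names, and the real code keeps the
counter in ext4's own structures and passes it out through m_seq.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode extent state. */
struct toy_inode {
	uint64_t seq;	/* bumped whenever the mapping changes */
	int64_t pblk;	/* physical block backing logical block 0 */
};

/* Hypothetical lookup result; seq plays the role of m_seq. */
struct toy_map {
	int64_t pblk;
	uint64_t seq;	/* counter value observed at lookup time */
};

/* Lookup without holding any lock across the later write. */
static void toy_lookup(const struct toy_inode *inode, struct toy_map *map)
{
	assert(inode && map);
	map->pblk = inode->pblk;
	map->seq = inode->seq;
}

/* Any change to the mapping invalidates earlier lookups. */
static void toy_remap(struct toy_inode *inode, int64_t new_pblk)
{
	assert(inode);
	inode->pblk = new_pblk;
	inode->seq++;
}

/*
 * Returns the block to write to, or -1 if the cached mapping is stale
 * and the caller has to look it up again.
 */
static int64_t toy_write_target(const struct toy_inode *inode,
				const struct toy_map *map)
{
	assert(inode && map);
	return map->seq == inode->seq ? map->pblk : -1;
}
```

If the lookup did not pass the counter out (the two callsites fixed by
this patch), the comparison above would run against an unset value and
the staleness check could not be trusted.
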
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks good,

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c4d9137b279..39577a6b65b9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	ext4_check_map_extents_env(inode);
>  
>  	/* Lookup extent status tree firstly */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	 * is held in write mode, before inserting a new da entry in
>  	 * the extent status tree.
>  	 */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> -- 
> 2.52.0
> 

* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
@ 2026-05-19 10:41   ` Ojaswin Mujoo
  2026-05-19 13:31     ` Ojaswin Mujoo
  2026-06-16 10:01   ` Jan Kara
  1 sibling, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19 10:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The data=ordered mode introduces two fundamental conflicts with the
> iomap buffered write path, leading to potential deadlocks.
> 
> 1) Lock ordering conflict
>    In the iomap writeback path, each folio is processed sequentially:
>    the folio lock is acquired first, followed by starting a transaction
>    to create block mappings. In data=ordered mode, writeback triggered
>    by the journal commit process may attempt to acquire a folio lock
>    that is already held by iomap. Meanwhile, iomap, under that same
>    folio lock, may start a new transaction and wait for the currently
>    committing transaction to finish, resulting in a deadlock.

Right, makes sense.
> 
> 2) Partial folio submission not supported
>    When block size is smaller than folio size, a folio may contain both
>    mapped and unmapped blocks. In data=ordered mode, if the journal
>    waits for such a folio to be written back while the regular writeback
>    process has already started committing it (with the writeback flag
>    set), mapping the remaining unmapped blocks can deadlock. This is
>    because the writeback flag is cleared only after the entire folio is
>    processed and committed.

Okay so IIUC, if we do end up using iomap with ordered data, there are 2
codepaths with issues here:

txn_commit
  ordered data writeback (say it goes via iomap)
	  folio_lock
		iomap_writeback_folio
			folio_start_writeback
			  iomap_writeback_range
				  ext4_map_block
					  txn_start
						  wait for txn commit - DEADLOCK

Currently we avoid this by having ext4_normal_submit_inode_buffers()
pass can_map = 0, so the journal flush makes sure not to start any txn.

Then we have

txn_commit                          background writeback (via iomap)

                                    folio_lock()
  ordered data writeback
	  folio_lock
			  
                                		iomap_writeback_folio
                                			folio_start_writeback
                                			  iomap_writeback_range
                                				  ext4_map_block
                                					  txn_start
																						  wait for txn commit - DEADLOCK
	  
Currently, this is taken care of because we start the txn before
taking any folio locks or starting writeback, and hence we cannot deadlock.

If the above description makes sense, do you think it'd be good to add
these traces to the commit message? Although these paths seem
obvious once you have looked at them a lot, it took me a good bit of time
to understand which deadlocks you are talking about here :p

Having call traces like the ones above makes it very clear.
> 
> To support data=ordered mode, the iomap core would need two invasive
> changes:
>  - Acquire the transaction handle before locking any folio for
>    writeback.
>  - Support partial folio submission.
> 
> Both changes are complicated and risk performance regressions.
> Therefore, we must avoid using data=ordered mode when converting to the
> iomap path.
> 
> Currently, data=ordered mode is used in three scenarios:
>  - Append write
>  - Post-EOF partial block truncate-up followed by append write
>  - Online defragmentation
> 
> We can address the first two without data=ordered mode:
>  - For append write: always allocate unwritten blocks (i.e. always
>    enable dioread_nolock), preserving the behavior of current
>    extent-type inodes.
>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>    until after the zeroed partial block has been written back.

I'm still going through how the series copes without data=ordered, so I
will get back to you on this in a while.

Thanks,
Ojaswin

> 
> Online defragmentation does not yet support iomap; this can be resolved
> separately in the future.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..26999f173870 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>  
>  static inline int ext4_should_order_data(struct inode *inode)
>  {
> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> +	/*
> +	 * inodes using the iomap buffered I/O path do not use the
> +	 * data=ordered mode.
> +	 */
> +	return !ext4_inode_buffered_iomap(inode) &&
> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>  }
>  
>  static inline int ext4_should_writeback_data(struct inode *inode)
> -- 
> 2.52.0
> 

* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-19  9:28   ` Ojaswin Mujoo
@ 2026-05-19 12:35     ` Zhang Yi
  2026-05-19 16:53       ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-19 12:35 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce initial support for iomap in the buffered I/O path for regular
>> files on ext4.
>>
>>   - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
>>     inode uses iomap instead of buffer_head for buffered I/O
>>   - Add helper ext4_inode_buffered_iomap() to check the flag
>>   - Add new address space operations ext4_iomap_aops with callbacks that
>>     will use generic iomap implementations
>>   - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
>>
>> The following callbacks(read_folio(), readahead(), writepages()) are
>> provided as placeholders and will be implemented in later patches.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Hi Zhang, looks good to me. Just a question below:

Hi, Ojaswin! Thank you for the review of this series.

>> ---
>>  fs/ext4/ext4.h  |  7 +++++++
>>  fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
>>  2 files changed, 39 insertions(+)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 94283a991e5c..1e27d73d7427 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1972,6 +1972,7 @@ enum {
>>  	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>>  	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
>>  	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
>> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
>>  };
>>  
>>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
>> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
>>  		!list_empty(&EXT4_I(inode)->i_orphan);
>>  }
>>  
>> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
>> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
>> +{
>> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
>> +}
>> +
>>  /*
>>   * Codes for operating systems
>>   */
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index b1ef706987c3..178ac2be37b7 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>  	.iomap_begin = ext4_iomap_begin_report,
>>  };
>>  
>> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void ext4_iomap_readahead(struct readahead_control *rac)
>> +{
>> +
>> +}
>> +
>> +static int ext4_iomap_writepages(struct address_space *mapping,
>> +				 struct writeback_control *wbc)
>> +{
>> +	return 0;
>> +}
>> +
>>  /*
>>   * For data=journal mode, folio should be marked dirty only when it was
>>   * writeably mapped. When that happens, it was already attached to the
>> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
>>  	.swap_activate		= ext4_iomap_swap_activate,
>>  };
>>  
>> +static const struct address_space_operations ext4_iomap_aops = {
>> +	.read_folio		= ext4_iomap_read_folio,
>> +	.readahead		= ext4_iomap_readahead,
>> +	.writepages		= ext4_iomap_writepages,
>> +	.dirty_folio		= iomap_dirty_folio,
>> +	.bmap			= ext4_bmap,
>> +	.invalidate_folio	= iomap_invalidate_folio,
>> +	.release_folio		= iomap_release_folio,
>> +	.migrate_folio		= filemap_migrate_folio,
>> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
>> +	.error_remove_folio	= generic_error_remove_folio,
>> +	.swap_activate		= ext4_iomap_swap_activate,
>> +};
> 
> So one question: for ->release_folio() we are using
> iomap_release_folio() instead of ext4_release_folio() here, which doesn't
> make the jbd2_journal_try_to_free_buffers() call. IIUC that call
> tries to clean up already checkpointed buffers.
> 
> I wanted to check whether ->release_folio() can be called for folios with
> ext4 metadata buffers (from my limited understanding of
> shrink_folio_list() -> filemap_release_folio(), it seems it can). And if
> it can be called, is it okay to skip the
> jbd2_journal_try_to_free_buffers() call?

Here, in ->release_folio(), folio->mapping points to inode->i_data (the
file's pagecache), not the block device's pagecache. ext4 metadata
resides in the block device's pagecache, which is at a different layer
than this release_folio callback. So we don't need to call
jbd2_journal_try_to_free_buffers() in the iomap path here.
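
A toy user-space model of that dispatch, as a sketch only: the toy_*
types and functions are hypothetical, cut-down stand-ins and do not
reproduce the kernel's structures. It just shows that the release
callback is picked from the mapping that owns the folio, so a file-data
folio never reaches the callback attached to a different mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio;

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct toy_aops {
	bool (*release_folio)(struct toy_folio *folio);
};

struct toy_mapping {
	const struct toy_aops *a_ops;
};

struct toy_folio {
	struct toy_mapping *mapping;	/* page cache this folio belongs to */
};

static int file_release_calls;	/* calls into the file-data callback */
static int bdev_release_calls;	/* calls into the block-device callback */

static bool toy_file_release(struct toy_folio *folio)
{
	(void)folio;
	file_release_calls++;
	return true;
}

static bool toy_bdev_release(struct toy_folio *folio)
{
	(void)folio;
	bdev_release_calls++;
	return true;
}

static const struct toy_aops toy_file_aops = { .release_folio = toy_file_release };
static const struct toy_aops toy_bdev_aops = { .release_folio = toy_bdev_release };

/* Dispatch purely on the mapping that owns the folio. */
static bool toy_release(struct toy_folio *folio)
{
	assert(folio != NULL && folio->mapping != NULL);
	return folio->mapping->a_ops->release_folio(folio);
}
```
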

Thanks,
Yi.

> 
> Regards,
> ojaswin
> 
>> +
>>  static const struct address_space_operations ext4_dax_aops = {
>>  	.writepages		= ext4_dax_writepages,
>>  	.dirty_folio		= noop_dirty_folio,
>> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
>>  	}
>>  	if (IS_DAX(inode))
>>  		inode->i_mapping->a_ops = &ext4_dax_aops;
>> +	else if (ext4_inode_buffered_iomap(inode))
>> +		inode->i_mapping->a_ops = &ext4_iomap_aops;
>>  	else if (test_opt(inode->i_sb, DELALLOC))
>>  		inode->i_mapping->a_ops = &ext4_da_aops;
>>  	else
>> -- 
>> 2.52.0
>>


* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-19 10:41   ` Ojaswin Mujoo
@ 2026-05-19 13:31     ` Ojaswin Mujoo
  2026-05-20  8:18       ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19 13:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > The data=ordered mode introduces two fundamental conflicts with the
> > iomap buffered write path, leading to potential deadlocks.
> > 
> > 1) Lock ordering conflict
> >    In the iomap writeback path, each folio is processed sequentially:
> >    the folio lock is acquired first, followed by starting a transaction
> >    to create block mappings. In data=ordered mode, writeback triggered
> >    by the journal commit process may attempt to acquire a folio lock
> >    that is already held by iomap. Meanwhile, iomap, under that same
> >    folio lock, may start a new transaction and wait for the currently
> >    committing transaction to finish, resulting in a deadlock.
> 
> Right, makes sense.
> > 
> > 2) Partial folio submission not supported
> >    When block size is smaller than folio size, a folio may contain both
> >    mapped and unmapped blocks. In data=ordered mode, if the journal
> >    waits for such a folio to be written back while the regular writeback
> >    process has already started committing it (with the writeback flag
> >    set), mapping the remaining unmapped blocks can deadlock. This is
> >    because the writeback flag is cleared only after the entire folio is
> >    processed and committed.
> 
> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
> codepaths with issues here:
> 
> txn_commit
>   ordered data writeback (say it goes via iomap)
> 	  folio_lock
> 		iomap_writeback_folio
> 			folio_start_writeback
> 			  iomap_writeback_range
> 				  ext4_map_block
> 					  txn_start
> 						  wait for txn commit - DEADLOCK
> 
> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> pass can_map = 0, so the journal flush makes sure not to start any txn.
> 
> Then we have
> 
> txn_commit                          background writeback (via iomap)
> 
>                                     folio_lock()
>   ordered data writeback
> 	  folio_lock
> 			  
>                                 		iomap_writeback_folio
>                                 			folio_start_writeback
>                                 			  iomap_writeback_range
>                                 				  ext4_map_block
>                                 					  txn_start
> 																						  wait for txn commit - DEADLOCK

Sorry, I forgot to remove the tabs.

This is what I meant:

txn_commit
  ordered data writeback (say it goes via iomap)
    folio_lock
    iomap_writeback_folio
      folio_start_writeback
        iomap_writeback_range
          ext4_map_block
            txn_start
              wait for txn commit - DEADLOCK

Currently we avoid this by having ext4_normal_submit_inode_buffers()
pass can_map = 0, so the journal flush makes sure not to start any txn.

Then we have

txn_commit                          background writeback (via iomap)

                                    folio_lock()
  ordered data writeback
    folio_lock

                                    iomap_writeback_folio
                                      folio_start_writeback
                                        iomap_writeback_range
                                          ext4_map_block
                                            txn_start
                                              wait for txn commit - DEADLOCK
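
To make the second trace concrete, the inversion can be modelled as a
cycle in a tiny wait-for graph. This is a user-space sketch under stated
assumptions: the two resources (folio lock, transaction) and all names
below are hypothetical simplifications, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* The two resources that appear in the traces above. */
enum res { RES_FOLIO_LOCK, RES_TXN, RES_COUNT };

/* edge[a][b]: some task holds a while waiting for b. */
struct wait_graph {
	bool edge[RES_COUNT][RES_COUNT];
};

static void add_order(struct wait_graph *g, enum res held, enum res wanted)
{
	assert(held < RES_COUNT && wanted < RES_COUNT);
	g->edge[held][wanted] = true;
}

/* Depth-first search: can 'target' be reached from 'node'? */
static bool reaches(const struct wait_graph *g, int node, int target,
		    bool *seen)
{
	for (int next = 0; next < RES_COUNT; next++) {
		if (!g->edge[node][next])
			continue;
		if (next == target)
			return true;
		if (!seen[next]) {
			seen[next] = true;
			if (reaches(g, next, target, seen))
				return true;
		}
	}
	return false;
}

/* A cycle in the graph means two tasks can wait on each other forever. */
static bool has_inversion(const struct wait_graph *g)
{
	for (int r = 0; r < RES_COUNT; r++) {
		bool seen[RES_COUNT] = { false };

		if (reaches(g, r, r, seen))
			return true;
	}
	return false;
}
```

Background writeback via iomap adds the edge folio lock -> txn, while the
committing transaction's ordered-data writeback adds txn -> folio lock;
together they form a cycle. If every path starts the txn before taking
the folio lock, only txn -> folio lock edges exist and no cycle appears.
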


> 	  
> Currently, this is taken care of because we start the txn before
> taking any folio locks or starting writeback, and hence we cannot deadlock.
> 
> If the above description makes sense, do you think it'd be good to add
> these traces to the commit message? Although these paths seem
> obvious once you have looked at them a lot, it took me a good bit of time
> to understand which deadlocks you are talking about here :p
> 
> Having call traces like the ones above makes it very clear.
> > 
> > To support data=ordered mode, the iomap core would need two invasive
> > changes:
> >  - Acquire the transaction handle before locking any folio for
> >    writeback.
> >  - Support partial folio submission.
> > 
> > Both changes are complicated and risk performance regressions.
> > Therefore, we must avoid using data=ordered mode when converting to the
> > iomap path.
> > 
> > Currently, data=ordered mode is used in three scenarios:
> >  - Append write
> >  - Post-EOF partial block truncate-up followed by append write
> >  - Online defragmentation
> > 
> > We can address the first two without data=ordered mode:
> >  - For append write: always allocate unwritten blocks (i.e. always
> >    enable dioread_nolock), preserving the behavior of current
> >    extent-type inodes.
> >  - For post-EOF truncate-up + append write: postpone updating i_disksize
> >    until after the zeroed partial block has been written back.
> 
> I'm still going through how the series copes without data=ordered, so I
> will get back to you on this in a while.
> 
> Thanks,
> Ojaswin
> 
> > 
> > Online defragmentation does not yet support iomap; this can be resolved
> > separately in the future.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/ext4/ext4_jbd2.h | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 63d17c5201b5..26999f173870 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
> >  
> >  static inline int ext4_should_order_data(struct inode *inode)
> >  {
> > -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> > +	/*
> > +	 * inodes using the iomap buffered I/O path do not use the
> > +	 * data=ordered mode.
> > +	 */
> > +	return !ext4_inode_buffered_iomap(inode) &&
> > +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
> >  }
> >  
> >  static inline int ext4_should_writeback_data(struct inode *inode)
> > -- 
> > 2.52.0
> > 


* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-19 12:35     ` Zhang Yi
@ 2026-05-19 16:53       ` Ojaswin Mujoo
  2026-05-20  2:49         ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-19 16:53 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
> On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce initial support for iomap in the buffered I/O path for regular
> >> files on ext4.
> >>
> >>   - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
> >>     inode uses iomap instead of buffer_head for buffered I/O
> >>   - Add helper ext4_inode_buffered_iomap() to check the flag
> >>   - Add new address space operations ext4_iomap_aops with callbacks that
> >>     will use generic iomap implementations
> >>   - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
> >>
> >> The following callbacks(read_folio(), readahead(), writepages()) are
> >> provided as placeholders and will be implemented in later patches.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > > Hi Zhang, looks good to me. Just a question below:
> 
> Hi, Ojaswin! Thank you for the review of this series.
> 
> >> ---
> >>  fs/ext4/ext4.h  |  7 +++++++
> >>  fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
> >>  2 files changed, 39 insertions(+)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 94283a991e5c..1e27d73d7427 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -1972,6 +1972,7 @@ enum {
> >>  	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
> >>  	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
> >>  	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
> >> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
> >>  };
> >>  
> >>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> >> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
> >>  		!list_empty(&EXT4_I(inode)->i_orphan);
> >>  }
> >>  
> >> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> >> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> >> +{
> >> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> >> +}
> >> +
> >>  /*
> >>   * Codes for operating systems
> >>   */
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index b1ef706987c3..178ac2be37b7 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >>  	.iomap_begin = ext4_iomap_begin_report,
> >>  };
> >>  
> >> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static void ext4_iomap_readahead(struct readahead_control *rac)
> >> +{
> >> +
> >> +}
> >> +
> >> +static int ext4_iomap_writepages(struct address_space *mapping,
> >> +				 struct writeback_control *wbc)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >>  /*
> >>   * For data=journal mode, folio should be marked dirty only when it was
> >>   * writeably mapped. When that happens, it was already attached to the
> >> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
> >>  	.swap_activate		= ext4_iomap_swap_activate,
> >>  };
> >>  
> >> +static const struct address_space_operations ext4_iomap_aops = {
> >> +	.read_folio		= ext4_iomap_read_folio,
> >> +	.readahead		= ext4_iomap_readahead,
> >> +	.writepages		= ext4_iomap_writepages,
> >> +	.dirty_folio		= iomap_dirty_folio,
> >> +	.bmap			= ext4_bmap,
> >> +	.invalidate_folio	= iomap_invalidate_folio,
> >> +	.release_folio		= iomap_release_folio,
> >> +	.migrate_folio		= filemap_migrate_folio,
> >> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
> >> +	.error_remove_folio	= generic_error_remove_folio,
> >> +	.swap_activate		= ext4_iomap_swap_activate,
> >> +};
> > 
> > > So one question: for ->release_folio() we are using
> > > iomap_release_folio() instead of ext4_release_folio() here, which doesn't
> > > make the jbd2_journal_try_to_free_buffers() call. IIUC this function
> > > seems to be trying to clean up already checkpointed buffers.
> > 
> > I wanted to check if ->release_folio() can be called for folios with
> > ext4 metadata buffers? (from my limited understanding of
> > shrink_folio_list() -> filemap_release_folio() it seems we can) And if
> > it can be called, is it okay to skip the
> > jbd2_journal_try_to_free_buffers call?
> 
> Here, in ->release_folio(), folio->mapping points to inode->i_data (the
> file's pagecache), not the block device's pagecache. ext4 metadata
> resides in the block device's pagecache, which is at a different layer
> than this release_folio callback. So we don't need to call
> jbd2_journal_try_to_free_buffers() in the iomap path here.

Hi Yi,

Thanks for clarifying, and yes, that's what I was missing! So this
->release_folio() is only for data folios. So I guess the
jbd2_journal_try_to_free_buffers() call is mostly there to handle the
data=journal case?

Regardless, with that clarification feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> 
> Thanks,
> Yi.
> 
> > 
> > Regards,
> > ojaswin
> > 
> >> +
> >>  static const struct address_space_operations ext4_dax_aops = {
> >>  	.writepages		= ext4_dax_writepages,
> >>  	.dirty_folio		= noop_dirty_folio,
> >> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
> >>  	}
> >>  	if (IS_DAX(inode))
> >>  		inode->i_mapping->a_ops = &ext4_dax_aops;
> >> +	else if (ext4_inode_buffered_iomap(inode))
> >> +		inode->i_mapping->a_ops = &ext4_iomap_aops;
> >>  	else if (test_opt(inode->i_sb, DELALLOC))
> >>  		inode->i_mapping->a_ops = &ext4_da_aops;
> >>  	else
> >> -- 
> >> 2.52.0
> >>
> 


* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-19 16:53       ` Ojaswin Mujoo
@ 2026-05-20  2:49         ` Zhang Yi
  2026-05-26 17:11           ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-20  2:49 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

On 5/20/2026 12:53 AM, Ojaswin Mujoo wrote:
> On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
>> On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce initial support for iomap in the buffered I/O path for regular
>>>> files on ext4.
>>>>
>>>>    - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
>>>>      inode uses iomap instead of buffer_head for buffered I/O
>>>>    - Add helper ext4_inode_buffered_iomap() to check the flag
>>>>    - Add new address space operations ext4_iomap_aops with callbacks that
>>>>      will use generic iomap implementations
>>>>    - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
>>>>
>>>> The following callbacks(read_folio(), readahead(), writepages()) are
>>>> provided as placeholders and will be implemented in later patches.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>
>>> Hi Zhang, looks good to me. Just a question below:
>>
>> Hi, Ojaswin! Thank you for the review of this series.
>>
>>>> ---
>>>>   fs/ext4/ext4.h  |  7 +++++++
>>>>   fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
>>>>   2 files changed, 39 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 94283a991e5c..1e27d73d7427 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1972,6 +1972,7 @@ enum {
>>>>   	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>>>>   	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
>>>>   	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
>>>> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
>>>>   };
>>>>   
>>>>   #define EXT4_INODE_BIT_FNS(name, field, offset)				\
>>>> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
>>>>   		!list_empty(&EXT4_I(inode)->i_orphan);
>>>>   }
>>>>   
>>>> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
>>>> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
>>>> +{
>>>> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
>>>> +}
>>>> +
>>>>   /*
>>>>    * Codes for operating systems
>>>>    */
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index b1ef706987c3..178ac2be37b7 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>>>   	.iomap_begin = ext4_iomap_begin_report,
>>>>   };
>>>>   
>>>> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void ext4_iomap_readahead(struct readahead_control *rac)
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +static int ext4_iomap_writepages(struct address_space *mapping,
>>>> +				 struct writeback_control *wbc)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * For data=journal mode, folio should be marked dirty only when it was
>>>>    * writeably mapped. When that happens, it was already attached to the
>>>> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
>>>>   	.swap_activate		= ext4_iomap_swap_activate,
>>>>   };
>>>>   
>>>> +static const struct address_space_operations ext4_iomap_aops = {
>>>> +	.read_folio		= ext4_iomap_read_folio,
>>>> +	.readahead		= ext4_iomap_readahead,
>>>> +	.writepages		= ext4_iomap_writepages,
>>>> +	.dirty_folio		= iomap_dirty_folio,
>>>> +	.bmap			= ext4_bmap,
>>>> +	.invalidate_folio	= iomap_invalidate_folio,
>>>> +	.release_folio		= iomap_release_folio,
>>>> +	.migrate_folio		= filemap_migrate_folio,
>>>> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
>>>> +	.error_remove_folio	= generic_error_remove_folio,
>>>> +	.swap_activate		= ext4_iomap_swap_activate,
>>>> +};
>>>
>>> So one question: for ->release_folio() we are using
>>> iomap_release_folio() instead of ext4_release_folio() here, which doesn't
>>> make the jbd2_journal_try_to_free_buffers() call. IIUC this function
>>> seems to be trying to clean up already checkpointed buffers.
>>>
>>> I wanted to check if ->release_folio() can be called for folios with
>>> ext4 metadata buffers? (from my limited understanding of
>>> shrink_folio_list() -> filemap_release_folio() it seems we can) And if
>>> it can be called, is it okay to skip the
>>> jbd2_journal_try_to_free_buffers call?
>>
>> Here, in ->release_folio(), folio->mapping points to inode->i_data (the
>> file's pagecache), not the block device's pagecache. ext4 metadata
>> resides in the block device's pagecache, which is at a different layer
>> than this release_folio callback. So we don't need to call
>> jbd2_journal_try_to_free_buffers() in the iomap path here.
> 
> Hi Yi,
> 
> Thanks for clarifying, and yes, that's what I was missing! So this
> ->release_folio() is only for data folios. So I guess the
> jbd2_journal_try_to_free_buffers() call is mostly there to handle the
> data=journal case?

Yes, that's my understanding as well. Meanwhile, the comment for the
jbd2_journal_try_to_free_buffers() function looks quite outdated and
needs to be updated.

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 4885903bbd10..239bcf88ed1c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
  }

  /**
- * jbd2_journal_try_to_free_buffers() - try to free page buffers.
+ * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
   * @journal: journal for operation
   * @folio: Folio to detach data from.
   *
- * For all the buffers on this page,
- * if they are fully written out ordered data, move them onto BUF_CLEAN
- * so try_to_free_buffers() can reap them.
+ * For each buffer_head on @folio, if the buffer has a journal_head but
+ * is not attached to a running or committing transaction, try to remove
+ * it from the checkpoint list.  This is needed for data=journal mode
+ * where data buffers are journaled: once they are checkpointed, the
+ * journal_head can be detached and the buffer freed.  If any buffer is
+ * still attached to a transaction, the folio cannot be released and we
+ * bail out.  Otherwise we call try_to_free_buffers() to detach all
+ * buffer_heads from the folio.
   *
- * This function returns non-zero if we wish try_to_free_buffers()
- * to be called. We do this if the page is releasable by try_to_free_buffers().
- * We also do it if the page has locked or dirty buffers and the caller wants
- * us to perform sync or async writeout.
+ * For data=ordered and writeback modes, data buffers never have
+ * journal_heads, so this degenerates to a plain try_to_free_buffers().
   *
- * This complicates JBD locking somewhat.  We aren't protected by the
- * BKL here.  We wish to remove the buffer from its committing or
- * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
- *
- * This may *change* the value of transaction_t->t_datalist, so anyone
- * who looks at t_datalist needs to lock against this function.
- *
- * Even worse, someone may be doing a jbd2_journal_dirty_data on this
- * buffer.  So we need to lock against that.  jbd2_journal_dirty_data()
- * will come out of the lock with the buffer dirty, which makes it
- * ineligible for release here.
- *
- * Who else is affected by this?  hmm...  Really the only contender
- * is do_get_write_access() - it could be looking at the buffer while
- * journal_try_to_free_buffer() is changing its state.  But that
- * cannot happen because we never reallocate freed data as metadata
- * while the data is part of a transaction.  Yes?
- *
- * Return false on failure, true on success
+ * Return: true if the folio's buffers were freed, false otherwise
   */
  bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
  {
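
For illustration only, the walk described by the new comment can be
modelled in user space as below. This is a sketch, not the jbd2 code:
struct model_buf and model_try_to_free_buffers() are hypothetical names,
all locking is left out, and try_to_free_buffers() is assumed to succeed
once no buffer holds a journal_head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head and its journal_head state. */
struct model_buf {
	bool has_jh;		/* a journal_head is attached */
	bool in_transaction;	/* on a running or committing transaction */
	bool cp_removable;	/* checkpoint-list removal would succeed */
};

/*
 * Model of the per-buffer walk. Returns true when no buffer keeps a
 * journal_head, i.e. when the real function would go on to call
 * try_to_free_buffers() (assumed to succeed in this model).
 */
bool model_try_to_free_buffers(struct model_buf *bufs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct model_buf *b = &bufs[i];

		if (!b->has_jh)
			continue;	/* ordered/writeback data: nothing to detach */
		if (!b->in_transaction && b->cp_removable)
			b->has_jh = false;	/* checkpointed: journal_head detached */
		if (b->has_jh)
			return false;	/* still busy, folio cannot be released */
	}
	return true;
}
```

In this model a folio whose buffers have no journal_head always reaches
the final "return true", which is the degenerate case the comment
describes for data=ordered and data=writeback.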

Thanks,
Yi.





* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-19 13:31     ` Ojaswin Mujoo
@ 2026-05-20  8:18       ` Zhang Yi
  2026-05-20 13:17         ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-20  8:18 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 5/19/2026 9:31 PM, Ojaswin Mujoo wrote:
> On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> The data=ordered mode introduces two fundamental conflicts with the
>>> iomap buffered write path, leading to potential deadlocks.
>>>
>>> 1) Lock ordering conflict
>>>    In the iomap writeback path, each folio is processed sequentially:
>>>    the folio lock is acquired first, followed by starting a transaction
>>>    to create block mappings. In data=ordered mode, writeback triggered
>>>    by the journal commit process may attempt to acquire a folio lock
>>>    that is already held by iomap. Meanwhile, iomap, under that same
>>>    folio lock, may start a new transaction and wait for the currently
>>>    committing transaction to finish, resulting in a deadlock.
>>
>> Right, makes sense.
>>>
>>> 2) Partial folio submission not supported
>>>    When block size is smaller than folio size, a folio may contain both
>>>    mapped and unmapped blocks. In data=ordered mode, if the journal
>>>    waits for such a folio to be written back while the regular writeback
>>>    process has already started committing it (with the writeback flag
>>>    set), mapping the remaining unmapped blocks can deadlock. This is
>>>    because the writeback flag is cleared only after the entire folio is
>>>    processed and committed.
>>
>> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
>> codepaths with issues here:
>>
>> txn_commit
>>   ordered data writeback (say it goes via iomap)
>> 	  folio_lock
>> 		iomap_writeback_folio
>> 			folio_start_writeback
>> 			  iomap_writeback_range
>> 				  ext4_map_block
>> 					  txn_start
>> 						  wait for tnx commit - DEADLOCK
>>
>> Currently we avoid this by having ext4_normal_submit_inode_buffers()
>> pass can_map = 0 so journal flush makese sure not to start any txn.
>>
>> Then we have
>>
>> txn_commit                          background writeback (via iomap)
>>
>>                                     folio_lock()
>>   ordered data writeback
>> 	  folio_lock
>> 			  
>>                                 		iomap_writeback_folio
>>                                 			folio_start_writeback
>>                                 			  iomap_writeback_range
>>                                 				  ext4_map_block
>>                                 					  txn_start
>> 																						  wait for txn commit - DEADLOCK
> 
> Sorry I forget to remove tabs
> 
> this is what I meant:
> 
> txn_commit
>   ordered data writeback (say it goes via iomap)
>     folio_lock
>     iomap_writeback_folio
>       folio_start_writeback
>         iomap_writeback_range
>           ext4_map_block
>             txn_start
>               wait for txn commit - DEADLOCK
> 
> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> pass can_map = 0 so the journal flush makes sure not to start any txn.

Yeah, but we can also solve this problem by adding similar tags. This
is not the most difficult part.

> 
> Then we have
> 
> txn_commit                          background writeback (via iomap)
> 
>                                     folio_lock()
>   ordered data writeback
>     folio_lock
> 
>                                     iomap_writeback_folio
>                                       folio_start_writeback
>                                         iomap_writeback_range
>                                           ext4_map_block
>                                             txn_start
>                                               wait for txn commit - DEADLOCK
> 
> 
>> 	  
>> Currently, this is taken care because we try to start the txn before
>> taking any folio locks/starting writeback, and hence we cannot deadlock.

Yeah. You are right! Actually, this deadlock scenario should essentially
belong to the first category: "Lock ordering conflict". This is not the
scenario I want to describe here. The problematic scenario is as
follows:

T0: Assume we have a folio that contains four blocks: A, B, C and D, from
    front to back. The last block D is written in delalloc mode (the
    block is not allocated yet).

T1: The writeback process starts to write back data, sets the writeback
    flag on the folio, allocates block D, and adds it to transaction N's
    ordered list of jbd2 in JI_WAIT_DATA mode.

T2: This folio completes the writeback and the writeback flag is cleared.

T3: Before transaction N commits, we punch blocks B and C, and then
    overwrite A-C.

T4: Transaction N's commit and folio writeback run concurrently.

Transaction N commit        folio writeback(iomap)

                            iomap_writeback_folio()
                             folio_start_writeback()  -- set writeback

jbd2_journal_finish_inode_data_buffers()
 __filemap_fdatawait_range()
  -- wait writeback flag to clear
                               iomap_writeback_range()
                                ext4_map_block() -- map block B and C
                                 start handle
                                  wait for transaction N commit
                                   - DEADLOCK

IOMAP does not support submitting partial folios during writeback.
Therefore, the writeback flag is cleared only after the entire folio
has been submitted. As a result, the commit of transaction N would wait
forever for this flag to be cleared if we need to map some blocks in
this folio.

Currently, this is handled by ext4_bio_write_folio(), which supports
writing back partial folios. The writeback flag is only set after the
block has been mapped and before the bio is actually issued. There are
no other limitations that would block this flag from being cleared
after the I/O is completed.
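
The ordering difference described above can be sketched as a small
user-space C model. This is only an illustration under stated
assumptions, not kernel code: the step names and
handle_started_under_writeback() are hypothetical, and each path is
reduced to the order in which it sets the writeback flag and starts a
handle. A path is exposed to the T4 deadlock only if it can start a
handle while the folio's writeback flag is already set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step names; they model the traces, not real kernel calls. */
enum wb_step {
	SET_WRITEBACK,	/* folio writeback flag is set */
	START_HANDLE,	/* may block until the committing transaction is done */
	MAP_BLOCKS,	/* allocate/map the remaining blocks */
	SUBMIT_BIO,	/* issue the I/O; its completion clears the flag */
};

/* iomap writeback: the flag is set before any block is mapped. */
const enum wb_step iomap_order[] = {
	SET_WRITEBACK, START_HANDLE, MAP_BLOCKS, SUBMIT_BIO,
};

/* buffer_head path: map first, set the flag just before issuing the bio. */
const enum wb_step bh_order[] = {
	START_HANDLE, MAP_BLOCKS, SET_WRITEBACK, SUBMIT_BIO,
};

/*
 * Return true if the sequence starts a handle while the writeback flag is
 * already set. In that window the commit can wait for the flag while the
 * handle start waits for the commit.
 */
bool handle_started_under_writeback(const enum wb_step *seq, size_t n)
{
	bool writeback = false;

	for (size_t i = 0; i < n; i++) {
		if (seq[i] == SET_WRITEBACK)
			writeback = true;
		else if (seq[i] == START_HANDLE && writeback)
			return true;
	}
	return false;
}
```

With these definitions the check flags iomap_order and not bh_order,
which matches the explanation above.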

>>
>> If the above description makes sense, do you think it'd be good to add
>> them to the commit message. The reason is that although these paths seem
>> obvious when we look at them a lot, it took me a good bit of time to
>> understand what deadlocks you are talking about here :p
>>
>> Having the code traces like above makes it very clear.

Indeed, these problematic cases are complicated and subtle. I also spent
some time recalling this scenario. I can add these code traces in my next
iteration.

Thanks,
Yi.

>>>
>>> To support data=ordered mode, the iomap core would need two invasive
>>> changes:
>>>  - Acquire the transaction handle before locking any folio for
>>>    writeback.
>>>  - Support partial folio submission.
>>>
>>> Both changes are complicated and risk performance regressions.
>>> Therefore, we must avoid using data=ordered mode when converting to the
>>> iomap path.
>>>
>>> Currently, data=ordered mode is used in three scenarios:
>>>  - Append write
>>>  - Post-EOF partial block truncate-up followed by append write
>>>  - Online defragmentation
>>>
>>> We can address the first two without data=ordered mode:
>>>  - For append write: always allocate unwritten blocks (i.e. always
>>>    enable dioread_nolock), preserving the behavior of current
>>>    extent-type inodes.
>>>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>>>    until after the zeroed partial block has been written back.
>>
>> I'm still going through how we are addressing no data=ordered so will
>> get back on this in some time.
>>
>> Thanks,
>> Ojaswin
>>
>>>
>>> Online defragmentation does not yet support iomap; this can be resolved
>>> separately in the future.
>>>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>> index 63d17c5201b5..26999f173870 100644
>>> --- a/fs/ext4/ext4_jbd2.h
>>> +++ b/fs/ext4/ext4_jbd2.h
>>> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>>>  
>>>  static inline int ext4_should_order_data(struct inode *inode)
>>>  {
>>> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
>>> +	/*
>>> +	 * inodes using the iomap buffered I/O path do not use the
>>> +	 * data=ordered mode.
>>> +	 */
>>> +	return !ext4_inode_buffered_iomap(inode) &&
>>> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>>>  }
>>>  
>>>  static inline int ext4_should_writeback_data(struct inode *inode)
>>> -- 
>>> 2.52.0
>>>



* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-20  8:18       ` Zhang Yi
@ 2026-05-20 13:17         ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-20 13:17 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Wed, May 20, 2026 at 04:18:48PM +0800, Zhang Yi wrote:
> On 5/19/2026 9:31 PM, Ojaswin Mujoo wrote:
> > On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> The data=ordered mode introduces two fundamental conflicts with the
> >>> iomap buffered write path, leading to potential deadlocks.
> >>>
> >>> 1) Lock ordering conflict
> >>>    In the iomap writeback path, each folio is processed sequentially:
> >>>    the folio lock is acquired first, followed by starting a transaction
> >>>    to create block mappings. In data=ordered mode, writeback triggered
> >>>    by the journal commit process may attempt to acquire a folio lock
> >>>    that is already held by iomap. Meanwhile, iomap, under that same
> >>>    folio lock, may start a new transaction and wait for the currently
> >>>    committing transaction to finish, resulting in a deadlock.
> >>
> >> Right, makes sense.
> >>>
> >>> 2) Partial folio submission not supported
> >>>    When block size is smaller than folio size, a folio may contain both
> >>>    mapped and unmapped blocks. In data=ordered mode, if the journal
> >>>    waits for such a folio to be written back while the regular writeback
> >>>    process has already started committing it (with the writeback flag
> >>>    set), mapping the remaining unmapped blocks can deadlock. This is
> >>>    because the writeback flag is cleared only after the entire folio is
> >>>    processed and committed.
> >>
> >> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
> >> codepaths with issues here:
> >>
> >> txn_commit
> >>   ordered data writeback (say it goes via iomap)
> >> 	  folio_lock
> >> 		iomap_writeback_folio
> >> 			folio_start_writeback
> >> 			  iomap_writeback_range
> >> 				  ext4_map_block
> >> 					  txn_start
> >> 						  wait for tnx commit - DEADLOCK
> >>
> >> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> >> pass can_map = 0 so journal flush makese sure not to start any txn.
> >>
> >> Then we have
> >>
> >> txn_commit                          background writeback (via iomap)
> >>
> >>                                     folio_lock()
> >>   ordered data writeback
> >> 	  folio_lock
> >> 			  
> >>                                 		iomap_writeback_folio
> >>                                 			folio_start_writeback
> >>                                 			  iomap_writeback_range
> >>                                 				  ext4_map_block
> >>                                 					  txn_start
> >> 																						  wait for txn commit - DEADLOCK
> > 
> > Sorry I forget to remove tabs
> > 
> > this is what I meant:
> > 
> > txn_commit
> >   ordered data writeback (say it goes via iomap)
> >     folio_lock
> >     iomap_writeback_folio
> >       folio_start_writeback
> >         iomap_writeback_range
> >           ext4_map_block
> >             txn_start
> > >               wait for txn commit - DEADLOCK
> > > 
> > > Currently we avoid this by having ext4_normal_submit_inode_buffers()
> > > pass can_map = 0 so the journal flush makes sure not to start any txn.
> 
> Yeah, but we can also solve this problem by adding similar tags. This
> is not the most difficult part.
> 
> > 
> > Then we have
> > 
> > txn_commit                          background writeback (via iomap)
> > 
> >                                     folio_lock()
> >   ordered data writeback
> >     folio_lock
> > 
> >                                     iomap_writeback_folio
> >                                       folio_start_writeback
> >                                         iomap_writeback_range
> >                                           ext4_map_block
> >                                             txn_start
> >                                               wait for txn commit - DEADLOCK
> > 
> > 
> >> 	  
> >> Currently, this is taken care because we try to start the txn before
> >> taking any folio locks/starting writeback, and hence we cannot deadlock.
> 
> Yeah. You are right! Actually, this deadlock scenario should essentially
> belong to the first category: "Lock ordering conflict". This is not the
> scenario I want to describe here. The problematic scenario is as
> follows:
> 
> T0: Assume we have a folio that contains four blocks: A, B, C and D, from
>     front to back. The last block D is written in delalloc mode (the
>     block is not allocated yet).
> 
> T1: The writeback process starts to write back data, sets the writeback
>     flag on the folio, allocates block D, and adds it to transaction N's
>     ordered list of jbd2 in JI_WAIT_DATA mode.
> 
> T2: This folio completes the writeback and the writeback flag is cleared.
> 
> T3: Before transaction N commits, we punch blocks B and C, and then
>     overwrite A-C.
> 
> T4: Transaction N's commit and folio writeback run concurrently.
> 
> Transaction N commit        folio writeback(iomap)
> 
>                             iomap_writeback_folio()
>                              folio_start_writeback()  -- set writeback
> 
> jbd2_journal_finish_inode_data_buffers()
>  __filemap_fdatawait_range()
>   -- wait writeback flag to clear
>                                iomap_writeback_range()
>                                 ext4_map_block() -- map block B and C
>                                  start handle
>                                   wait for transaction N commit
>                                    - DEADLOCK
> 
> IOMAP does not support submitting partial folios during writeback.
> Therefore, the writeback flag is cleared only after the entire folio
> has been submitted. As a result, the commit of transaction N would wait
> forever for this flag to be cleared if we need to map some blocks in
> this folio.

Hey Zhang, 

thanks for the traces, it makes it so much clearer.

> 
> Currently, this is handled by ext4_bio_write_folio(), which supports
> writing back partial folios. The writeback flag is only set after the
> block has been mapped and before the bio is actually issued. There are
> no other limitations that would block this flag from being cleared
> after the I/O is completed.

So IIUC the issue with iomap is that we set the writeback flag before
we start a txn handle, and this way a txn handle can wait for a commit
which is itself waiting for writeback on the same folio to be cleared.

ext4_do_writepages() currently starts the txn first, allocates, and then
sets writeback and starts submitting buffers. This sequence ensures that
we don't get into the above deadlock because we never wait for a commit
with a folio lock/writeback held.

Thanks,
Ojaswin
> 
> >>
> >> If the above description makes sense, do you think it'd be good to add
> >> them to the commit message. The reason is that although these paths seem
> >> obvious when we look at them a lot, it took me a good bit of time to
> >> understand what deadlocks you are talking about here :p
> >>
> >> Having the code traces like above makes it very clear.
> 
> Indeed, these problematic cases are complicated and subtle. I also spent
> some time recalling this scenario. I can add these code traces in my next
> iteration.
> 
> Thanks,
> Yi.
> 
> >>>
> >>> To support data=ordered mode, the iomap core would need two invasive
> >>> changes:
> >>>  - Acquire the transaction handle before locking any folio for
> >>>    writeback.
> >>>  - Support partial folio submission.
> >>>
> >>> Both changes are complicated and risk performance regressions.
> >>> Therefore, we must avoid using data=ordered mode when converting to the
> >>> iomap path.
> >>>
> >>> Currently, data=ordered mode is used in three scenarios:
> >>>  - Append write
> >>>  - Post-EOF partial block truncate-up followed by append write
> >>>  - Online defragmentation
> >>>
> >>> We can address the first two without data=ordered mode:
> >>>  - For append write: always allocate unwritten blocks (i.e. always
> >>>    enable dioread_nolock), preserving the behavior of current
> >>>    extent-type inodes.
> >>>  - For post-EOF truncate-up + append write: postpone updating i_disksize
> >>>    until after the zeroed partial block has been written back.
> >>
> >> I'm still going through how we are addressing no data=ordered so will
> >> get back on this in some time.
> >>
> >> Thanks,
> >> Ojaswin
> >>
> >>>
> >>> Online defragmentation does not yet support iomap; this can be resolved
> >>> separately in the future.
> >>>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>>  fs/ext4/ext4_jbd2.h | 7 ++++++-
> >>>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>> index 63d17c5201b5..26999f173870 100644
> >>> --- a/fs/ext4/ext4_jbd2.h
> >>> +++ b/fs/ext4/ext4_jbd2.h
> >>> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
> >>>  
> >>>  static inline int ext4_should_order_data(struct inode *inode)
> >>>  {
> >>> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> >>> +	/*
> >>> +	 * inodes using the iomap buffered I/O path do not use the
> >>> +	 * data=ordered mode.
> >>> +	 */
> >>> +	return !ext4_inode_buffered_iomap(inode) &&
> >>> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
> >>>  }
> >>>  
> >>>  static inline int ext4_should_writeback_data(struct inode *inode)
> >>> -- 
> >>> 2.52.0
> >>>
> 

* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
@ 2026-05-26 17:10   ` Ojaswin Mujoo
  2026-05-28 15:40     ` Darrick J. Wong
  2026-05-29  9:13     ` Zhang Yi
  2026-06-02 10:26   ` Ojaswin Mujoo
  2026-06-16 10:45   ` Jan Kara
  2 siblings, 2 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-26 17:10 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce two new iomap_ops instances for ext4 buffered writes:
> 
>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>    ext4_da_map_blocks() to map delalloc extents.
>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>    ext4_iomap_get_blocks() to directly allocate blocks.
> 
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
> 
> Key changes and considerations:
> 
>  - Unwritten extents for new blocks (dioread_nolock always on)
>    Since data=ordered mode is not used to prevent stale data exposure in
>    the non-delayed allocation path, new blocks are always allocated as
>    unwritten extents.

Okay makes sense.

> 
>  - Short write and write failure handling
>    a. Delalloc path: On short write or failure, the stale delalloc range
>       must be dropped and its space reservation released. Otherwise, a
>       clean folio may cover leftover delalloc extents, causing
>       inaccurate space reservation accounting.

Hmm, okay, so in the usual buffer_head path, it seems that during a short
write we still zero the new buffers we couldn't write and keep them dirty
(folio_zero_new_buffers()). This way they are still written back and
the delalloc reservations are used up.

However, in iomap we don't mark the range that we couldn't write as dirty,
so we need to make sure we clear up the stale delalloc mappings. Is this
correct?
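
To make the range concrete, here is a minimal userspace sketch of how I
read the start/end computation in ext4_iomap_buffered_da_write_end()
further down. It is a model, not kernel code: stale_delalloc_range(),
struct punch_range and the blk_round_*() helpers are names made up for
illustration, and last_written_block() is my understanding of what
iomap_last_written_block() does (round down if nothing was written,
otherwise round up past the partially written block).

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for round_down()/round_up(); blocksize is a power of two. */
static int64_t blk_round_down(int64_t x, int64_t blocksize)
{
	return x & ~(blocksize - 1);
}

static int64_t blk_round_up(int64_t x, int64_t blocksize)
{
	return blk_round_down(x + blocksize - 1, blocksize);
}

/* Assumed behaviour of iomap_last_written_block(): first untouched block. */
static int64_t last_written_block(int64_t pos, int64_t written,
				  int64_t blocksize)
{
	if (written == 0)
		return blk_round_down(pos, blocksize);
	return blk_round_up(pos + written, blocksize);
}

/* Hypothetical container for the byte range [start, end) to punch. */
struct punch_range {
	int64_t start;
	int64_t end;
};

/*
 * Returns 1 and fills *r when a short write leaves delalloc blocks that
 * were reserved but never dirtied; returns 0 when the whole range was
 * written and nothing needs to be released.
 */
static int stale_delalloc_range(int64_t offset, int64_t length,
				int64_t written, int64_t blocksize,
				struct punch_range *r)
{
	assert(blocksize > 0 && (blocksize & (blocksize - 1)) == 0);
	assert(written >= 0);

	r->start = last_written_block(offset, written, blocksize);
	r->end = blk_round_up(offset + length, blocksize);
	return r->start < r->end;
}
```

With 4k blocks, a 16k mapping at offset 0 where only 5000 bytes were
copied would leave [8192, 16384) to be punched under this model, while a
fully written mapping leaves nothing.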

Regards,
Ojaswin

>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>       short write.
> 
>  - Lock ordering reversal
>    The folio lock and transaction start ordering is reversed compared to
>    the buffer_head buffered write path. To handle this, the journal
>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>    documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h  |   4 ++
>  fs/ext4/file.c  |  20 +++++-
>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  10 ++-
>  4 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
>  void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2
>  
> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>  
>  extern const struct iomap_ops ext4_iomap_ops;
>  extern const struct iomap_ops ext4_iomap_report_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> +extern const struct iomap_write_ops ext4_iomap_write_ops;
>  
>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>  {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..7f9bfbbc4a4e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return count;
>  }
>  
> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> +					 struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	const struct iomap_ops *iomap_ops;
> +
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> +					 &ext4_iomap_write_ops, NULL);
> +}
> +
>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  					struct iov_iter *from)
>  {
> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  	if (ret <= 0)
>  		goto out;
>  
> -	ret = generic_perform_write(iocb, from);
> +	if (ext4_inode_buffered_iomap(inode))
> +		ret = ext4_iomap_buffered_write(iocb, from);
> +	else
> +		ret = generic_perform_write(iocb, from);
>  
>  out:
>  	inode_unlock(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 39577a6b65b9..1ae7d3f4a1c8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> -static int ext4_nonda_switch(struct super_block *sb)
> +int ext4_nonda_switch(struct super_block *sb)
>  {
>  	s64 free_clusters, dirty_clusters;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
>  }
>  
> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> +{
> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> +}
> +
> +const struct iomap_write_ops ext4_iomap_write_ops = {
> +	.iomap_valid = ext4_iomap_valid,
> +};
> +
>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  			   struct ext4_map_blocks *map, loff_t offset,
>  			   loff_t length, unsigned int flags)
> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		iomap->flags |= IOMAP_F_MERGED;
>  
> +	iomap->validity_cookie = map->m_seq;
> +
>  	/*
>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> +/* Map blocks */
> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> +
>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> -		loff_t length, struct ext4_map_blocks *map)
> +		loff_t length, ext4_get_blocks_t get_blocks,
> +		struct ext4_map_blocks *map)
>  {
>  	u8 blkbits = inode->i_blkbits;
>  
> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
>  
> +	if (get_blocks)
> +		return get_blocks(inode, map);
> +
>  	return ext4_map_blocks(NULL, inode, map, 0);
>  }
>  
> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>  		return -ERANGE;
>  
> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>  	if (ret < 0)
>  		return ret;
>  
> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_get_blocks(struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	loff_t i_size = i_size_read(inode);
> +	handle_t *handle;
> +	int ret;
> +
> +	/*
> +	 * Check if the blocks have already been allocated, this could
> +	 * avoid initiating a new journal transaction and return the
> +	 * mapping information directly.
> +	 */
> +	if ((map->m_lblk + map->m_len) <=
> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> +		if (ret < 0)
> +			return ret;
> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> +				    EXT4_MAP_DELAYED))
> +			return 0;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> +			ext4_chunk_trans_blocks(inode, map->m_len));
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	ret = ext4_map_blocks(handle, inode, map,
> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> +	/*
> +	 * Stop handle here following the lock ordering of the folio lock
> +	 * and the transaction start.
> +	 */
> +	ext4_journal_stop(handle);
> +
> +	return ret;
> +}
> +
> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> +{
> +	int ret, retries = 0;
> +	struct ext4_map_blocks map;
> +	ext4_get_blocks_t *get_blocks;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Inline data and non-extent are not supported. */
> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> +		return -ERANGE;
> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> +		return -EINVAL;
> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +		return -EINVAL;
> +
> +	if (delalloc)
> +		get_blocks = ext4_da_map_blocks;
> +	else
> +		get_blocks = ext4_iomap_get_blocks;
> +retry:
> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +	if (ret < 0)
> +		return ret;
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length, struct iomap *iomap)
> +{
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> +					    loff_t length, ssize_t written,
> +					    unsigned int flags,
> +					    struct iomap *iomap)
> +{
> +	loff_t start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> +				     iomap, ext4_iomap_punch_delalloc);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return 0;
> +}
> +
> +/*
> + * Since we always allocate unwritten extents, there is no need for
> + * iomap_end to clean up allocated blocks on a short write.
> + */
> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_write_begin,
> +};
> +
> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> +	.iomap_end = ext4_iomap_buffered_da_write_end,
> +};
> +
>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
>  	.iomap_begin = ext4_iomap_buffered_read_begin,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..9bc294b769db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   *   -> page lock -> i_data_sem (rw)
>   *
>   * buffered write path:
> - * sb_start_write -> i_mutex -> mmap_lock
> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> - *   i_data_sem (rw)
> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> + * - buffer_head path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> + *     i_data_sem (rw)
> + * - iomap path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
>   *
>   * truncate:
>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> -- 
> 2.52.0
> 

* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
  2026-05-20  2:49         ` Zhang Yi
@ 2026-05-26 17:11           ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-26 17:11 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai

On Wed, May 20, 2026 at 10:49:50AM +0800, Zhang Yi wrote:
> On 5/20/2026 12:53 AM, Ojaswin Mujoo wrote:
> > On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
> > > On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> > > > On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > > 
> > > > > Introduce initial support for iomap in the buffered I/O path for regular
> > > > > files on ext4.
> > > > > 
> > > > >    - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
> > > > >      inode uses iomap instead of buffer_head for buffered I/O
> > > > >    - Add helper ext4_inode_buffered_iomap() to check the flag
> > > > >    - Add new address space operations ext4_iomap_aops with callbacks that
> > > > >      will use generic iomap implementations
> > > > >    - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
> > > > > 
> > > > > The following callbacks(read_folio(), readahead(), writepages()) are
> > > > > provided as placeholders and will be implemented in later patches.
> > > > > 
> > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > 
> > > > Hi Zhang, looks good to me. Just a questions below:
> > > 
> > > Hi, Ojaswin! Thank you for the review of this series.
> > > 
> > > > > ---
> > > > >   fs/ext4/ext4.h  |  7 +++++++
> > > > >   fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
> > > > >   2 files changed, 39 insertions(+)
> > > > > 
> > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > index 94283a991e5c..1e27d73d7427 100644
> > > > > --- a/fs/ext4/ext4.h
> > > > > +++ b/fs/ext4/ext4.h
> > > > > @@ -1972,6 +1972,7 @@ enum {
> > > > >   	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
> > > > >   	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
> > > > >   	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
> > > > > +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
> > > > >   };
> > > > >   #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> > > > > @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
> > > > >   		!list_empty(&EXT4_I(inode)->i_orphan);
> > > > >   }
> > > > > +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> > > > > +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> > > > > +{
> > > > > +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> > > > > +}
> > > > > +
> > > > >   /*
> > > > >    * Codes for operating systems
> > > > >    */
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index b1ef706987c3..178ac2be37b7 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
> > > > >   	.iomap_begin = ext4_iomap_begin_report,
> > > > >   };
> > > > > +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void ext4_iomap_readahead(struct readahead_control *rac)
> > > > > +{
> > > > > +
> > > > > +}
> > > > > +
> > > > > +static int ext4_iomap_writepages(struct address_space *mapping,
> > > > > +				 struct writeback_control *wbc)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > >   /*
> > > > >    * For data=journal mode, folio should be marked dirty only when it was
> > > > >    * writeably mapped. When that happens, it was already attached to the
> > > > > @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
> > > > >   	.swap_activate		= ext4_iomap_swap_activate,
> > > > >   };
> > > > > +static const struct address_space_operations ext4_iomap_aops = {
> > > > > +	.read_folio		= ext4_iomap_read_folio,
> > > > > +	.readahead		= ext4_iomap_readahead,
> > > > > +	.writepages		= ext4_iomap_writepages,
> > > > > +	.dirty_folio		= iomap_dirty_folio,
> > > > > +	.bmap			= ext4_bmap,
> > > > > +	.invalidate_folio	= iomap_invalidate_folio,
> > > > > +	.release_folio		= iomap_release_folio,
> > > > > +	.migrate_folio		= filemap_migrate_folio,
> > > > > +	.is_partially_uptodate  = iomap_is_partially_uptodate,
> > > > > +	.error_remove_folio	= generic_error_remove_folio,
> > > > > +	.swap_activate		= ext4_iomap_swap_activate,
> > > > > +};
> > > > 
> > > > So one question, for ->release_folio() we are using
> > > > iomap_release_folio() instead of ext4_release_folio() here which doesnt
> > > > make the jbd2_journal_try_to_free_bufferes() call. IIUC this function
> > > > seems to be trying to clean up already checkpointed buffers.
> > > > 
> > > > I wanted to check if ->release_folio() can be called for folios with
> > > > ext4 metadata buffers? (from my limited understanding of
> > > > shrink_folio_list() -> filemap_release_folio() it seems we can) And if
> > > > it can be called, is it okay to skip the
> > > > jbd2_journal_try_to_free_buffers call?
> > > 
> > > Here, in ->release_folio(), folio->mapping points to inode->i_data (the
> > > file's pagecache), not the block device's pagecache. ext4 metadata
> > > resides in the block device's pagecache, which is at a different layer
> > > than this release_folio callback. So we don't need to call
> > > jbd2_journal_try_to_free_buffers() in the iomap path here.
> > 
> > Hi Yi,
> > 
> > Thanks for clarify and yes, thats what I was missing! So this
> > ->release_folio() is only for data folios. So I guess the
> > jbd2_journal_try_to_free_buffers() is mostly to handle data=journal
> > case?
> 
> Yes, that's my understanding as well. Meanwhile, the comment for the
> jbd2_journal_try_to_free_buffers() function looks quite outdated and
> needs to be updated.

Looks good, thanks for the explanation and for fixing it.

Regards,
ojaswin

> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..239bcf88ed1c 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
>  }
> 
>  /**
> - * jbd2_journal_try_to_free_buffers() - try to free page buffers.
> + * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
>   * @journal: journal for operation
>   * @folio: Folio to detach data from.
>   *
> - * For all the buffers on this page,
> - * if they are fully written out ordered data, move them onto BUF_CLEAN
> - * so try_to_free_buffers() can reap them.
> + * For each buffer_head on @folio, if the buffer has a journal_head but
> + * is not attached to a running or committing transaction, try to remove
> + * it from the checkpoint list.  This is needed for data=journal mode
> + * where data buffers are journaled: once they are checkpointed, the
> + * journal_head can be detached and the buffer freed.  If any buffer is
> + * still attached to a transaction, the folio cannot be released and we
> + * bail out.  Otherwise we call try_to_free_buffers() to detach all
> + * buffer_heads from the folio.
>   *
> - * This function returns non-zero if we wish try_to_free_buffers()
> - * to be called. We do this if the page is releasable by try_to_free_buffers().
> - * We also do it if the page has locked or dirty buffers and the caller wants
> - * us to perform sync or async writeout.
> + * For data=ordered and writeback modes, data buffers never have
> + * journal_heads, so this degenerates to a plain try_to_free_buffers().
>   *
> - * This complicates JBD locking somewhat.  We aren't protected by the
> - * BKL here.  We wish to remove the buffer from its committing or
> - * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
> - *
> - * This may *change* the value of transaction_t->t_datalist, so anyone
> - * who looks at t_datalist needs to lock against this function.
> - *
> - * Even worse, someone may be doing a jbd2_journal_dirty_data on this
> - * buffer.  So we need to lock against that.  jbd2_journal_dirty_data()
> - * will come out of the lock with the buffer dirty, which makes it
> - * ineligible for release here.
> - *
> - * Who else is affected by this?  hmm...  Really the only contender
> - * is do_get_write_access() - it could be looking at the buffer while
> - * journal_try_to_free_buffer() is changing its state.  But that
> - * cannot happen because we never reallocate freed data as metadata
> - * while the data is part of a transaction.  Yes?
> - *
> - * Return false on failure, true on success
> + * Return: true if the folio's buffers were freed, false otherwise
>   */
>  bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
>  {
> 
> Thanks,
> Yi.
> 
> 
> 

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
@ 2026-05-27 12:49   ` Ojaswin Mujoo
  2026-05-29 14:12     ` Zhang Yi
  2026-06-16 11:47   ` Jan Kara
  1 sibling, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 12:49 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> 
>  - ext4_iomap_writepages(): the main writeback entry point.
>  - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>    block mapping and I/O submission.
>  - A new end I/O worker for converting unwritten extents, updating file
>    size, and handling DATA_ERR_ABORT after I/O completion.
> 
> Core implementation details:
> 
>  - ->writeback_range() callback
>    Calls ext4_iomap_map_writeback_range() to query the longest range of
>    existing mapped extents. For performance, when a block range is not
>    yet allocated, it allocates based on the writeback length and delalloc
>    extent length, rather than allocating for a single folio at a time.
>    The folio is then added to an iomap_ioend instance.
> 
>  - ->writeback_submit() callback
>    Registers ext4_iomap_end_bio() as the end bio callback. This callback
>    schedules a worker to handle:
>    - Unwritten extent conversion.
>    - i_disksize update after data is written back.
>    - Journal abort on writeback I/O failure.

Hi Zhang, the changes look good. I have a few comments below:
> 
> Key changes and considerations:
> 
> - Append write and unwritten extents
>   Since data=ordered mode is not used to prevent stale data exposure
>   during append writebacks, new blocks are always allocated as unwritten
>   extents (i.e. always enable dioread_nolock), and i_disksize update is
>   postponed until I/O completion. 

Makes sense.

>   Additionally, the deadlock that the
>   reserve handle was expected to resolve does not occur anymore.

I guess this is because we don't use ordered data, so we can't block on
starting a txn in end io.

>   Therefore, the end I/O worker can start a normal journal handle
>   instead of a reserve handle when converting unwritten extents.
> 
> - Lock ordering
>   The ->writeback_range() callback runs under the folio lock, requiring
>   the journal handle to be started under that same lock. This reverses
>   the order compared to the buffer_head writeback path. The lock ordering
>   documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h        |   4 +
>  fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>  fs/ext4/super.c       |   7 +-
>  fs/iomap/ioend.c      |   3 +-
>  include/linux/iomap.h |   1 +
>  6 files changed, 346 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4832e7f7db82..078feda47e36 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>  	 */
>  	struct list_head i_rsv_conversion_list;
>  	struct work_struct i_rsv_conversion_work;
> +	struct list_head i_iomap_ioend_list;
> +	struct work_struct i_iomap_ioend_work;
>  
>  	/*
>  	 * Transactions that contain inode's metadata needed to complete
> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
>  		size_t len);
>  extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
>  extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> +extern void ext4_iomap_end_io(struct work_struct *work);
> +extern void ext4_iomap_end_bio(struct bio *bio);
>  
>  /* mmp.c */
>  extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ae7d3f4a1c8..a80195bd6f20 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -44,6 +44,7 @@
>  #include <linux/iversion.h>
>  
>  #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
>  #include "xattr.h"
>  #include "acl.h"
>  #include "truncate.h"
> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>  	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>  }
>  
> +static int ext4_iomap_map_one_extent(struct inode *inode,
> +				     struct ext4_map_blocks *map)
> +{
> +	struct extent_status es;
> +	handle_t *handle = NULL;
> +	int credits, map_flags;
> +	int retval;
> +
> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	map->m_flags = 0;
> +	/*
> +	 * It is necessary to look up extent and map blocks under i_data_sem
> +	 * in write mode, otherwise, the delalloc extent may become stale
> +	 * during concurrent truncate operations.
> +	 */
> +	ext4_fc_track_inode(handle, inode);
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> +
> +		if (ext4_es_is_delayed(&es)) {

I understand that it is okay for us to rely on extent status ==
delayed here because we never reclaim delayed es entries and hence we
are sure to not skip any delayed block allocations here.

> +			map->m_flags |= EXT4_MAP_DELAYED;
> +			trace_ext4_da_write_pages_extent(inode, map);
> +			/*
> +			 * Call ext4_map_create_blocks() to allocate any
> +			 * delayed allocation blocks. It is possible that
> +			 * we're going to need more metadata blocks, however
> +			 * we must not fail because we're in writeback and
> +			 * there is nothing we can do so it might result in
> +			 * data loss. So use reserved blocks to allocate
> +			 * metadata if possible.
> +			 */
> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> +				    EXT4_EX_NOCACHE;
> +
> +			retval = ext4_map_create_blocks(handle, inode, map,
> +							map_flags);
> +			if (retval > 0)
> +				ext4_fc_track_range(handle, inode, map->m_lblk,
> +						map->m_lblk + map->m_len - 1);
> +			goto out;
> +		} else if (unlikely(ext4_es_is_hole(&es)))

Now that you've fixed the partial invalidate in iomap (patch 12/23)
can we still hit this hole case? 

> +			goto out;
> +
> +		/* Found written or unwritten extent. */
> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> +		map->m_flags = ext4_es_is_written(&es) ?
> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> +		goto out;
> +	}
> +
> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> +out:
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_journal_stop(handle);
> +	return retval < 0 ? retval : 0;
> +}
> +
> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> +					  loff_t offset, unsigned int dirty_len)
> +{
> +	struct inode *inode = wpc->inode;
> +	struct super_block *sb = inode->i_sb;
> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> +	struct ext4_map_blocks map;
> +	unsigned int blkbits = inode->i_blkbits;
> +	unsigned int index = offset >> blkbits;
> +	unsigned int blk_end, blk_len;
> +	int ret;
> +
> +	ret = ext4_emergency_state(sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Check validity of the cached writeback mapping. */
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> +	    ext4_iomap_valid(inode, &wpc->iomap))
> +		return 0;
> +
> +	blk_len = dirty_len >> blkbits;
> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> +				      (UINT_MAX - 1));

This is an interesting idea. I'm just a bit worried about the case where
range_end == LLONG_MAX (bg flush): we will always try to allocate
MAX_WRITEPAGES_EXTENT_LEN blocks, and on a slightly fragmented FS we might
keep falling into the slower mballoc criteria and waste a lot of time
scanning the groups.
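
To make the concern concrete, here is a minimal userspace sketch of the
length computation in the hunk above. It is not kernel code: wb_map_len()
is a made-up helper, and the value 2048 for MAX_WRITEPAGES_EXTENT_LEN is
my assumption about the current definition in fs/ext4/inode.c. The
truncating cast mimics what min_t(unsigned int, ...) does to range_end.

```c
#include <assert.h>
#include <limits.h>

/* Assumed value of MAX_WRITEPAGES_EXTENT_LEN in fs/ext4/inode.c. */
#define MAX_WRITEPAGES_EXTENT_LEN 2048u

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: returns the m_len that the quoted code would
 * request for a dirty range starting at 'offset'.
 */
static unsigned int wb_map_len(long long offset, unsigned int dirty_len,
			       long long range_end, unsigned int blkbits)
{
	assert(blkbits < 32);

	unsigned int index = (unsigned int)(offset >> blkbits);
	unsigned int blk_len = dirty_len >> blkbits;
	/* min_t(unsigned int, ...) truncates both operands before comparing. */
	unsigned int blk_end = min_uint((unsigned int)(range_end >> blkbits),
					UINT_MAX - 1);

	/* Extend the request up to the end of the writeback range. */
	if (blk_end > index + blk_len)
		blk_len = blk_end - index + 1;

	return min_uint(MAX_WRITEPAGES_EXTENT_LEN, blk_len);
}
```

With 4k blocks, a one-block dirty range and range_end == LLONG_MAX, this
asks for the full 2048 blocks, whereas a bounded range_end only asks for
the blocks up to that end.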

> +	if (blk_end > index + blk_len)
> +		blk_len = blk_end - index + 1;
> +
> +retry:
> +	map.m_lblk = index;
> +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> +	ret = ext4_map_blocks(NULL, inode, &map,
> +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);

Do we really need the IO_SUBMIT flag here, given that:
1. We are not using ordered data.
2. We don't use it in ext4_iomap_map_one_extent() anyway.

I think we can drop it.

> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * The map is not a delalloc extent, it must either be a hole
> +	 * or an extent which have already been allocated.
> +	 */
> +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> +		goto out;
> +
> +	/* Map one delalloc extent. */
> +	ret = ext4_iomap_map_one_extent(inode, &map);
> +	if (ret < 0) {
> +		if (ext4_emergency_state(sb))
> +			return ret;
> +
> +		/*
> +		 * Retry transient ENOSPC errors, if
> +		 * ext4_count_free_blocks() is non-zero, a commit
> +		 * should free up blocks.
> +		 */
> +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> +			jbd2_journal_force_commit_nested(journal);
> +			goto retry;
> +		}
> +
> +		ext4_msg(sb, KERN_CRIT,
> +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> +			 inode->i_ino, (unsigned long long)map.m_lblk,
> +			 (unsigned int)map.m_len, -ret);
> +		ext4_msg(sb, KERN_CRIT,
> +			 "This should not happen!! Data will be lost\n");
> +		if (ret == -ENOSPC)
> +			ext4_print_free_blocks(inode);
> +		return ret;
> +	}
> +out:
> +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> +	return 0;
> +}
> +

<snip>
> 

* Re: [PATCH v4 10/23] ext4: implement mmap path using iomap
  2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
@ 2026-05-27 12:56   ` Ojaswin Mujoo
  2026-06-16 11:56   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 12:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:30PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path
> for ext4. The heavy lifting is delegated to iomap_page_mkwrite(), which
> only requires ext4_iomap_buffered_write_ops and
> ext4_iomap_buffered_da_write_ops to allocate and map blocks.
> 
> Note that the lock ordering between folio lock and transaction start in
> this path is reversed compared to the buffer_head buffered write path.
> The lock ordering documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  8 ++++++--
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a80195bd6f20..c6fe42d012fc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4020,7 +4020,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  		return -ERANGE;
>  	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>  		return -EINVAL;
> -	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +	if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
>  		return -EINVAL;
>  
>  	if (delalloc)
> @@ -4080,6 +4080,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>  		return 0;
>  
> +	/*
> +	 * iomap_page_mkwrite() will never fail in a way that requires delalloc
> +	 * extents that it allocated to be revoked.  Hence never try to release
> +	 * them here.
> +	 */
> +	if (flags & IOMAP_FAULT)
> +		return 0;
> +
>  	/* Nothing to do if we've written the entire delalloc extent */
>  	start_byte = iomap_last_written_block(inode, offset, written);
>  	end_byte = round_up(offset + length, i_blocksize(inode));
> @@ -7191,6 +7199,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>  	return ret;
>  }
>  
> +static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	const struct iomap_ops *iomap_ops;
> +
> +	/*
> +	 * ext4_nonda_switch() could writeback this folio, so have to
> +	 * call it before lock folio.
> +	 */
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_page_mkwrite(vmf, iomap_ops, NULL);
> +}
> +
>  vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -7213,6 +7238,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(mapping);
>  
> +	if (ext4_inode_buffered_iomap(inode)) {
> +		ret = ext4_iomap_page_mkwrite(vmf);
> +		goto out;
> +	}
> +
>  	err = ext4_convert_inline_data(inode);
>  	if (err)
>  		goto out_ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 51d87db53543..62bfe05a64bc 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   * Lock ordering
>   *
>   * page fault path:
> - * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
> - *   -> page lock -> i_data_sem (rw)
> + * - buffer_head path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     transaction start -> folio lock -> i_data_sem (rw)
> + * - iomap path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     folio lock -> transaction start -> i_data_sem (rw)
>   *
>   * buffered write path:
>   * sb_start_write -> i_rwsem (w) -> mmap_lock
> -- 
> 2.52.0
> 

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
  2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
@ 2026-05-27 13:13   ` Ojaswin Mujoo
  2026-06-16 12:28   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 13:13 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> infrastructure for ext4.
> 
> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> out either a mapped partial block or a dirty, unwritten partial block.
> 
> Important constraints:
> 
> Zeroing out under an active journal handle can cause deadlock, because
> the order of acquiring the folio lock and starting a handle is
> inconsistent with the iomap writeback path.
> 
> Therefore, ext4_iomap_block_zero_range():
> - Must NOT be called under an active handle.
> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>   before updating i_disksize (for the cases of post-EOF append write,
>   post-EOF fallocate, and truncate up). In subsequent patches, we will
>   address this by synchronizing commit I/O but doesn't waiting for
>   completion, and updating i_disksize to i_size only after the zeroed
>   data has been written back.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good in itself. Feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6fe42d012fc..e0dae2501292 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_zero_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	unsigned int iomap_flags = 0;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> +		return -EINVAL;
> +
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> +
> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (start >> blkbits) - map.m_lblk;
> +	}
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	iomap->flags |= iomap_flags;
> +
> +	return 0;
> +}
> +
> +static const struct iomap_ops ext4_iomap_zero_ops = {
> +	.iomap_begin = ext4_iomap_zero_begin,
> +};
> +
>  /*
>   * Since we always allocate unwritten extents, there is no need for
>   * iomap_end to clean up allocated blocks on a short write.
> @@ -4616,6 +4661,47 @@ static int ext4_block_journalled_zero_range(struct inode *inode, loff_t from,
>  	return err;
>  }
>  
> +static int ext4_block_iomap_zero_range(struct inode *inode, loff_t from,
> +				       loff_t length, bool *did_zero,
> +				       bool *zero_written)
> +{
> +	int ret;
> +
> +	/*
> +	 * Zeroing out under an active handle can cause deadlock since
> +	 * the order of acquiring the folio lock and starting a handle is
> +	 * inconsistent with the iomap writeback procedure.
> +	 */
> +	if (WARN_ON_ONCE(ext4_handle_valid(journal_current_handle())))
> +		return -EINVAL;
> +
> +	/* The zeroing scope should not extend across a block. */
> +	if (WARN_ON_ONCE((from >> inode->i_blkbits) !=
> +			 ((from + length - 1) >> inode->i_blkbits)))
> +		return -EINVAL;
> +
> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS) &&
> +	    !(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
> +		WARN_ON_ONCE(!inode_is_locked(inode) &&
> +			!rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> +
> +	ret = iomap_zero_range(inode, from, length, did_zero,
> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> +			       NULL);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * TODO: The iomap does not distinguish between different types of
> +	 * zeroing and always sets zero_written if a zeroing operation is
> +	 * performed, which may result in unnecessary order operations.
> +	 */
> +	if (did_zero && zero_written)
> +		*zero_written = *did_zero;
> +
> +	return 0;
> +}
> +
>  /*
>   * Zeros out a mapping of length 'length' starting from file offset
>   * 'from'.  The range to be zero'd must be contained with in one block.
> @@ -4642,6 +4728,9 @@ static int ext4_block_zero_range(struct inode *inode,
>  	} else if (ext4_should_journal_data(inode)) {
>  		return ext4_block_journalled_zero_range(inode, from, length,
>  							did_zero);
> +	} else if (ext4_inode_buffered_iomap(inode)) {
> +		return ext4_block_iomap_zero_range(inode, from, length,
> +						   did_zero, zero_written);
>  	}
>  	return ext4_block_do_zero_range(inode, from, length, did_zero,
>  					zero_written);
> @@ -4682,6 +4771,9 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	 * truncating up or performing an append write, because there might be
>  	 * exposing stale on-disk data which may caused by concurrent post-EOF
>  	 * mmap write during folio writeback.
> +	 *
> +	 * TODO: In the iomap path, handle this by updating i_disksize to
> +	 * i_size after the zeroed data has been written back.
>  	 */
>  	if (ext4_should_order_data(inode) &&
>  	    did_zero && zero_written && !IS_DAX(inode)) {
> -- 
> 2.52.0
> 

* Re: [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
@ 2026-05-27 13:14   ` Ojaswin Mujoo
  2026-06-16 12:29   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:35PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add tracepoints for iomap buffered read, write, partial block zeroing,
> and writeback operations to help debug the iomap buffered I/O path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c             |  6 +++++
>  include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e0dae2501292..239d387ffaf2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3961,6 +3961,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
> +					     flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4034,6 +4036,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
> +					      flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4136,6 +4140,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
>  			map.m_len = (start >> blkbits) - map.m_lblk;
>  	}
>  
> +	trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	iomap->flags |= iomap_flags;
>  
> @@ -4308,6 +4313,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>  		return ret;
>  	}
>  out:
> +	trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
>  	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
>  	return 0;
>  }
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f493642cf121..ebafa06cd191 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
>  		  __entry->ret)
>  );
>  
> +DECLARE_EVENT_CLASS(ext4_set_iomap_class,
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
> +		 loff_t offset, loff_t length, unsigned int flags),
> +	TP_ARGS(inode, map, offset, length, flags),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(u64, ino)
> +		__field(ext4_lblk_t, m_lblk)
> +		__field(unsigned int, m_len)
> +		__field(unsigned int, m_flags)
> +		__field(u64, m_seq)
> +		__field(loff_t, offset)
> +		__field(loff_t, length)
> +		__field(unsigned int, iomap_flags)
> +	),
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->m_lblk		= map->m_lblk;
> +		__entry->m_len		= map->m_len;
> +		__entry->m_flags	= map->m_flags;
> +		__entry->m_seq		= map->m_seq;
> +		__entry->offset		= offset;
> +		__entry->length		= length;
> +		__entry->iomap_flags	= flags;
> +
> +	),
> +	TP_printk("dev %d:%d ino %llu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ino, __entry->m_lblk, __entry->m_len,
> +		  show_mflags(__entry->m_flags), __entry->m_seq,
> +		  __entry->offset, __entry->length, __entry->iomap_flags)
> +)
> +
> +#define DEFINE_SET_IOMAP_EVENT(name) \
> +DEFINE_EVENT(ext4_set_iomap_class, name, \
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
> +		 loff_t offset, loff_t length, unsigned int flags), \
> +	TP_ARGS(inode, map, offset, length, flags))
> +
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
> +
>  #endif /* _TRACE_EXT4_H */
>  
>  /* This part must be outside protection */
> -- 
> 2.52.0
> 

* Re: [PATCH v4 16/23] ext4: disable online defrag when inode using iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
@ 2026-05-27 13:14   ` Ojaswin Mujoo
  2026-06-16 12:30   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Online defragmentation does not currently support inodes using the
> iomap buffered I/O path. The existing implementation relies on
> buffer_head for sub-folio block management and data=ordered mode for
> data consistency, both of which are incompatible with the iomap path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/move_extent.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 3329b7ad5dbd..f707a1096544 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
>  		return -EOPNOTSUPP;
>  	}
>  
> +	/*
> +	 * TODO: support online defrag for inodes that using the buffered
> +	 * I/O iomap path.
> +	 */
> +	if (ext4_inode_buffered_iomap(orig_inode) ||
> +	    ext4_inode_buffered_iomap(donor_inode)) {
> +		ext4_msg(sb, KERN_ERR,
> +			 "Online defrag not supported for inode with iomap buffered IO path");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
>  		ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %llu, donor %llu]\n",
>  			   orig_inode->i_ino, donor_inode->i_ino);
> -- 
> 2.52.0
> 

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
@ 2026-05-27 13:41   ` Ojaswin Mujoo
  2026-05-30  2:53     ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 13:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> In the generic buffered_head I/O path, we rely on the data=order mode to
> ensure that the zeroed EOF block data is written before updating
> i_disksize, thus preventing stale data from being exposed.
> 
> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> we issue the I/O immediately after performing the zero operation
> (without synchronous waiting for performance). This can reduce the risk
> of exposing stale data, but it does not guarantee that the zero data
> will be flushed to disk before the metadata of i_disksize is updated.
> The subsequent patches will wait for this I/O to complete before
> updating i_disksize.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

I think we discussed that we may not need to do this [1], but I guess
you've decided to make the tradeoff of issuing the IO to avoid having to
wait for bg flush to complete the tail page zeroing.

However, one side effect might be that many threads end up calling into
the writeback mechanism to issue zero IOs, which might not scale well. I
don't know if it'll be a huge problem though; I guess it's the sort of
thing we will have to deal with if we see it in real world workloads.

[1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/

> ---
>  fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 55 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 239d387ffaf2..e013aeb03d7b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>  					zero_written);
>  }
>  
> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> +					loff_t from, loff_t end)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	struct folio *folio;
> +	bool do_submit = false;
> +
> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		/* Already writeback and clear? */
> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> +
> +	folio_wait_writeback(folio);
> +	WARN_ON_ONCE(folio_test_writeback(folio));
> +
> +	if (likely(folio_test_dirty(folio)))
> +		do_submit = true;
> +	folio_unlock(folio);
> +	folio_put(folio);
> +
> +	/* Submit zeroed block. */
> +	if (do_submit)
> +		return filemap_fdatawrite_range(mapping, from, end - 1);
> +	return 0;
> +}
> +
>  /*
>   * Zero out a mapping from file offset 'from' up to the end of the block
>   * which corresponds to 'from' or to the given 'end' inside this block.
> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>  		return 0;
>  
> -	if (length > blocksize - offset)
> +	if (length > blocksize - offset) {
>  		length = blocksize - offset;
> +		end = from + length;
> +	}
>  
>  	err = ext4_block_zero_range(inode, from, length,
>  				    &did_zero, &zero_written);
> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	 * TODO: In the iomap path, handle this by updating i_disksize to
>  	 * i_size after the zeroed data has been written back.
>  	 */
> -	if (ext4_should_order_data(inode) &&
> -	    did_zero && zero_written && !IS_DAX(inode)) {
> -		handle_t *handle;
> +	if (did_zero && zero_written && !IS_DAX(inode)) {
> +		if (ext4_should_order_data(inode)) {
> +			handle_t *handle;
>  
> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> -		if (IS_ERR(handle))
> -			return PTR_ERR(handle);
> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> +			if (IS_ERR(handle))
> +				return PTR_ERR(handle);
>  
> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> -		ext4_journal_stop(handle);
> -		if (err)
> -			return err;
> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> +							length);
> +			ext4_journal_stop(handle);
> +			if (err)
> +				return err;
> +		/*
> +		 * inodes using the iomap buffered I/O path do not use the
> +		 * data=ordered mode. We submit zeroed range directly here.
> +		 * Do not wait for I/O completion for performance.
> +		 *
> +		 * TODO: Any operation that extends i_disksize (including
> +		 * append write end io past the zeroed boundary, truncate up,
> +		 * and append fallocate) must wait for the relevant I/O to
> +		 * complete before updating i_disksize.
> +		 */
> +		} else if (ext4_inode_buffered_iomap(inode)) {
> +			err = ext4_iomap_submit_zero_block(inode, from, end);
> +			if (err)
> +				return err;
> +		}
>  	}
>  
>  	return 0;
> -- 
> 2.52.0
> 

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
@ 2026-05-27 15:58   ` Ojaswin Mujoo
  2026-05-28 13:34     ` Ojaswin Mujoo
  2026-05-30  7:22     ` Zhang Yi
  0 siblings, 2 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-27 15:58 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> For append writes, wait for ordered I/O to complete before updating
> i_disksize. This ensures that zeroed data is flushed to disk before the
> metadata update, preventing stale data from being exposed during
> unaligned post-EOF append writes.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h    | 11 +++++++
>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>  fs/ext4/super.c   | 23 ++++++++++----
>  4 files changed, 161 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 078feda47e36..9ce2128eea3e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>  #ifdef CONFIG_FS_ENCRYPTION
>  	struct fscrypt_inode_info *i_crypt_info;
>  #endif
> +
> +	/*
> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> +	 * and truncate-up operations. These parameters are used only in the
> +	 * iomap buffered I/O path.
> +	 */
> +	ext4_lblk_t i_ordered_lblk;
> +	ext4_lblk_t i_ordered_len;
> +	wait_queue_head_t i_ordered_wq;
>  };
>  
>  /*
> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>  			     __u64 len, __u64 *moved_len);
>  
>  /* page-io.c */
> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> +
>  extern int __init ext4_init_pageio(void);
>  extern void ext4_exit_pageio(void);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e013aeb03d7b..11fb369efeb1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_ioend *ioend = wpc->wb_ctx;
>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> +	ext4_lblk_t start, end, order_lblk, order_len;
>  
>  	/*
>  	 * After I/O completion, a worker needs to be scheduled when:
> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>  
> +	/*
> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> +	 * handling and must not be merged with regular I/O operations.
> +	 */
> +	order_len = READ_ONCE(ei->i_ordered_len);
> +	if (order_len) {
> +		/*
> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> +		 * Ensure we see the updated i_ordered_lblk that was written
> +		 * before the release store to i_ordered_len.
> +		 */
> +		smp_rmb();
> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> +				     ioend->io_offset + ioend->io_size);
> +
> +		if (start <= order_lblk && end >= order_lblk + order_len) {

Hi Zhang,

I guess this check is enough because the range described by i_ordered_lblk
and i_ordered_len will always be contained in a single block.
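
A minimal userspace sketch of that containment check, for illustration
only. ioend_covers_ordered() and bytes_to_lblk_up() are made-up names,
and I'm assuming EXT4_B_TO_LBLK() rounds a byte count up to the next
block boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round a byte count up to logical blocks (assumed EXT4_B_TO_LBLK() behavior). */
static uint32_t bytes_to_lblk_up(uint64_t bytes, unsigned int blkbits)
{
	return (uint32_t)((bytes + (1ULL << blkbits) - 1) >> blkbits);
}

/*
 * Hypothetical helper: true if the ioend [io_offset, io_offset + io_size)
 * covers the ordered block range [order_lblk, order_lblk + order_len).
 * The start is rounded down and the end is rounded up to block units.
 */
static bool ioend_covers_ordered(uint64_t io_offset, uint64_t io_size,
				 unsigned int blkbits,
				 uint32_t order_lblk, uint32_t order_len)
{
	assert(blkbits < 32);

	uint32_t start = (uint32_t)(io_offset >> blkbits);
	uint32_t end = bytes_to_lblk_up(io_offset + io_size, blkbits);

	return start <= order_lblk && end >= order_lblk + order_len;
}
```

Since order_len is always 1 here, any ioend that touches the ordered
block passes the check, even one that covers only part of that block.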

> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;

FWIU, we want the ordered IO to not be merged and to be submitted asap,
since we want to wake up the waiters. Is there any other reason?

Adding the boundary in ->writeback_submit() only affects
iomap_ioend_can_merge(), which happens after we have woken up the waiters
and deferred the IO to the wq. We ideally want it to affect
iomap_can_add_to_ioend(), i.e. we need to add IOMAP_F_BOUNDARY in
->writeback_range().

Secondly, I don't think boundary is the right flag here. It ensures
that everything before the ordered iomap gets submitted and that the
ordered iomap starts a new ioend. That ioend can still keep getting merged
with newer ioends until we decide to submit the IO, which can delay waking
up the waiters. If we really want the "no merge" behavior, we'll have to
do something like [1] (check the 2 NOMERGE flag patches).
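
To illustrate the difference, here is a toy userspace model of the merge
rule as I described it above. It is not the iomap code: struct toy_ioend,
toy_can_merge() and both TOY_* flags are made up, and TOY_IOEND_NOMERGE
only stands in for the kind of flag the NOMERGE patches would add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_IOEND_BOUNDARY (1u << 0) /* models the boundary flag */
#define TOY_IOEND_NOMERGE  (1u << 1) /* hypothetical "never merge" flag */

struct toy_ioend {
	uint64_t offset;
	uint64_t size;
	unsigned int flags;
};

/* Can 'next' be appended to 'ioend'? */
static bool toy_can_merge(const struct toy_ioend *ioend,
			  const struct toy_ioend *next)
{
	assert(ioend && next);

	/* A boundary only stops 'next' from joining its predecessor. */
	if (next->flags & TOY_IOEND_BOUNDARY)
		return false;
	/* A no-merge flag on either side blocks merging in both directions. */
	if ((ioend->flags | next->flags) & TOY_IOEND_NOMERGE)
		return false;
	/* Only byte-contiguous ioends are merge candidates. */
	return ioend->offset + ioend->size == next->offset;
}
```

In this model, an ordered ioend marked only with the boundary flag refuses
to merge into its predecessor but still accepts its successor, while the
hypothetical no-merge flag isolates it on both sides.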

> +		}
> +	}
> +
>  	return iomap_ioend_writeback_submit(wpc, error);
>  }
>  
> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>  					loff_t from, loff_t end)
>  {
>  	struct address_space *mapping = inode->i_mapping;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct folio *folio;
>  	bool do_submit = false;
> +	int ret;
>  
>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>  	if (IS_ERR(folio))
> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>  	folio_wait_writeback(folio);
>  	WARN_ON_ONCE(folio_test_writeback(folio));
>  
> -	if (likely(folio_test_dirty(folio)))
> +	/*
> +	 * Mark the ordered range. It will be cleared upon I/O completion
> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> +	 * (including append write end io past the zeroed boundary,
> +	 * truncate up and append fallocate) must wait for this I/O to
> +	 * complete before updating i_disksize.
> +	 *
> +	 * When multiple overlapping unaligned EOF writes are in flight, we
> +	 * only need to track and wait for the first one. Subsequent writes
> +	 * will zero the gap in memory and ensure that the zeroed data is
> +	 * written out along with the valid data in the same block before
> +	 * i_disksize is updated.
> +	 */
> +	if (likely(folio_test_dirty(folio) &&
> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
> +		WRITE_ONCE(ei->i_ordered_lblk,
> +			   from >> inode->i_blkbits);
> +		/*
> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> +		 * i_ordered_lblk is visible when i_ordered_len becomes
> +		 * non-zero.
> +		 */
> +		smp_store_release(&ei->i_ordered_len, 1);
>  		do_submit = true;
> +	}
>  	folio_unlock(folio);
>  	folio_put(folio);
>  
>  	/* Submit zeroed block. */
> -	if (do_submit)
> -		return filemap_fdatawrite_range(mapping, from, end - 1);
> +	if (do_submit) {
> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
> +		if (ret) {
> +			/*
> +			 * Pairs with wait_event() in
> +			 * ext4_iomap_wb_ordered_wait(). Ensure
> +			 * i_ordered_len = 0 is visible before waking up
> +			 * waiters.
> +			 */
> +			smp_store_release(&ei->i_ordered_len, 0);
> +			wake_up_all(&ei->i_ordered_wq);
> +			return ret;
> +		}
> +	}
>  	return 0;
>  }
>  
> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  		 * data=ordered mode. We submit zeroed range directly here.
>  		 * Do not wait for I/O completion for performance.
>  		 *
> -		 * TODO: Any operation that extends i_disksize (including
> -		 * append write end io past the zeroed boundary, truncate up,
> -		 * and append fallocate) must wait for the relevant I/O to
> -		 * complete before updating i_disksize.
> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> +		 * for I/O completion before updating i_disksize if the write
> +		 * extends beyond the zeroed boundary.
> +		 *
> +		 * TODO: Any other operation that extends i_disksize
> +		 * (including truncate up and append fallocate) must wait for
> +		 * the relevant I/O to complete before updating i_disksize.
>  		 */
>  		} else if (ext4_inode_buffered_iomap(inode)) {
>  			err = ext4_iomap_submit_zero_block(inode, from, end);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 3050c887329f..ad05ebb49bf6 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
>  	return 0;
>  }
>  
> +/*
> + * If the old disk size is not block size aligned and the current
> + * writeback range is entirely beyond the old EOF block, we should
> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> + * written out, otherwise, it may expose stale data in that block.
> + */
> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> +				       loff_t pos, loff_t end)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	unsigned int blocksize = i_blocksize(inode);
> +	loff_t disksize = READ_ONCE(ei->i_disksize);
> +	ext4_lblk_t order_lblk, order_len;
> +
> +	/*
> +	 * Waiting for ordered I/O is unnecessary when:
> +	 * - The on-disk size is block-aligned (no stale data exists).
> +	 * - The write start is within the block of the old EOF
> +	 *   (overwriting, or appending to a block that already contains
> +	 *   valid data).
> +	 */
> +	if (!(disksize & (blocksize - 1)) ||
> +	    pos < round_up(disksize, blocksize))
> +		return;
> +
> +	order_len = READ_ONCE(ei->i_ordered_len);
> +	if (!order_len)
> +		return;
> +
> +	/*
> +	 * Pair with smp_store_release() in ext4_iomap_end_bio() and
> +	 * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
> +	 * that was written before the release store to i_ordered_len.
> +	 */
> +	smp_rmb();
> +	order_lblk = READ_ONCE(ei->i_ordered_lblk);
> +	if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
> +		wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
> +}
> +
>  static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
>  					 loff_t end)
>  {
> @@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
>  		goto out;
>  	}
>  
> +	/* Wait ordered zero data to be written out. */
> +	ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
> +
>  	/* We may need to convert one extent and dirty the inode. */
>  	credits = ext4_chunk_trans_blocks(inode,
>  			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
> @@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
>  	struct inode *inode = ioend->io_inode;
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	unsigned long io_mode = (unsigned long)ioend->io_private;
>  	unsigned long flags;
>  
> +	/*
> +	 * This is an ordered I/O, clear the ordered range set in
> +	 * ext4_block_zero_eof() and wake up all waiters that will update
> +	 * the inode i_disksize.
> +	 */
> +	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
> +		/*
> +		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
> +		 * Ensure i_ordered_len = 0 is visible before waking up
> +		 * waiters.
> +		 */
> +		smp_store_release(&ei->i_ordered_len, 0);
> +		wake_up_all(&ei->i_ordered_wq);
> +		goto defer;
> +	}
> +
>  	/* Needs to convert unwritten extents or update the i_disksize. */
>  	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
>  	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 62bfe05a64bc..9c0a00e716f3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	ext4_fc_init_inode(&ei->vfs_inode);
>  	spin_lock_init(&ei->i_fc_lock);
>  	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
> +	ei->i_ordered_lblk = 0;
> +	ei->i_ordered_len = 0;
> +	init_waitqueue_head(&ei->i_ordered_wq);
>  	return &ei->vfs_inode;
>  }
>  
> @@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
>  		dump_stack();
>  	}
>  
> -	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
> -	    WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> -		ext4_msg(inode->i_sb, KERN_ERR,
> -			 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> -			 inode->i_ino, EXT4_I(inode),
> -			 EXT4_I(inode)->i_reserved_data_blocks);
> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> +			ext4_msg(inode->i_sb, KERN_ERR,
> +				 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> +				 inode->i_ino, EXT4_I(inode),
> +				 EXT4_I(inode)->i_reserved_data_blocks);
> +
> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
> +			ext4_msg(inode->i_sb, KERN_ERR,
> +				 "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
> +				 inode->i_ino, EXT4_I(inode),
> +				 EXT4_I(inode)->i_ordered_lblk,
> +				 EXT4_I(inode)->i_ordered_len);
> +	}
>  }
>  
>  static void ext4_shutdown(struct super_block *sb)
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-27 15:58   ` Ojaswin Mujoo
@ 2026-05-28 13:34     ` Ojaswin Mujoo
  2026-05-30  9:32       ` Zhang Yi
  2026-05-30  7:22     ` Zhang Yi
  1 sibling, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-28 13:34 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > For append writes, wait for ordered I/O to complete before updating
> > i_disksize. This ensures that zeroed data is flushed to disk before the
> > metadata update, preventing stale data from being exposed during
> > unaligned post-EOF append writes.
> > 
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/ext4/ext4.h    | 11 +++++++
> >  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >  fs/ext4/super.c   | 23 ++++++++++----
> >  4 files changed, 161 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 078feda47e36..9ce2128eea3e 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> >  #ifdef CONFIG_FS_ENCRYPTION
> >  	struct fscrypt_inode_info *i_crypt_info;
> >  #endif
> > +
> > +	/*
> > +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> > +	 * and truncate-up operations. These parameters are used only in the
> > +	 * iomap buffered I/O path.
> > +	 */
> > +	ext4_lblk_t i_ordered_lblk;
> > +	ext4_lblk_t i_ordered_len;
> > +	wait_queue_head_t i_ordered_wq;
> >  };
> >  
> >  /*
> > @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >  			     __u64 len, __u64 *moved_len);
> >  
> >  /* page-io.c */
> > +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> > +
> >  extern int __init ext4_init_pageio(void);
> >  extern void ext4_exit_pageio(void);
> >  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e013aeb03d7b..11fb369efeb1 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >  {
> >  	struct iomap_ioend *ioend = wpc->wb_ctx;
> >  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> > +	ext4_lblk_t start, end, order_lblk, order_len;
> >  
> >  	/*
> >  	 * After I/O completion, a worker needs to be scheduled when:
> > @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> >  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >  
> > +	/*
> > +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> > +	 * handling and must not be merged with regular I/O operations.
> > +	 */
> > +	order_len = READ_ONCE(ei->i_ordered_len);
> > +	if (order_len) {
> > +		/*
> > +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> > +		 * Ensure we see the updated i_ordered_lblk that was written
> > +		 * before the release store to i_ordered_len.
> > +		 */
> > +		smp_rmb();
> > +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> > +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> > +				     ioend->io_offset + ioend->io_size);
> > +
> > +		if (start <= order_lblk && end >= order_lblk + order_len) {
> 
> Hi Zhang,
> 
> I guess this check is enough cause ordered_lblk and ordered_len will
> always be  contained in a single block.
> 
> > +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> > +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> 
> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> since we want to wake up the waiters. Is there any other reason?
> 
> Adding the boundary in ->writeback_submit() only affects
> iomap_ioend_can_merge() which happens after we have woken up the waiters
> and deferred the IO to the wq. We ideally want it affect
> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> ->writeback_range().
> 
> Secondly, I don't think boundary is the right flag here. It ensures
> that everything before the ordered iomap gets submitted and the ordered
> iomap starts a new ioend. This can still keep getting merged with the
> newer ioends untils we decide to submit the IO, which can delay waking
> up the waiters. If we really want the "no merge" behavior, we'll have to
> do something like [1] (Check the 2 NOMERGE flag patches).

Hi Zhang, forgot to add this.

[1] https://github.com/OjaswinM/linux/commits/iomap-buffered-atomic-rfc2.3/

Also some more comments below:
> 
> > +		}
> > +	}
> > +
> >  	return iomap_ioend_writeback_submit(wpc, error);
> >  }
> >  
> > @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >  					loff_t from, loff_t end)
> >  {
> >  	struct address_space *mapping = inode->i_mapping;
> > +	struct ext4_inode_info *ei = EXT4_I(inode);
> >  	struct folio *folio;
> >  	bool do_submit = false;
> > +	int ret;
> >  
> >  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >  	if (IS_ERR(folio))
> > @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >  	folio_wait_writeback(folio);
> >  	WARN_ON_ONCE(folio_test_writeback(folio));
> >  
> > -	if (likely(folio_test_dirty(folio)))
> > +	/*
> > +	 * Mark the ordered range. It will be cleared upon I/O completion
> > +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> > +	 * (including append write end io past the zeroed boundary,
> > +	 * truncate up and append fallocate) must wait for this I/O to
> > +	 * complete before updating i_disksize.
> > +	 *
> > +	 * When multiple overlapping unaligned EOF writes are in flight, we
> > +	 * only need to track and wait for the first one. Subsequent writes
> > +	 * will zero the gap in memory and ensure that the zeroed data is
> > +	 * written out along with the valid data in the same block before
> > +	 * i_disksize is updated.
> > +	 */
> > +	if (likely(folio_test_dirty(folio) &&
> > +		   READ_ONCE(ei->i_ordered_len) == 0)) {
> > +		WRITE_ONCE(ei->i_ordered_lblk,
> > +			   from >> inode->i_blkbits);
> > +		/*
> > +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> > +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> > +		 * i_ordered_lblk is visible when i_ordered_len becomes
> > +		 * non-zero.
> > +		 */
> > +		smp_store_release(&ei->i_ordered_len, 1);
> >  		do_submit = true;
> > +	}
> >  	folio_unlock(folio);
> >  	folio_put(folio);
> >  
> >  	/* Submit zeroed block. */
> > -	if (do_submit)
> > -		return filemap_fdatawrite_range(mapping, from, end - 1);
> > +	if (do_submit) {
> > +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
> > +		if (ret) {
> > +			/*
> > +			 * Pairs with wait_event() in
> > +			 * ext4_iomap_wb_ordered_wait(). Ensure
> > +			 * i_ordered_len = 0 is visible before waking up
> > +			 * waiters.
> > +			 */
> > +			smp_store_release(&ei->i_ordered_len, 0);
> > +			wake_up_all(&ei->i_ordered_wq);
> > +			return ret;

Okay, so even if the ordered IO fails we still let the i_disksize updates
go ahead? I think this is a deviation from the current behavior, where we
abort the journal. If this is acceptable, we should at least add a comment
explaining why it's okay.

> > +		}
> > +	}
> >  	return 0;
> >  }
> >  
> > @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >  		 * data=ordered mode. We submit zeroed range directly here.
> >  		 * Do not wait for I/O completion for performance.
> >  		 *
> > -		 * TODO: Any operation that extends i_disksize (including
> > -		 * append write end io past the zeroed boundary, truncate up,
> > -		 * and append fallocate) must wait for the relevant I/O to
> > -		 * complete before updating i_disksize.
> > +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> > +		 * for I/O completion before updating i_disksize if the write
> > +		 * extends beyond the zeroed boundary.
> > +		 *
> > +		 * TODO: Any other operation that extends i_disksize
> > +		 * (including truncate up and append fallocate) must wait for
> > +		 * the relevant I/O to complete before updating i_disksize.
> >  		 */
> >  		} else if (ext4_inode_buffered_iomap(inode)) {
> >  			err = ext4_iomap_submit_zero_block(inode, from, end);
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index 3050c887329f..ad05ebb49bf6 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
> >  	return 0;
> >  }
> >  
> > +/*
> > + * If the old disk size is not block size aligned and the current
> > + * writeback range is entirely beyond the old EOF block, we should
> > + * wait for the zeroed data written in ext4_block_zero_eof() to be
> > + * written out, otherwise, it may expose stale data in that block.
> > + */
> > +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> > +				       loff_t pos, loff_t end)
> > +{
> > +	struct ext4_inode_info *ei = EXT4_I(inode);
> > +	unsigned int blocksize = i_blocksize(inode);
> > +	loff_t disksize = READ_ONCE(ei->i_disksize);
> > +	ext4_lblk_t order_lblk, order_len;
> > +
> > +	/*
> > +	 * Waiting for ordered I/O is unnecessary when:
> > +	 * - The on-disk size is block-aligned (no stale data exists).
> > +	 * - The write start is within the block of the old EOF
> > +	 *   (overwriting, or appending to a block that already contains
> > +	 *   valid data).
> > +	 */
> > +	if (!(disksize & (blocksize - 1)) ||
> > +	    pos < round_up(disksize, blocksize))
> > +		return;

Okay, these checks are pretty confusing. I was initially thinking that
i_disksize's block would always be equal to i_ordered_lblk, but it seems
that is not true because ext4_block_zero_eof() uses from=i_size.

So we could have a sequence where

1. truncate 4k (i_disksize = i_size = 4k)
2. write 8k,10k (i_disksize = 4k, i_size = 10k, i_ordered_len = 0 (old i_size is block aligned))
3. write 16k,18k (i_disksize = 4k, i_size = 10k, i_ordered_len = 1, lblk=4)

Here we issue ordered IO even though it's probably not needed. Now if
write 3 finishes first, we see disksize as 4k so we don't wait for the
ordered write. That seems okay since we don't risk any stale data
exposure. However, this flow is pretty confusing.

Can't we somehow avoid issuing the ordered IO and setting the ordered
len/lblk when it is not really needed, e.g. only do so if i_disksize (and
not i_size) is unaligned? That could simplify some of our checks and avoid
extra IO overhead.
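
To make the flow above easier to follow, here is a rough userspace model of
the skip/wait decision in ext4_iomap_wb_ordered_wait() as I read it from the
patch. It is only a sketch: struct model_inode and the model_* helpers are
made-up names for illustration, the fields stand in for i_disksize,
i_ordered_lblk and i_ordered_len, and the locking, the barriers and the
actual wait_event() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the fields used by the check. */
struct model_inode {
	uint64_t disksize;	/* models ei->i_disksize, in bytes */
	uint32_t ordered_lblk;	/* models ei->i_ordered_lblk */
	uint32_t ordered_len;	/* models ei->i_ordered_len (0 or 1) */
	unsigned int blkbits;	/* log2 of the block size */
};

/* Round x up to a multiple of a power-of-two block size. */
static uint64_t model_round_up(uint64_t x, uint64_t blocksize)
{
	return (x + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Returns true where the patch would call wait_event(), false where it
 * returns early or falls through without waiting.
 */
static bool model_needs_ordered_wait(const struct model_inode *mi,
				     uint64_t pos)
{
	uint64_t blocksize;

	assert(mi->blkbits < 32);
	blocksize = 1ULL << mi->blkbits;

	/* Aligned disksize, or write starts within the old EOF block. */
	if (!(mi->disksize & (blocksize - 1)) ||
	    pos < model_round_up(mi->disksize, blocksize))
		return false;

	/* No ordered range is being tracked. */
	if (!mi->ordered_len)
		return false;

	/* Wait only if the write starts past the ordered range. */
	return (pos >> mi->blkbits) >=
	       (uint64_t)mi->ordered_lblk + mi->ordered_len;
}
```

With 4k blocks and the state from step 3 (disksize = 4k, ordered_len = 1),
model_needs_ordered_wait() returns false for pos = 16k because disksize is
block aligned, so the ordered range that was set is never consulted.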

> > +
> > +	order_len = READ_ONCE(ei->i_ordered_len);
> > +	if (!order_len)
> > +		return;
> > +
> > +	/*
> > +	 * Pair with smp_store_release() in ext4_iomap_end_bio() and
> > +	 * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
> > +	 * that was written before the release store to i_ordered_len.
> > +	 */
> > +	smp_rmb();
> > +	order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > +	if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
> > +		wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
> > +}
> > +
> >  static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
> >  					 loff_t end)
> >  {
> > @@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
> >  		goto out;
> >  	}
> >  
> > +	/* Wait ordered zero data to be written out. */
> > +	ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
> > +
> >  	/* We may need to convert one extent and dirty the inode. */
> >  	credits = ext4_chunk_trans_blocks(inode,
> >  			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
> > @@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
> >  	struct inode *inode = ioend->io_inode;
> >  	struct ext4_inode_info *ei = EXT4_I(inode);
> >  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > +	unsigned long io_mode = (unsigned long)ioend->io_private;
> >  	unsigned long flags;
> >  
> > +	/*
> > +	 * This is an ordered I/O, clear the ordered range set in
> > +	 * ext4_block_zero_eof() and wake up all waiters that will update
> > +	 * the inode i_disksize.
> > +	 */
> > +	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
> > +		/*
> > +		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
> > +		 * Ensure i_ordered_len = 0 is visible before waking up
> > +		 * waiters.
> > +		 */
> > +		smp_store_release(&ei->i_ordered_len, 0);
> > +		wake_up_all(&ei->i_ordered_wq);
> > +		goto defer;
> > +	}
> > +
> >  	/* Needs to convert unwritten extents or update the i_disksize. */
> >  	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
> >  	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 62bfe05a64bc..9c0a00e716f3 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> >  	ext4_fc_init_inode(&ei->vfs_inode);
> >  	spin_lock_init(&ei->i_fc_lock);
> >  	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
> > +	ei->i_ordered_lblk = 0;
> > +	ei->i_ordered_len = 0;
> > +	init_waitqueue_head(&ei->i_ordered_wq);
> >  	return &ei->vfs_inode;
> >  }
> >  
> > @@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
> >  		dump_stack();
> >  	}
> >  
> > -	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
> > -	    WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> > -		ext4_msg(inode->i_sb, KERN_ERR,
> > -			 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> > -			 inode->i_ino, EXT4_I(inode),
> > -			 EXT4_I(inode)->i_reserved_data_blocks);
> > +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
> > +		if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> > +			ext4_msg(inode->i_sb, KERN_ERR,
> > +				 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> > +				 inode->i_ino, EXT4_I(inode),
> > +				 EXT4_I(inode)->i_reserved_data_blocks);
> > +
> > +		if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
> > +			ext4_msg(inode->i_sb, KERN_ERR,
> > +				 "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
> > +				 inode->i_ino, EXT4_I(inode),
> > +				 EXT4_I(inode)->i_ordered_lblk,
> > +				 EXT4_I(inode)->i_ordered_len);
> > +	}
> >  }
> >  
> >  static void ext4_shutdown(struct super_block *sb)
> > -- 
> > 2.52.0
> > 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-26 17:10   ` Ojaswin Mujoo
@ 2026-05-28 15:40     ` Darrick J. Wong
  2026-06-02  7:02       ` Ojaswin Mujoo
  2026-05-29  9:13     ` Zhang Yi
  1 sibling, 1 reply; 85+ messages in thread
From: Darrick J. Wong @ 2026-05-28 15:40 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Tue, May 26, 2026 at 10:40:30PM +0530, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Introduce two new iomap_ops instances for ext4 buffered writes:
> > 
> >  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >    ext4_da_map_blocks() to map delalloc extents.
> >  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >    ext4_iomap_get_blocks() to directly allocate blocks.
> > 
> > Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> > validity.
> > 
> > Key changes and considerations:
> > 
> >  - Unwritten extents for new blocks (dioread_nolock always on)
> >    Since data=ordered mode is not used to prevent stale data exposure in
> >    the non-delayed allocation path, new blocks are always allocated as
> >    unwritten extents.
> 
> Okay makes sense.
> 
> > 
> >  - Short write and write failure handling
> >    a. Delalloc path: On short write or failure, the stale delalloc range
> >       must be dropped and its space reservation released. Otherwise, a
> >       clean folio may cover leftover delalloc extents, causing
> >       inaccurate space reservation accounting.
> 
> Hmm, okay so in the usual buffer head path, seems like during a short
> write we still zero the new buffers we couldn't write and keep it dirty
> (folio_zero_new_buffers()). This way they are still written back and
> the delalloc reservations are used up.
> 
> However in iomap we don't mark the range that we couldnt write as dirty
> so we need to make sure we clear up the stale delalloc mappings. Is this
> correct?

Yes, that's true of iomap's pagecache handling.
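
For reference, the byte range that becomes a candidate for punching after a
short write can be modeled roughly as below. This is a userspace sketch
based on my reading of ext4_iomap_buffered_da_write_end() in this patch and
of iomap_last_written_block(); struct model_range and model_punch_range()
are hypothetical names, and the sketch ignores the pagecache scan that, as
I understand it, iomap_write_delalloc_release() performs to skip ranges
that still hold dirty data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open byte range [start, end). */
struct model_range {
	int64_t start;
	int64_t end;
};

/* Rounding helpers for a power-of-two block size. */
static int64_t model_round_down(int64_t x, int64_t blocksize)
{
	return x & ~(blocksize - 1);
}

static int64_t model_round_up(int64_t x, int64_t blocksize)
{
	return (x + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Range whose delalloc reservation is a candidate for release after
 * 'written' bytes out of 'length' were copied at 'offset'.
 * start >= end means there is nothing to punch.
 */
static struct model_range model_punch_range(int64_t offset, int64_t length,
					    int64_t written,
					    int64_t blocksize)
{
	struct model_range r;

	assert(blocksize > 0 && (blocksize & (blocksize - 1)) == 0);
	assert(written >= 0 && written <= length);

	/* Nothing written: keep nothing, start at the block of offset. */
	if (!written)
		r.start = model_round_down(offset, blocksize);
	else
		r.start = model_round_up(offset + written, blocksize);
	r.end = model_round_up(offset + length, blocksize);
	return r;
}
```

For example, with 4k blocks a write of 8192 bytes at offset 0 that only
copies 1000 bytes leaves the first block alone (it holds dirty data) and
makes the second block, bytes 4096 to 8192, the candidate for punching.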

--D

> Regards,
> Ojaswin
> 
> >    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >       short write.
> > 
> >  - Lock ordering reversal
> >    The folio lock and transaction start ordering is reversed compared to
> >    the buffer_head buffered write path. To handle this, the journal
> >    handle must be stopped in iomap_begin() callbacks. The lock ordering
> >    documentation in super.c has been updated accordingly.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/ext4/ext4.h  |   4 ++
> >  fs/ext4/file.c  |  20 +++++-
> >  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  fs/ext4/super.c |  10 ++-
> >  4 files changed, 192 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 1e27d73d7427..4832e7f7db82 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >  				struct buffer_head *bh);
> >  void ext4_set_inode_mapping_order(struct inode *inode);
> > +int ext4_nonda_switch(struct super_block *sb);
> >  #define FALL_BACK_TO_NONDELALLOC 1
> >  #define CONVERT_INLINE_DATA	 2
> >  
> > @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> >  
> >  extern const struct iomap_ops ext4_iomap_ops;
> >  extern const struct iomap_ops ext4_iomap_report_ops;
> > +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> > +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> > +extern const struct iomap_write_ops ext4_iomap_write_ops;
> >  
> >  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> >  {
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index eb1a323962b1..7f9bfbbc4a4e 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	return count;
> >  }
> >  
> > +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> > +					 struct iov_iter *from)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	const struct iomap_ops *iomap_ops;
> > +
> > +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> > +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> > +	else
> > +		iomap_ops = &ext4_iomap_buffered_write_ops;
> > +
> > +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> > +					 &ext4_iomap_write_ops, NULL);
> > +}
> > +
> >  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >  					struct iov_iter *from)
> >  {
> > @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >  	if (ret <= 0)
> >  		goto out;
> >  
> > -	ret = generic_perform_write(iocb, from);
> > +	if (ext4_inode_buffered_iomap(inode))
> > +		ret = ext4_iomap_buffered_write(iocb, from);
> > +	else
> > +		ret = generic_perform_write(iocb, from);
> >  
> >  out:
> >  	inode_unlock(inode);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 39577a6b65b9..1ae7d3f4a1c8 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> >  	return ret;
> >  }
> >  
> > -static int ext4_nonda_switch(struct super_block *sb)
> > +int ext4_nonda_switch(struct super_block *sb)
> >  {
> >  	s64 free_clusters, dirty_clusters;
> >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> >  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> >  }
> >  
> > +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> > +{
> > +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> > +}
> > +
> > +const struct iomap_write_ops ext4_iomap_write_ops = {
> > +	.iomap_valid = ext4_iomap_valid,
> > +};
> > +
> >  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >  			   struct ext4_map_blocks *map, loff_t offset,
> >  			   loff_t length, unsigned int flags)
> > @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >  		iomap->flags |= IOMAP_F_MERGED;
> >  
> > +	iomap->validity_cookie = map->m_seq;
> > +
> >  	/*
> >  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
> >  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> > @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >  	.iomap_begin = ext4_iomap_begin_report,
> >  };
> >  
> > +/* Map blocks */
> > +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> > +
> >  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > -		loff_t length, struct ext4_map_blocks *map)
> > +		loff_t length, ext4_get_blocks_t get_blocks,
> > +		struct ext4_map_blocks *map)
> >  {
> >  	u8 blkbits = inode->i_blkbits;
> >  
> > @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> >  
> > +	if (get_blocks)
> > +		return get_blocks(inode, map);
> > +
> >  	return ext4_map_blocks(NULL, inode, map, 0);
> >  }
> >  
> > @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >  		return -ERANGE;
> >  
> > -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> > +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >  	if (ret < 0)
> >  		return ret;
> >  
> > @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >  	return 0;
> >  }
> >  
> > +static int ext4_iomap_get_blocks(struct inode *inode,
> > +				 struct ext4_map_blocks *map)
> > +{
> > +	loff_t i_size = i_size_read(inode);
> > +	handle_t *handle;
> > +	int ret;
> > +
> > +	/*
> > +	 * Check if the blocks have already been allocated, this could
> > +	 * avoid initiating a new journal transaction and return the
> > +	 * mapping information directly.
> > +	 */
> > +	if ((map->m_lblk + map->m_len) <=
> > +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> > +		ret = ext4_map_blocks(NULL, inode, map, 0);
> > +		if (ret < 0)
> > +			return ret;
> > +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> > +				    EXT4_MAP_DELAYED))
> > +			return 0;
> > +	}
> > +
> > +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> > +			ext4_chunk_trans_blocks(inode, map->m_len));
> > +	if (IS_ERR(handle))
> > +		return PTR_ERR(handle);
> > +
> > +	ret = ext4_map_blocks(handle, inode, map,
> > +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> > +	/*
> > +	 * Stop handle here following the lock ordering of the folio lock
> > +	 * and the transaction start.
> > +	 */
> > +	ext4_journal_stop(handle);
> > +
> > +	return ret;
> > +}
> > +
> > +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> > +		loff_t offset, loff_t length, unsigned int flags,
> > +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> > +{
> > +	int ret, retries = 0;
> > +	struct ext4_map_blocks map;
> > +	ext4_get_blocks_t *get_blocks;
> > +
> > +	ret = ext4_emergency_state(inode->i_sb);
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	/* Inline data and non-extent are not supported. */
> > +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > +		return -ERANGE;
> > +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> > +		return -EINVAL;
> > +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> > +		return -EINVAL;
> > +
> > +	if (delalloc)
> > +		get_blocks = ext4_da_map_blocks;
> > +	else
> > +		get_blocks = ext4_iomap_get_blocks;
> > +retry:
> > +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> > +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > +		goto retry;
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> > +	return 0;
> > +}
> > +
> > +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> > +		loff_t offset, loff_t length, unsigned int flags,
> > +		struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > +						  iomap, srcmap, false);
> > +}
> > +
> > +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> > +		loff_t offset, loff_t length, unsigned int flags,
> > +		struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > +						  iomap, srcmap, true);
> > +}
> > +
> > +/*
> > + * On write failure, drop the stale delayed allocation range and release
> > + * its reserved space for both start and end blocks. Otherwise, we may
> > + * leave a range of delayed extents covered by a clean folio, which can
> > + * result in inaccurate space reservation accounting.
> > + */
> > +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> > +				     loff_t length, struct iomap *iomap)
> > +{
> > +	down_write(&EXT4_I(inode)->i_data_sem);
> > +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> > +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> > +	up_write(&EXT4_I(inode)->i_data_sem);
> > +}
> > +
> > +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> > +					    loff_t length, ssize_t written,
> > +					    unsigned int flags,
> > +					    struct iomap *iomap)
> > +{
> > +	loff_t start_byte, end_byte;
> > +
> > +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> > +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > +		return 0;
> > +
> > +	/* Nothing to do if we've written the entire delalloc extent */
> > +	start_byte = iomap_last_written_block(inode, offset, written);
> > +	end_byte = round_up(offset + length, i_blocksize(inode));
> > +	if (start_byte >= end_byte)
> > +		return 0;
> > +
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> > +				     iomap, ext4_iomap_punch_delalloc);
> > +	filemap_invalidate_unlock(inode->i_mapping);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Since we always allocate unwritten extents, there is no need for
> > + * iomap_end to clean up allocated blocks on a short write.
> > + */
> > +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> > +	.iomap_begin = ext4_iomap_buffered_write_begin,
> > +};
> > +
> > +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> > +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> > +	.iomap_end = ext4_iomap_buffered_da_write_end,
> > +};
> > +
> >  const struct iomap_ops ext4_iomap_buffered_read_ops = {
> >  	.iomap_begin = ext4_iomap_buffered_read_begin,
> >  };
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 6a77db4d3124..9bc294b769db 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> >   *   -> page lock -> i_data_sem (rw)
> >   *
> >   * buffered write path:
> > - * sb_start_write -> i_mutex -> mmap_lock
> > - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> > - *   i_data_sem (rw)
> > + * sb_start_write -> i_rwsem (w) -> mmap_lock
> > + * - buffer_head path:
> > + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> > + *     i_data_sem (rw)
> > + * - iomap path:
> > + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> > + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> >   *
> >   * truncate:
> >   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> > -- 
> > 2.52.0
> > 
> 


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-26 17:10   ` Ojaswin Mujoo
  2026-05-28 15:40     ` Darrick J. Wong
@ 2026-05-29  9:13     ` Zhang Yi
  2026-06-02 10:05       ` Ojaswin Mujoo
  1 sibling, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-29  9:13 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

Hi, Ojaswin!

On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>
>>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>    ext4_da_map_blocks() to map delalloc extents.
>>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>    ext4_iomap_get_blocks() to directly allocate blocks.
>>
>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>> validity.
>>
>> Key changes and considerations:
>>
>>  - Unwritten extents for new blocks (dioread_nolock always on)
>>    Since data=ordered mode is not used to prevent stale data exposure in
>>    the non-delayed allocation path, new blocks are always allocated as
>>    unwritten extents.
> 
> Okay makes sense.
> 
>>
>>  - Short write and write failure handling
>>    a. Delalloc path: On short write or failure, the stale delalloc range
>>       must be dropped and its space reservation released. Otherwise, a
>>       clean folio may cover leftover delalloc extents, causing
>>       inaccurate space reservation accounting.
> 
> Hmm, okay so in the usual buffer head path, seems like during a short
> write we still zero the new buffers we couldn't write and keep it dirty
> (folio_zero_new_buffers()). This way they are still written back and
> the delalloc reservations are used up.
> 

In fact, in the normal buffer_head path, writeback does not consume the
delalloc reservations left behind by a short write. Instead, the
reservations are retained until the inode is released or the area is
written again using delalloc. This is because i_size is not updated
during short writes. Therefore, when a zeroed dirty folio beyond i_size
is written back, no block mapping is created for it. For details, please
see the "lblk >= blocks" check in mpage_process_page_bufs().

This will not lead to double accounting of reserved space, because
ext4_da_map_blocks() only reserves space when inserting a new delalloc
extent. Therefore, this does not pose a serious issue. However, it may
cause some temporary and minor space leaks. Nevertheless, I think it
would be better if delalloc extents could be released for the buffer_head
path when short writes occur.
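
To make the accounting argument concrete, here is a toy userspace model.
It is not kernel code: the toy_* names and the per-block (rather than
per-cluster) reservation are assumptions made for illustration only. It
shows that reserving only when a block first becomes delayed avoids
double counting on a rewrite, and that a leftover range keeps its
reservation until it is punched.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 64

/* Hypothetical toy inode: one "delayed" flag per logical block. */
struct toy_inode {
	bool delayed[TOY_NBLOCKS];
	unsigned int reserved;	/* blocks currently reserved */
};

/*
 * Mark [lblk, lblk + len) as delayed. Space is reserved only for blocks
 * that were not already delayed, so rewriting a range reserves nothing.
 * Returns the number of newly reserved blocks.
 */
static unsigned int toy_da_map(struct toy_inode *ti, unsigned int lblk,
			       unsigned int len)
{
	unsigned int i, newly = 0;

	assert(lblk + len <= TOY_NBLOCKS);
	for (i = lblk; i < lblk + len; i++) {
		if (!ti->delayed[i]) {
			ti->delayed[i] = true;
			ti->reserved++;
			newly++;
		}
	}
	return newly;
}

/*
 * Drop the delayed state of [lblk, lblk + len) and release its
 * reservation, as a short-write cleanup would. Returns the number of
 * blocks released.
 */
static unsigned int toy_da_punch(struct toy_inode *ti, unsigned int lblk,
				 unsigned int len)
{
	unsigned int i, released = 0;

	assert(lblk + len <= TOY_NBLOCKS);
	for (i = lblk; i < lblk + len; i++) {
		if (ti->delayed[i]) {
			ti->delayed[i] = false;
			ti->reserved--;
			released++;
		}
	}
	return released;
}
```

In this model, a short write that never punches its tail simply leaves
"reserved" higher than needed until a later toy_da_punch() or inode
teardown, which is the temporary leak described above.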

> However in iomap we don't mark the range that we couldn't write as dirty
> so we need to make sure we clear up the stale delalloc mappings. Is this
> correct?
> 
Yeah.
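
For reference, the byte range that ->iomap_end() hands to the punch
helper can be sketched in userspace as follows. This is only a sketch
under my reading of the quoted ext4_iomap_buffered_da_write_end() and of
iomap_last_written_block(); the toy_* names are invented, the block size
is assumed to be a power of two, and the IOMAP_DELALLOC / IOMAP_F_NEW
checks are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Byte range [start, end) to be punched. */
struct toy_range {
	int64_t start;
	int64_t end;
};

static int64_t toy_round_down(int64_t x, int64_t bs)
{
	return x & ~(bs - 1);
}

static int64_t toy_round_up(int64_t x, int64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/*
 * First block boundary that was not fully or partially written: if
 * nothing was written, start from the block containing pos; otherwise
 * keep every block the write touched.
 */
static int64_t toy_last_written_block(int64_t pos, int64_t written,
				      int64_t bs)
{
	assert(written >= 0);
	if (written == 0)
		return toy_round_down(pos, bs);
	return toy_round_up(pos + written, bs);
}

/*
 * Compute the stale delalloc range left by a short write.
 * Returns 1 and fills *r if there is something to punch, else 0.
 */
static int toy_punch_range(int64_t offset, int64_t length, int64_t written,
			   int64_t bs, struct toy_range *r)
{
	r->start = toy_last_written_block(offset, written, bs);
	r->end = toy_round_up(offset + length, bs);
	return r->start < r->end;
}
```

With 4096-byte blocks, a 16384-byte mapping at offset 0 with only 5000
bytes written gives the range [8192, 16384), so the two blocks touched
by the write keep their delalloc extents.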

Thanks,
Yi.

> Regards,
> Ojaswin
> 
>>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>>       short write.
>>
>>  - Lock ordering reversal
>>    The folio lock and transaction start ordering is reversed compared to
>>    the buffer_head buffered write path. To handle this, the journal
>>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>>    documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/ext4.h  |   4 ++
>>  fs/ext4/file.c  |  20 +++++-
>>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  fs/ext4/super.c |  10 ++-
>>  4 files changed, 192 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1e27d73d7427..4832e7f7db82 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>>  				struct buffer_head *bh);
>>  void ext4_set_inode_mapping_order(struct inode *inode);
>> +int ext4_nonda_switch(struct super_block *sb);
>>  #define FALL_BACK_TO_NONDELALLOC 1
>>  #define CONVERT_INLINE_DATA	 2
>>  
>> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>>  
>>  extern const struct iomap_ops ext4_iomap_ops;
>>  extern const struct iomap_ops ext4_iomap_report_ops;
>> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
>> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
>> +extern const struct iomap_write_ops ext4_iomap_write_ops;
>>  
>>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>>  {
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index eb1a323962b1..7f9bfbbc4a4e 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>  	return count;
>>  }
>>  
>> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
>> +					 struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	const struct iomap_ops *iomap_ops;
>> +
>> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
>> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
>> +	else
>> +		iomap_ops = &ext4_iomap_buffered_write_ops;
>> +
>> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
>> +					 &ext4_iomap_write_ops, NULL);
>> +}
>> +
>>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>>  					struct iov_iter *from)
>>  {
>> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>>  	if (ret <= 0)
>>  		goto out;
>>  
>> -	ret = generic_perform_write(iocb, from);
>> +	if (ext4_inode_buffered_iomap(inode))
>> +		ret = ext4_iomap_buffered_write(iocb, from);
>> +	else
>> +		ret = generic_perform_write(iocb, from);
>>  
>>  out:
>>  	inode_unlock(inode);
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 39577a6b65b9..1ae7d3f4a1c8 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>>  	return ret;
>>  }
>>  
>> -static int ext4_nonda_switch(struct super_block *sb)
>> +int ext4_nonda_switch(struct super_block *sb)
>>  {
>>  	s64 free_clusters, dirty_clusters;
>>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
>>  }
>>  
>> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
>> +{
>> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
>> +}
>> +
>> +const struct iomap_write_ops ext4_iomap_write_ops = {
>> +	.iomap_valid = ext4_iomap_valid,
>> +};
>> +
>>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>  			   struct ext4_map_blocks *map, loff_t offset,
>>  			   loff_t length, unsigned int flags)
>> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>  		iomap->flags |= IOMAP_F_MERGED;
>>  
>> +	iomap->validity_cookie = map->m_seq;
>> +
>>  	/*
>>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
>> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>  	.iomap_begin = ext4_iomap_begin_report,
>>  };
>>  
>> +/* Map blocks */
>> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
>> +
>>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>> -		loff_t length, struct ext4_map_blocks *map)
>> +		loff_t length, ext4_get_blocks_t get_blocks,
>> +		struct ext4_map_blocks *map)
>>  {
>>  	u8 blkbits = inode->i_blkbits;
>>  
>> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
>>  
>> +	if (get_blocks)
>> +		return get_blocks(inode, map);
>> +
>>  	return ext4_map_blocks(NULL, inode, map, 0);
>>  }
>>  
>> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>>  		return -ERANGE;
>>  
>> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>  	if (ret < 0)
>>  		return ret;
>>  
>> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>>  	return 0;
>>  }
>>  
>> +static int ext4_iomap_get_blocks(struct inode *inode,
>> +				 struct ext4_map_blocks *map)
>> +{
>> +	loff_t i_size = i_size_read(inode);
>> +	handle_t *handle;
>> +	int ret;
>> +
>> +	/*
>> +	 * Check if the blocks have already been allocated, this could
>> +	 * avoid initiating a new journal transaction and return the
>> +	 * mapping information directly.
>> +	 */
>> +	if ((map->m_lblk + map->m_len) <=
>> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
>> +		ret = ext4_map_blocks(NULL, inode, map, 0);
>> +		if (ret < 0)
>> +			return ret;
>> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
>> +				    EXT4_MAP_DELAYED))
>> +			return 0;
>> +	}
>> +
>> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
>> +			ext4_chunk_trans_blocks(inode, map->m_len));
>> +	if (IS_ERR(handle))
>> +		return PTR_ERR(handle);
>> +
>> +	ret = ext4_map_blocks(handle, inode, map,
>> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
>> +	/*
>> +	 * Stop handle here following the lock ordering of the folio lock
>> +	 * and the transaction start.
>> +	 */
>> +	ext4_journal_stop(handle);
>> +
>> +	return ret;
>> +}
>> +
>> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
>> +{
>> +	int ret, retries = 0;
>> +	struct ext4_map_blocks map;
>> +	ext4_get_blocks_t *get_blocks;
>> +
>> +	ret = ext4_emergency_state(inode->i_sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	/* Inline data and non-extent are not supported. */
>> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>> +		return -ERANGE;
>> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>> +		return -EINVAL;
>> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
>> +		return -EINVAL;
>> +
>> +	if (delalloc)
>> +		get_blocks = ext4_da_map_blocks;
>> +	else
>> +		get_blocks = ext4_iomap_get_blocks;
>> +retry:
>> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
>> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> +		goto retry;
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, false);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, true);
>> +}
>> +
>> +/*
>> + * On write failure, drop the stale delayed allocation range and release
>> + * its reserved space for both start and end blocks. Otherwise, we may
>> + * leave a range of delayed extents covered by a clean folio, which can
>> + * result in inaccurate space reservation accounting.
>> + */
>> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>> +				     loff_t length, struct iomap *iomap)
>> +{
>> +	down_write(&EXT4_I(inode)->i_data_sem);
>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>> +	up_write(&EXT4_I(inode)->i_data_sem);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>> +					    loff_t length, ssize_t written,
>> +					    unsigned int flags,
>> +					    struct iomap *iomap)
>> +{
>> +	loff_t start_byte, end_byte;
>> +
>> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
>> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>> +		return 0;
>> +
>> +	/* Nothing to do if we've written the entire delalloc extent */
>> +	start_byte = iomap_last_written_block(inode, offset, written);
>> +	end_byte = round_up(offset + length, i_blocksize(inode));
>> +	if (start_byte >= end_byte)
>> +		return 0;
>> +
>> +	filemap_invalidate_lock(inode->i_mapping);
>> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
>> +				     iomap, ext4_iomap_punch_delalloc);
>> +	filemap_invalidate_unlock(inode->i_mapping);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Since we always allocate unwritten extents, there is no need for
>> + * iomap_end to clean up allocated blocks on a short write.
>> + */
>> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_write_begin,
>> +};
>> +
>> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
>> +	.iomap_end = ext4_iomap_buffered_da_write_end,
>> +};
>> +
>>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
>>  	.iomap_begin = ext4_iomap_buffered_read_begin,
>>  };
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 6a77db4d3124..9bc294b769db 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
>>   *   -> page lock -> i_data_sem (rw)
>>   *
>>   * buffered write path:
>> - * sb_start_write -> i_mutex -> mmap_lock
>> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
>> - *   i_data_sem (rw)
>> + * sb_start_write -> i_rwsem (w) -> mmap_lock
>> + * - buffer_head path:
>> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
>> + *     i_data_sem (rw)
>> + * - iomap path:
>> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
>> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
>>   *
>>   * truncate:
>>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
>> -- 
>> 2.52.0
>>



* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-27 12:49   ` Ojaswin Mujoo
@ 2026-05-29 14:12     ` Zhang Yi
  2026-05-29 15:32       ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-29 14:12 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

Hi, Ojaswin!

On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
>>
>>   - ext4_iomap_writepages(): the main writeback entry point.
>>   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>>     block mapping and I/O submission.
>>   - A new end I/O worker for converting unwritten extents, updating file
>>     size, and handling DATA_ERR_ABORT after I/O completion.
>>
>> Core implementation details:
>>
>>   - ->writeback_range() callback
>>     Calls ext4_iomap_map_writeback_range() to query the longest range of
>>     existing mapped extents. For performance, when a block range is not
>>     yet allocated, it allocates based on the writeback length and delalloc
>>     extent length, rather than allocating for a single folio at a time.
>>     The folio is then added to an iomap_ioend instance.
>>
>>   - ->writeback_submit() callback
>>     Registers ext4_iomap_end_bio() as the end bio callback. This callback
>>     schedules a worker to handle:
>>     - Unwritten extent conversion.
>>     - i_disksize update after data is written back.
>>     - Journal abort on writeback I/O failure.
> 
> Hi Zhang, the changes look good. I have a few comments below:
>>
>> Key changes and considerations:
>>
>> - Append write and unwritten extents
>>    Since data=ordered mode is not used to prevent stale data exposure
>>    during append writebacks, new blocks are always allocated as unwritten
>>    extents (i.e. always enable dioread_nolock), and i_disksize update is
>>    postponed until I/O completion.
> 
> Makes sense.
> 
>>    Additionally, the deadlock that the
>>    reserve handle was expected to resolve does not occur anymore.
> 
> I guess this is since we don't use ordered data so we can't block on
> starting a txn in end io.

Yep.

> 
>>    Therefore, the end I/O worker can start a normal journal handle
>>    instead of a reserve handle when converting unwritten extents.
>>
>> - Lock ordering
>>    The ->writeback_range() callback runs under the folio lock, requiring
>>    the journal handle to be started under that same lock. This reverses
>>    the order compared to the buffer_head writeback path. The lock ordering
>>    documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>   fs/ext4/ext4.h        |   4 +
>>   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>>   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>>   fs/ext4/super.c       |   7 +-
>>   fs/iomap/ioend.c      |   3 +-
>>   include/linux/iomap.h |   1 +
>>   6 files changed, 346 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 4832e7f7db82..078feda47e36 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>>   	 */
>>   	struct list_head i_rsv_conversion_list;
>>   	struct work_struct i_rsv_conversion_work;
>> +	struct list_head i_iomap_ioend_list;
>> +	struct work_struct i_iomap_ioend_work;
>>   
>>   	/*
>>   	 * Transactions that contain inode's metadata needed to complete
>> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
>>   		size_t len);
>>   extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
>>   extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
>> +extern void ext4_iomap_end_io(struct work_struct *work);
>> +extern void ext4_iomap_end_bio(struct bio *bio);
>>   
>>   /* mmp.c */
>>   extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 1ae7d3f4a1c8..a80195bd6f20 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -44,6 +44,7 @@
>>   #include <linux/iversion.h>
>>   
>>   #include "ext4_jbd2.h"
>> +#include "ext4_extents.h"
>>   #include "xattr.h"
>>   #include "acl.h"
>>   #include "truncate.h"
>> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>>   	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>>   }
>>   
>> +static int ext4_iomap_map_one_extent(struct inode *inode,
>> +				     struct ext4_map_blocks *map)
>> +{
>> +	struct extent_status es;
>> +	handle_t *handle = NULL;
>> +	int credits, map_flags;
>> +	int retval;
>> +
>> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
>> +	if (IS_ERR(handle))
>> +		return PTR_ERR(handle);
>> +
>> +	map->m_flags = 0;
>> +	/*
>> +	 * It is necessary to look up extent and map blocks under i_data_sem
>> +	 * in write mode, otherwise, the delalloc extent may become stale
>> +	 * during concurrent truncate operations.
>> +	 */
>> +	ext4_fc_track_inode(handle, inode);
>> +	down_write(&EXT4_I(inode)->i_data_sem);
>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
>> +		map->m_len = min_t(unsigned int, retval, map->m_len);
>> +
>> +		if (ext4_es_is_delayed(&es)) {
> 
> I understand that it is okay for us to rely on extent status ==
> delayed here because we never reclaim delayed es entries and hence we
> are sure to not skip any delayed block allocations here.

Yeah, right.

> 
>> +			map->m_flags |= EXT4_MAP_DELAYED;
>> +			trace_ext4_da_write_pages_extent(inode, map);
>> +			/*
>> +			 * Call ext4_map_create_blocks() to allocate any
>> +			 * delayed allocation blocks. It is possible that
>> +			 * we're going to need more metadata blocks, however
>> +			 * we must not fail because we're in writeback and
>> +			 * there is nothing we can do so it might result in
>> +			 * data loss. So use reserved blocks to allocate
>> +			 * metadata if possible.
>> +			 */
>> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
>> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
>> +				    EXT4_EX_NOCACHE;
>> +
>> +			retval = ext4_map_create_blocks(handle, inode, map,
>> +							map_flags);
>> +			if (retval > 0)
>> +				ext4_fc_track_range(handle, inode, map->m_lblk,
>> +						map->m_lblk + map->m_len - 1);
>> +			goto out;
>> +		} else if (unlikely(ext4_es_is_hole(&es)))
> 
> Now that you've fixed the partial invalidate in iomap (patch 12/23)
> can we still hit this hole case?

Theoretically, no hole should be encountered; this is just defensive
programming.

> 
>> +			goto out;
>> +
>> +		/* Found written or unwritten extent. */
>> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
>> +		map->m_flags = ext4_es_is_written(&es) ?
>> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
>> +		goto out;
>> +	}
>> +
>> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
>> +out:
>> +	up_write(&EXT4_I(inode)->i_data_sem);
>> +	ext4_journal_stop(handle);
>> +	return retval < 0 ? retval : 0;
>> +}
>> +
>> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>> +					  loff_t offset, unsigned int dirty_len)
>> +{
>> +	struct inode *inode = wpc->inode;
>> +	struct super_block *sb = inode->i_sb;
>> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
>> +	struct ext4_map_blocks map;
>> +	unsigned int blkbits = inode->i_blkbits;
>> +	unsigned int index = offset >> blkbits;
>> +	unsigned int blk_end, blk_len;
>> +	int ret;
>> +
>> +	ret = ext4_emergency_state(sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	/* Check validity of the cached writeback mapping. */
>> +	if (offset >= wpc->iomap.offset &&
>> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
>> +	    ext4_iomap_valid(inode, &wpc->iomap))
>> +		return 0;
>> +
>> +	blk_len = dirty_len >> blkbits;
>> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
>> +				      (UINT_MAX - 1));
> 
> This is an interesting idea. I'm just a bit worried about the case where
> range_end == LLONG_MAX (bg flush): we will always be trying to allocate
> MAX_WRITEPAGES, and in case of a slightly fragmented FS, we might keep
> falling into the slower mballoc criteria and waste a lot of time
> scanning the groups.

Actually, MAX_WRITEPAGES is not allocated every time; the allocated
length also depends on the length of the delalloc extent that has
already been written, and the smaller of the two values is taken. If the
user has indeed performed delalloc writes of up to MAX_WRITEPAGES in
length, then regardless of how fragmented the file system is, we will
need to allocate blocks of that length anyway. Reducing the number of
allocation calls is always beneficial.
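
As a rough userspace sketch of that length calculation (not the kernel
code itself): toy_request_len() follows the arithmetic quoted from
ext4_iomap_map_writeback_range(), and toy_alloc_len() follows the
clamping to the extent status entry in ext4_iomap_map_one_extent(). The
toy_* names are invented, and the value 2048 for the extent length cap is
an assumption about MAX_WRITEPAGES_EXTENT_LEN.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed cap on a single writeback mapping, in blocks. */
#define TOY_MAX_WRITEPAGES_EXTENT_LEN 2048u

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Length (in blocks) requested for one mapping: start from the dirty
 * length, extend it up to the end of the writeback range, then cap it.
 */
static unsigned int toy_request_len(uint64_t offset, unsigned int dirty_len,
				    uint64_t range_end, unsigned int blkbits)
{
	unsigned int index = (unsigned int)(offset >> blkbits);
	unsigned int blk_len = dirty_len >> blkbits;
	uint64_t end = range_end >> blkbits;
	unsigned int blk_end = end > UINT_MAX - 1 ? UINT_MAX - 1
						  : (unsigned int)end;

	if (blk_end > index + blk_len)
		blk_len = blk_end - index + 1;
	return toy_min(TOY_MAX_WRITEPAGES_EXTENT_LEN, blk_len);
}

/*
 * Length actually allocated: the request clamped to what remains of the
 * delalloc extent [es_lblk, es_lblk + es_len) starting at lblk.
 */
static unsigned int toy_alloc_len(unsigned int req_len, unsigned int lblk,
				  unsigned int es_lblk, unsigned int es_len)
{
	assert(lblk >= es_lblk && lblk - es_lblk < es_len);
	return toy_min(es_len - (lblk - es_lblk), req_len);
}
```

So with a background flush (range_end at its maximum) the request is
capped at the extent length limit, but a 10-block delalloc extent still
yields a 10-block allocation.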

> 
>> +	if (blk_end > index + blk_len)
>> +		blk_len = blk_end - index + 1;
>> +
>> +retry:
>> +	map.m_lblk = index;
>> +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
>> +	ret = ext4_map_blocks(NULL, inode, &map,
>> +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> 
> Do we really need the IO_SUBMIT flag here now that we are:
> 1. Not using ordered data
> 2. We anyways don't use it in ext4_iomap_map_one_extent().
> 
> I think we can drop it.

We can't drop it, because IO_SUBMIT is also used to skip the
ext4_check_map_extents_env() check in ext4_map_blocks() on the writeback
path.

Cheers,
Yi.

> 
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/*
>> +	 * The map is not a delalloc extent, it must either be a hole
>> +	 * or an extent which have already been allocated.
>> +	 */
>> +	if (!(map.m_flags & EXT4_MAP_DELAYED))
>> +		goto out;
>> +
>> +	/* Map one delalloc extent. */
>> +	ret = ext4_iomap_map_one_extent(inode, &map);
>> +	if (ret < 0) {
>> +		if (ext4_emergency_state(sb))
>> +			return ret;
>> +
>> +		/*
>> +		 * Retry transient ENOSPC errors, if
>> +		 * ext4_count_free_blocks() is non-zero, a commit
>> +		 * should free up blocks.
>> +		 */
>> +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
>> +			jbd2_journal_force_commit_nested(journal);
>> +			goto retry;
>> +		}
>> +
>> +		ext4_msg(sb, KERN_CRIT,
>> +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
>> +			 inode->i_ino, (unsigned long long)map.m_lblk,
>> +			 (unsigned int)map.m_len, -ret);
>> +		ext4_msg(sb, KERN_CRIT,
>> +			 "This should not happen!! Data will be lost\n");
>> +		if (ret == -ENOSPC)
>> +			ext4_print_free_blocks(inode);
>> +		return ret;
>> +	}
>> +out:
>> +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
>> +	return 0;
>> +}
>> +
> 
> <snip>
>>



* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-29 14:12     ` Zhang Yi
@ 2026-05-29 15:32       ` Ojaswin Mujoo
  2026-05-30  1:21         ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-05-29 15:32 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai

On Fri, May 29, 2026 at 10:12:12PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > Add the iomap writeback path for ext4 buffered I/O. This introduces:
> > > 
> > >   - ext4_iomap_writepages(): the main writeback entry point.
> > >   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> > >     block mapping and I/O submission.
> > >   - A new end I/O worker for converting unwritten extents, updating file
> > >     size, and handling DATA_ERR_ABORT after I/O completion.
> > > 
> > > Core implementation details:
> > > 
> > >   - ->writeback_range() callback
> > >     Calls ext4_iomap_map_writeback_range() to query the longest range of
> > >     existing mapped extents. For performance, when a block range is not
> > >     yet allocated, it allocates based on the writeback length and delalloc
> > >     extent length, rather than allocating for a single folio at a time.
> > >     The folio is then added to an iomap_ioend instance.
> > > 
> > >   - ->writeback_submit() callback
> > >     Registers ext4_iomap_end_bio() as the end bio callback. This callback
> > >     schedules a worker to handle:
> > >     - Unwritten extent conversion.
> > >     - i_disksize update after data is written back.
> > >     - Journal abort on writeback I/O failure.
> > 
> > Hi Zhang, the changes look good. I have a few comments below:
> > > 
> > > Key changes and considerations:
> > > 
> > > - Append write and unwritten extents
> > >    Since data=ordered mode is not used to prevent stale data exposure
> > >    during append writebacks, new blocks are always allocated as unwritten
> > >    extents (i.e. always enable dioread_nolock), and i_disksize update is
> > >    postponed until I/O completion.
> > 
> > Makes sense.
> > 
> > >    Additionally, the deadlock that the
> > >    reserve handle was expected to resolve does not occur anymore.
> > 
> > I guess this is since we don't use ordered data so we can't block on
> > starting a txn in end io.
> 
> Yep.
> 
> > 
> > >    Therefore, the end I/O worker can start a normal journal handle
> > >    instead of a reserve handle when converting unwritten extents.
> > > 
> > > - Lock ordering
> > >    The ->writeback_range() callback runs under the folio lock, requiring
> > >    the journal handle to be started under that same lock. This reverses
> > >    the order compared to the buffer_head writeback path. The lock ordering
> > >    documentation in super.c has been updated accordingly.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > >   fs/ext4/ext4.h        |   4 +
> > >   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
> > >   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
> > >   fs/ext4/super.c       |   7 +-
> > >   fs/iomap/ioend.c      |   3 +-
> > >   include/linux/iomap.h |   1 +
> > >   6 files changed, 346 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 4832e7f7db82..078feda47e36 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> > >   	 */
> > >   	struct list_head i_rsv_conversion_list;
> > >   	struct work_struct i_rsv_conversion_work;
> > > +	struct list_head i_iomap_ioend_list;
> > > +	struct work_struct i_iomap_ioend_work;
> > >   	/*
> > >   	 * Transactions that contain inode's metadata needed to complete
> > > @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
> > >   		size_t len);
> > >   extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
> > >   extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> > > +extern void ext4_iomap_end_io(struct work_struct *work);
> > > +extern void ext4_iomap_end_bio(struct bio *bio);
> > >   /* mmp.c */
> > >   extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 1ae7d3f4a1c8..a80195bd6f20 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -44,6 +44,7 @@
> > >   #include <linux/iversion.h>
> > >   #include "ext4_jbd2.h"
> > > +#include "ext4_extents.h"
> > >   #include "xattr.h"
> > >   #include "acl.h"
> > >   #include "truncate.h"
> > > @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> > >   	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> > >   }
> > > +static int ext4_iomap_map_one_extent(struct inode *inode,
> > > +				     struct ext4_map_blocks *map)
> > > +{
> > > +	struct extent_status es;
> > > +	handle_t *handle = NULL;
> > > +	int credits, map_flags;
> > > +	int retval;
> > > +
> > > +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> > > +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> > > +	if (IS_ERR(handle))
> > > +		return PTR_ERR(handle);
> > > +
> > > +	map->m_flags = 0;
> > > +	/*
> > > +	 * It is necessary to look up extent and map blocks under i_data_sem
> > > +	 * in write mode, otherwise, the delalloc extent may become stale
> > > +	 * during concurrent truncate operations.
> > > +	 */
> > > +	ext4_fc_track_inode(handle, inode);
> > > +	down_write(&EXT4_I(inode)->i_data_sem);
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > > +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> > > +		map->m_len = min_t(unsigned int, retval, map->m_len);
> > > +
> > > +		if (ext4_es_is_delayed(&es)) {
> > 
> > I understand that it is okay for us to rely on extent status ==
> > delayed here because we never reclaim delayed es entries and hence we
> > are sure to not skip any delayed block allocations here.
> 
> Yeah, right.
> 
> > 
> > > +			map->m_flags |= EXT4_MAP_DELAYED;
> > > +			trace_ext4_da_write_pages_extent(inode, map);
> > > +			/*
> > > +			 * Call ext4_map_create_blocks() to allocate any
> > > +			 * delayed allocation blocks. It is possible that
> > > +			 * we're going to need more metadata blocks, however
> > > +			 * we must not fail because we're in writeback and
> > > +			 * there is nothing we can do so it might result in
> > > +			 * data loss. So use reserved blocks to allocate
> > > +			 * metadata if possible.
> > > +			 */
> > > +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> > > +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> > > +				    EXT4_EX_NOCACHE;
> > > +
> > > +			retval = ext4_map_create_blocks(handle, inode, map,
> > > +							map_flags);
> > > +			if (retval > 0)
> > > +				ext4_fc_track_range(handle, inode, map->m_lblk,
> > > +						map->m_lblk + map->m_len - 1);
> > > +			goto out;
> > > +		} else if (unlikely(ext4_es_is_hole(&es)))
> > 
> > Now that you've fixed the partial invalidate in iomap (patch 12/23)
> > can we still hit this hole case?
> 
> Theoretically, no hole should be encountered; this is just defensive
> programming.
> 
> > 
> > > +			goto out;
> > > +
> > > +		/* Found written or unwritten extent. */
> > > +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> > > +		map->m_flags = ext4_es_is_written(&es) ?
> > > +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> > > +		goto out;
> > > +	}
> > > +
> > > +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> > > +out:
> > > +	up_write(&EXT4_I(inode)->i_data_sem);
> > > +	ext4_journal_stop(handle);
> > > +	return retval < 0 ? retval : 0;
> > > +}
> > > +
> > > +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> > > +					  loff_t offset, unsigned int dirty_len)
> > > +{
> > > +	struct inode *inode = wpc->inode;
> > > +	struct super_block *sb = inode->i_sb;
> > > +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> > > +	struct ext4_map_blocks map;
> > > +	unsigned int blkbits = inode->i_blkbits;
> > > +	unsigned int index = offset >> blkbits;
> > > +	unsigned int blk_end, blk_len;
> > > +	int ret;
> > > +
> > > +	ret = ext4_emergency_state(sb);
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	/* Check validity of the cached writeback mapping. */
> > > +	if (offset >= wpc->iomap.offset &&
> > > +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> > > +	    ext4_iomap_valid(inode, &wpc->iomap))
> > > +		return 0;
> > > +
> > > +	blk_len = dirty_len >> blkbits;
> > > +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> > > +				      (UINT_MAX - 1));
> > 
> > This is an interesting idea. I'm just a bit worried when we have
> > range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
> > > MAX_WRITEPAGES. In case of a slightly fragmented FS, we might keep
> > > falling into slower mballoc criteria and might waste a lot of time
> > scanning the groups.
> 
> Actually, MAX_WRITEPAGES is not allocated every time; the allocated
> length also depends on the length of data that has already been delayed
> for writing, and the smaller value is taken. If the user has indeed

Hmm so we take the blk_end based on range_end (which is LLONG_MAX for bg
flusher) and then our blk_len will be set accordingly and would become a
large number as well. Then we will set map.m_len based on this blk_len
and MAX_WRITEPAGES. Am I missing something that clamps our m_len?

> performed delalloc writes on data of up to MAX_WRITEPAGES in length,
> then regardless of how fragmented the file system is, I will need to
> allocate blocks of that length. Reducing the number of calls is always
> beneficial.
> 
> > 
> > > +	if (blk_end > index + blk_len)
> > > +		blk_len = blk_end - index + 1;
> > > +
> > > +retry:
> > > +	map.m_lblk = index;
> > > +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> > > +	ret = ext4_map_blocks(NULL, inode, &map,
> > > +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> > 
> > Do we really need the IO_SUBMIT flag here now that we are:
> > 1. Not using ordered data
> > 2. We anyways don't use it in ext4_iomap_map_one_extent().
> > 
> > I think we can drop it.
> 
> We can't drop it, because IO_SUBMIT is also used to avoid the check of
> ext4_check_map_extents_env() in ext4_map_blocks() under the writeback
> path.

Yes, noted :) 

Thanks,
Ojaswin

> 
> Cheers,
> Yi.
> 
> > 
> > > +	if (ret < 0)
> > > +		return ret;
> > > +
> > > +	/*
> > > +	 * The map is not a delalloc extent, it must either be a hole
> > > +	 * or an extent which have already been allocated.
> > > +	 */
> > > +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> > > +		goto out;
> > > +
> > > +	/* Map one delalloc extent. */
> > > +	ret = ext4_iomap_map_one_extent(inode, &map);
> > > +	if (ret < 0) {
> > > +		if (ext4_emergency_state(sb))
> > > +			return ret;
> > > +
> > > +		/*
> > > +		 * Retry transient ENOSPC errors, if
> > > +		 * ext4_count_free_blocks() is non-zero, a commit
> > > +		 * should free up blocks.
> > > +		 */
> > > +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> > > +			jbd2_journal_force_commit_nested(journal);
> > > +			goto retry;
> > > +		}
> > > +
> > > +		ext4_msg(sb, KERN_CRIT,
> > > +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> > > +			 inode->i_ino, (unsigned long long)map.m_lblk,
> > > +			 (unsigned int)map.m_len, -ret);
> > > +		ext4_msg(sb, KERN_CRIT,
> > > +			 "This should not happen!! Data will be lost\n");
> > > +		if (ret == -ENOSPC)
> > > +			ext4_print_free_blocks(inode);
> > > +		return ret;
> > > +	}
> > > +out:
> > > +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> > > +	return 0;
> > > +}
> > > +
> > 
> > <snip>
> > > 
> 

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-29 15:32       ` Ojaswin Mujoo
@ 2026-05-30  1:21         ` Zhang Yi
  2026-06-01  6:26           ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-30  1:21 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

On 5/29/2026 11:32 PM, Ojaswin Mujoo wrote:
> On Fri, May 29, 2026 at 10:12:12PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
>>>>
>>>>   - ext4_iomap_writepages(): the main writeback entry point.
>>>>   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>>>>     block mapping and I/O submission.
>>>>   - A new end I/O worker for converting unwritten extents, updating file
>>>>     size, and handling DATA_ERR_ABORT after I/O completion.
>>>>
>>>> Core implementation details:
>>>>
>>>>   - ->writeback_range() callback
>>>>     Calls ext4_iomap_map_writeback_range() to query the longest range of
>>>>     existing mapped extents. For performance, when a block range is not
>>>>     yet allocated, it allocates based on the writeback length and delalloc
>>>>     extent length, rather than allocating for a single folio at a time.
>>>>     The folio is then added to an iomap_ioend instance.
>>>>
>>>>   - ->writeback_submit() callback
>>>>     Registers ext4_iomap_end_bio() as the end bio callback. This callback
>>>>     schedules a worker to handle:
>>>>     - Unwritten extent conversion.
>>>>     - i_disksize update after data is written back.
>>>>     - Journal abort on writeback I/O failure.
>>>
>>> Hi Zhang, the changes look good. I have a few comments below:
>>>>
>>>> Key changes and considerations:
>>>>
>>>> - Append write and unwritten extents
>>>>    Since data=ordered mode is not used to prevent stale data exposure
>>>>    during append writebacks, new blocks are always allocated as unwritten
>>>>    extents (i.e. always enable dioread_nolock), and i_disksize update is
>>>>    postponed until I/O completion.
>>>
>>> Makes sense.
>>>
>>>>    Additionally, the deadlock that the
>>>>    reserve handle was expected to resolve does not occur anymore.
>>>
>>> I guess this is since we don't use ordered data so we can't block on
>>> starting a txn in end io.
>>
>> Yep.
>>
>>>
>>>>    Therefore, the end I/O worker can start a normal journal handle
>>>>    instead of a reserve handle when converting unwritten extents.
>>>>
>>>> - Lock ordering
>>>>    The ->writeback_range() callback runs under the folio lock, requiring
>>>>    the journal handle to be started under that same lock. This reverses
>>>>    the order compared to the buffer_head writeback path. The lock ordering
>>>>    documentation in super.c has been updated accordingly.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/ext4.h        |   4 +
>>>>   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>>>>   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>>>>   fs/ext4/super.c       |   7 +-
>>>>   fs/iomap/ioend.c      |   3 +-
>>>>   include/linux/iomap.h |   1 +
>>>>   6 files changed, 346 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 4832e7f7db82..078feda47e36 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>>>>   	 */
>>>>   	struct list_head i_rsv_conversion_list;
>>>>   	struct work_struct i_rsv_conversion_work;
>>>> +	struct list_head i_iomap_ioend_list;
>>>> +	struct work_struct i_iomap_ioend_work;
>>>>   	/*
>>>>   	 * Transactions that contain inode's metadata needed to complete
>>>> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
>>>>   		size_t len);
>>>>   extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
>>>>   extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
>>>> +extern void ext4_iomap_end_io(struct work_struct *work);
>>>> +extern void ext4_iomap_end_bio(struct bio *bio);
>>>>   /* mmp.c */
>>>>   extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 1ae7d3f4a1c8..a80195bd6f20 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -44,6 +44,7 @@
>>>>   #include <linux/iversion.h>
>>>>   #include "ext4_jbd2.h"
>>>> +#include "ext4_extents.h"
>>>>   #include "xattr.h"
>>>>   #include "acl.h"
>>>>   #include "truncate.h"
>>>> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>>>>   	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>>>>   }
>>>> +static int ext4_iomap_map_one_extent(struct inode *inode,
>>>> +				     struct ext4_map_blocks *map)
>>>> +{
>>>> +	struct extent_status es;
>>>> +	handle_t *handle = NULL;
>>>> +	int credits, map_flags;
>>>> +	int retval;
>>>> +
>>>> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
>>>> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
>>>> +	if (IS_ERR(handle))
>>>> +		return PTR_ERR(handle);
>>>> +
>>>> +	map->m_flags = 0;
>>>> +	/*
>>>> +	 * It is necessary to look up extent and map blocks under i_data_sem
>>>> +	 * in write mode, otherwise, the delalloc extent may become stale
>>>> +	 * during concurrent truncate operations.
>>>> +	 */
>>>> +	ext4_fc_track_inode(handle, inode);
>>>> +	down_write(&EXT4_I(inode)->i_data_sem);
>>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
>>>> +		map->m_len = min_t(unsigned int, retval, map->m_len);
>>>> +
>>>> +		if (ext4_es_is_delayed(&es)) {
>>>
>>> I understand that it is okay for us to rely on extent status ==
>>> delayed here because we never reclaim delayed es entries and hence we
>>> are sure to not skip any delayed block allocations here.
>>
>> Yeah, right.
>>
>>>
>>>> +			map->m_flags |= EXT4_MAP_DELAYED;
>>>> +			trace_ext4_da_write_pages_extent(inode, map);
>>>> +			/*
>>>> +			 * Call ext4_map_create_blocks() to allocate any
>>>> +			 * delayed allocation blocks. It is possible that
>>>> +			 * we're going to need more metadata blocks, however
>>>> +			 * we must not fail because we're in writeback and
>>>> +			 * there is nothing we can do so it might result in
>>>> +			 * data loss. So use reserved blocks to allocate
>>>> +			 * metadata if possible.
>>>> +			 */
>>>> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
>>>> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
>>>> +				    EXT4_EX_NOCACHE;
>>>> +
>>>> +			retval = ext4_map_create_blocks(handle, inode, map,
>>>> +							map_flags);
>>>> +			if (retval > 0)
>>>> +				ext4_fc_track_range(handle, inode, map->m_lblk,
>>>> +						map->m_lblk + map->m_len - 1);
>>>> +			goto out;
>>>> +		} else if (unlikely(ext4_es_is_hole(&es)))
>>>
>>> Now that you've fixed the partial invalidate in iomap (patch 12/23)
>>> can we still hit this hole case?
>>
>> Theoretically, no hole should be encountered; this is just defensive
>> programming.
>>
>>>
>>>> +			goto out;
>>>> +
>>>> +		/* Found written or unwritten extent. */
>>>> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
>>>> +		map->m_flags = ext4_es_is_written(&es) ?
>>>> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
>>>> +out:
>>>> +	up_write(&EXT4_I(inode)->i_data_sem);
>>>> +	ext4_journal_stop(handle);
>>>> +	return retval < 0 ? retval : 0;
>>>> +}
>>>> +
>>>> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>>>> +					  loff_t offset, unsigned int dirty_len)
>>>> +{
>>>> +	struct inode *inode = wpc->inode;
>>>> +	struct super_block *sb = inode->i_sb;
>>>> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
>>>> +	struct ext4_map_blocks map;
>>>> +	unsigned int blkbits = inode->i_blkbits;
>>>> +	unsigned int index = offset >> blkbits;
>>>> +	unsigned int blk_end, blk_len;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	/* Check validity of the cached writeback mapping. */
>>>> +	if (offset >= wpc->iomap.offset &&
>>>> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
>>>> +	    ext4_iomap_valid(inode, &wpc->iomap))
>>>> +		return 0;
>>>> +
>>>> +	blk_len = dirty_len >> blkbits;
>>>> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
>>>> +				      (UINT_MAX - 1));
>>>
>>> This is an interesting idea. I'm just a bit worried when we have
>>> range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
>>> MAX_WRITEPAGES. In case of a slightly fragmented FS, we might keep
>>> falling into slower mballoc criteria and might waste a lot of time
>>> scanning the groups.
>>
>> Actually, MAX_WRITEPAGES is not allocated every time; the allocated
>> length also depends on the length of data that has already been delayed
>> for writing, and the smaller value is taken. If the user has indeed
> 
> Hmm so we take the blk_end based on range_end (which is LLONG_MAX for bg
> flusher) and then our blk_len will be set accordingly and would become a
> large number as well. Then we will set map.m_len based on this blk_len
> and MAX_WRITEPAGES. Am I missing something that clamps our m_len?
> 
Please take a look at ext4_iomap_map_one_extent():

+		retval = es.es_len - (map->m_lblk - es.es_lblk);
+		map->m_len = min_t(unsigned int, retval, map->m_len);

In this case, m_len is truncated to the length of the delalloc extent.
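
The clamp can also be modeled in isolation. The following is a minimal
userspace sketch, not the kernel code: struct es_model and
clamp_to_extent() are hypothetical names standing in for struct
extent_status and the two lines quoted above, and the block numbers in
the note below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct extent_status used here. */
struct es_model {
	uint32_t es_lblk;	/* first logical block of the cached extent */
	uint32_t es_len;	/* length of the cached extent in blocks */
};

/*
 * Model of the two quoted lines: limit the requested length to what is
 * left of the extent when starting from m_lblk. The caller must pass an
 * m_lblk that lies inside the extent.
 */
uint32_t clamp_to_extent(const struct es_model *es, uint32_t m_lblk,
			 uint32_t m_len)
{
	uint32_t remaining;

	assert(m_lblk >= es->es_lblk && m_lblk - es->es_lblk < es->es_len);
	remaining = es->es_len - (m_lblk - es->es_lblk);
	return remaining < m_len ? remaining : m_len;
}
```

With a 16-block delalloc extent starting at block 100, a request for
2048 blocks at block 104 is reduced to 12 blocks, no matter how large
blk_len was on entry.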

Thanks,
Yi.


* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-05-27 13:41   ` Ojaswin Mujoo
@ 2026-05-30  2:53     ` Zhang Yi
  2026-06-01  9:08       ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-30  2:53 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> In the generic buffer_head I/O path, we rely on the data=ordered mode to
>> ensure that the zeroed EOF block data is written before updating
>> i_disksize, thus preventing stale data from being exposed.
>>
>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>> we issue the I/O immediately after performing the zero operation
>> (without synchronous waiting for performance). This can reduce the risk
>> of exposing stale data, but it does not guarantee that the zero data
>> will be flushed to disk before the metadata of i_disksize is updated.
>> The subsequent patches will wait for this I/O to complete before
>> updating i_disksize.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> I think we discussed that we may not need to do this [1] but I guess
> you've decided to make the tradeoff of issuing the IO to avoid having to
> wait for bg flush to complete the tail page zeroing.
> 

Yes. For truncate up and append fallocate, i_disksize was originally
updated immediately, and the change was persisted via the journal within
the default 5 seconds. But now, if the tail page is not committed
immediately, the update to i_disksize will be delayed by about 30
seconds, and persistence will be postponed to around 35 seconds. I'm not
sure what impact this change might have, so I would rather not introduce
it.

For normal append writes, the impact is minimal, unless we call
sync_range to sync the portion of data that extends beyond EOF.

In addition, if the zeroed page is not issued here immediately, the
logic will become more complex because we need to be more careful about
the order of writeback I/Os to prevent deadlocks caused by mutual
waiting.

> However, I think one side effect might be many threads calling the
> writeback mechanism to issue zero IOs which might not scale well. I
> don't know if it'll be a huge problem though, I guess it's a sort of
> thing we will have to deal with in case we see it in real world
> workloads.
> 

I agree with you. However, I suspect that unless we run some specific
benchmark tests, it should be difficult to encounter a large number of
post-EOF append writes and truncate up operations in real-world usage
scenarios — and I haven't come across such scenarios yet. For
simplicity, I'd like to proceed with this implementation for now. If we
do run into actual problems later, we can consider not issuing I/O
directly here, but instead: 1) find the ordered block in
ext4_sync_file() and perform writeback; 2) ensure writeback ordering
for normal background writeback as well — otherwise, there is a risk of
deadlock (mutual waiting). What do you think?

Cheers,
Yi.

> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> 
>> ---
>>  fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 55 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 239d387ffaf2..e013aeb03d7b 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>  					zero_written);
>>  }
>>  
>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>> +					loff_t from, loff_t end)
>> +{
>> +	struct address_space *mapping = inode->i_mapping;
>> +	struct folio *folio;
>> +	bool do_submit = false;
>> +
>> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>> +	if (IS_ERR(folio))
>> +		/* Already writeback and clear? */
>> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>> +
>> +	folio_wait_writeback(folio);
>> +	WARN_ON_ONCE(folio_test_writeback(folio));
>> +
>> +	if (likely(folio_test_dirty(folio)))
>> +		do_submit = true;
>> +	folio_unlock(folio);
>> +	folio_put(folio);
>> +
>> +	/* Submit zeroed block. */
>> +	if (do_submit)
>> +		return filemap_fdatawrite_range(mapping, from, end - 1);
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Zero out a mapping from file offset 'from' up to the end of the block
>>   * which corresponds to 'from' or to the given 'end' inside this block.
>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>  	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>  		return 0;
>>  
>> -	if (length > blocksize - offset)
>> +	if (length > blocksize - offset) {
>>  		length = blocksize - offset;
>> +		end = from + length;
>> +	}
>>  
>>  	err = ext4_block_zero_range(inode, from, length,
>>  				    &did_zero, &zero_written);
>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>  	 * TODO: In the iomap path, handle this by updating i_disksize to
>>  	 * i_size after the zeroed data has been written back.
>>  	 */
>> -	if (ext4_should_order_data(inode) &&
>> -	    did_zero && zero_written && !IS_DAX(inode)) {
>> -		handle_t *handle;
>> +	if (did_zero && zero_written && !IS_DAX(inode)) {
>> +		if (ext4_should_order_data(inode)) {
>> +			handle_t *handle;
>>  
>> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>> -		if (IS_ERR(handle))
>> -			return PTR_ERR(handle);
>> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>> +			if (IS_ERR(handle))
>> +				return PTR_ERR(handle);
>>  
>> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>> -		ext4_journal_stop(handle);
>> -		if (err)
>> -			return err;
>> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
>> +							length);
>> +			ext4_journal_stop(handle);
>> +			if (err)
>> +				return err;
>> +		/*
>> +		 * inodes using the iomap buffered I/O path do not use the
>> +		 * data=ordered mode. We submit zeroed range directly here.
>> +		 * Do not wait for I/O completion for performance.
>> +		 *
>> +		 * TODO: Any operation that extends i_disksize (including
>> +		 * append write end io past the zeroed boundary, truncate up,
>> +		 * and append fallocate) must wait for the relevant I/O to
>> +		 * complete before updating i_disksize.
>> +		 */
>> +		} else if (ext4_inode_buffered_iomap(inode)) {
>> +			err = ext4_iomap_submit_zero_block(inode, from, end);
>> +			if (err)
>> +				return err;
>> +		}
>>  	}
>>  
>>  	return 0;
>> -- 
>> 2.52.0
>>


* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-27 15:58   ` Ojaswin Mujoo
  2026-05-28 13:34     ` Ojaswin Mujoo
@ 2026-05-30  7:22     ` Zhang Yi
  2026-05-30  8:24       ` Zhang Yi
  1 sibling, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-30  7:22 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

Hi, Ojaswin!

On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> For append writes, wait for ordered I/O to complete before updating
>> i_disksize. This ensures that zeroed data is flushed to disk before the
>> metadata update, preventing stale data from being exposed during
>> unaligned post-EOF append writes.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/ext4.h    | 11 +++++++
>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>  fs/ext4/super.c   | 23 ++++++++++----
>>  4 files changed, 161 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 078feda47e36..9ce2128eea3e 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>  #ifdef CONFIG_FS_ENCRYPTION
>>  	struct fscrypt_inode_info *i_crypt_info;
>>  #endif
>> +
>> +	/*
>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
>> +	 * and truncate-up operations. These parameters are used only in the
>> +	 * iomap buffered I/O path.
>> +	 */
>> +	ext4_lblk_t i_ordered_lblk;
>> +	ext4_lblk_t i_ordered_len;
>> +	wait_queue_head_t i_ordered_wq;
>>  };
>>  
>>  /*
>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>  			     __u64 len, __u64 *moved_len);
>>  
>>  /* page-io.c */
>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
>> +
>>  extern int __init ext4_init_pageio(void);
>>  extern void ext4_exit_pageio(void);
>>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index e013aeb03d7b..11fb369efeb1 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>  {
>>  	struct iomap_ioend *ioend = wpc->wb_ctx;
>>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>> +	ext4_lblk_t start, end, order_lblk, order_len;
>>  
>>  	/*
>>  	 * After I/O completion, a worker needs to be scheduled when:
>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>  
>> +	/*
>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
>> +	 * handling and must not be merged with regular I/O operations.
>> +	 */
>> +	order_len = READ_ONCE(ei->i_ordered_len);
>> +	if (order_len) {
>> +		/*
>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
>> +		 * Ensure we see the updated i_ordered_lblk that was written
>> +		 * before the release store to i_ordered_len.
>> +		 */
>> +		smp_rmb();
>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
>> +				     ioend->io_offset + ioend->io_size);
>> +
>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
> 
> Hi Zhang,
> 
> I guess this check is enough because ordered_lblk and ordered_len will
> always be contained in a single block.

Yeah.
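
For reference, the containment test can be tried out in userspace. This
is only a sketch under assumptions: ioend_covers_ordered() is a
hypothetical helper, and the round-up of the end offset mimics what
EXT4_B_TO_LBLK() does in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the byte range [io_offset, io_offset + io_size) covers
 * the whole ordered block range [order_lblk, order_lblk + order_len).
 * blkbits is log2 of the block size.
 */
bool ioend_covers_ordered(uint64_t io_offset, uint64_t io_size,
			  unsigned int blkbits, uint32_t order_lblk,
			  uint32_t order_len)
{
	uint64_t blksize = 1ULL << blkbits;
	/* First block touched by the I/O. */
	uint64_t start = io_offset >> blkbits;
	/* One past the last block, rounding a partial tail block up. */
	uint64_t end = (io_offset + io_size + blksize - 1) >> blkbits;

	assert(blkbits < 32);
	return start <= order_lblk &&
	       end >= (uint64_t)order_lblk + order_len;
}
```

Since the ordered range is a single block (order_len is 1), an ioend
either contains that block or does not touch it at block granularity.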

> 
>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> 
> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> since we want to wake up the waiters. Is there any other reason?

My original intention was to prevent the loss of the
EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
completion, which could be caused by merging an ordered ioend with a
normal ioend.  In patch 19, we need to determine the flag to update
i_disksize to the correct position.

> 
> Adding the boundary in ->writeback_submit() only affects
> iomap_ioend_can_merge() which happens after we have woken up the waiters
> and deferred the IO to the wq. We ideally want it to affect
> iomap_can_add_to_ioend(), i.e. we need to add IOMAP_F_BOUNDARY in
> ->writeback_range().

IIUC, merging into the same ioend during the submission stage doesn't
seem to cause any problems.

> 
> Secondly, I don't think boundary is the right flag here. It ensures
> that everything before the ordered iomap gets submitted and the ordered
> iomap starts a new ioend. This can still keep getting merged with the
> newer ioends until we decide to submit the IO, which can delay waking
> up the waiters. If we really want the "no merge" behavior, we'll have to
> do something like [1] (Check the 2 NOMERGE flag patches).

Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
still cannot prevent merging. I missed this, thank you for pointing this
out. However, I think perhaps we should change iomap_ioend_can_merge()
to check the iomap_ioend->io_private. Something like:

	if (ioend->io_private || next->io_private)
		return false;

What do you think?

Thanks,
Yi.

> 
>> +		}
>> +	}
>> +
>>  	return iomap_ioend_writeback_submit(wpc, error);
>>  }
>>  
>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>  					loff_t from, loff_t end)
>>  {
>>  	struct address_space *mapping = inode->i_mapping;
>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>>  	struct folio *folio;
>>  	bool do_submit = false;
>> +	int ret;
>>  
>>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>  	if (IS_ERR(folio))
>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>  	folio_wait_writeback(folio);
>>  	WARN_ON_ONCE(folio_test_writeback(folio));
>>  
>> -	if (likely(folio_test_dirty(folio)))
>> +	/*
>> +	 * Mark the ordered range. It will be cleared upon I/O completion
>> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
>> +	 * (including append write end io past the zeroed boundary,
>> +	 * truncate up and append fallocate) must wait for this I/O to
>> +	 * complete before updating i_disksize.
>> +	 *
>> +	 * When multiple overlapping unaligned EOF writes are in flight, we
>> +	 * only need to track and wait for the first one. Subsequent writes
>> +	 * will zero the gap in memory and ensure that the zeroed data is
>> +	 * written out along with the valid data in the same block before
>> +	 * i_disksize is updated.
>> +	 */
>> +	if (likely(folio_test_dirty(folio) &&
>> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
>> +		WRITE_ONCE(ei->i_ordered_lblk,
>> +			   from >> inode->i_blkbits);
>> +		/*
>> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
>> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
>> +		 * i_ordered_lblk is visible when i_ordered_len becomes
>> +		 * non-zero.
>> +		 */
>> +		smp_store_release(&ei->i_ordered_len, 1);
>>  		do_submit = true;
>> +	}
>>  	folio_unlock(folio);
>>  	folio_put(folio);
>>  
>>  	/* Submit zeroed block. */
>> -	if (do_submit)
>> -		return filemap_fdatawrite_range(mapping, from, end - 1);
>> +	if (do_submit) {
>> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
>> +		if (ret) {
>> +			/*
>> +			 * Pairs with wait_event() in
>> +			 * ext4_iomap_wb_ordered_wait(). Ensure
>> +			 * i_ordered_len = 0 is visible before waking up
>> +			 * waiters.
>> +			 */
>> +			smp_store_release(&ei->i_ordered_len, 0);
>> +			wake_up_all(&ei->i_ordered_wq);
>> +			return ret;
>> +		}
>> +	}
>>  	return 0;
>>  }
>>  
>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>  		 * data=ordered mode. We submit zeroed range directly here.
>>  		 * Do not wait for I/O completion for performance.
>>  		 *
>> -		 * TODO: Any operation that extends i_disksize (including
>> -		 * append write end io past the zeroed boundary, truncate up,
>> -		 * and append fallocate) must wait for the relevant I/O to
>> -		 * complete before updating i_disksize.
>> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
>> +		 * for I/O completion before updating i_disksize if the write
>> +		 * extends beyond the zeroed boundary.
>> +		 *
>> +		 * TODO: Any other operation that extends i_disksize
>> +		 * (including truncate up and append fallocate) must wait for
>> +		 * the relevant I/O to complete before updating i_disksize.
>>  		 */
>>  		} else if (ext4_inode_buffered_iomap(inode)) {
>>  			err = ext4_iomap_submit_zero_block(inode, from, end);
>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>> index 3050c887329f..ad05ebb49bf6 100644
>> --- a/fs/ext4/page-io.c
>> +++ b/fs/ext4/page-io.c
>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * If the old disk size is not block size aligned and the current
>> + * writeback range is entirely beyond the old EOF block, we should
>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
>> + * written out, otherwise, it may expose stale data in that block.
>> + */
>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
>> +				       loff_t pos, loff_t end)
>> +{
>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>> +	unsigned int blocksize = i_blocksize(inode);
>> +	loff_t disksize = READ_ONCE(ei->i_disksize);
>> +	ext4_lblk_t order_lblk, order_len;
>> +
>> +	/*
>> +	 * Waiting for ordered I/O is unnecessary when:
>> +	 * - The on-disk size is block-aligned (no stale data exists).
>> +	 * - The write start is within the block of the old EOF
>> +	 *   (overwriting, or appending to a block that already contains
>> +	 *   valid data).
>> +	 */
>> +	if (!(disksize & (blocksize - 1)) ||
>> +	    pos < round_up(disksize, blocksize))
>> +		return;
>> +
>> +	order_len = READ_ONCE(ei->i_ordered_len);
>> +	if (!order_len)
>> +		return;
>> +
>> +	/*
>> +	 * Pair with smp_store_release() in ext4_iomap_end_bio() and
>> +	 * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
>> +	 * that was written before the release store to i_ordered_len.
>> +	 */
>> +	smp_rmb();
>> +	order_lblk = READ_ONCE(ei->i_ordered_lblk);
>> +	if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
>> +		wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
>> +}
>> +
>>  static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
>>  					 loff_t end)
>>  {
>> @@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
>>  		goto out;
>>  	}
>>  
>> +	/* Wait ordered zero data to be written out. */
>> +	ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
>> +
>>  	/* We may need to convert one extent and dirty the inode. */
>>  	credits = ext4_chunk_trans_blocks(inode,
>>  			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
>> @@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
>>  	struct inode *inode = ioend->io_inode;
>>  	struct ext4_inode_info *ei = EXT4_I(inode);
>>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> +	unsigned long io_mode = (unsigned long)ioend->io_private;
>>  	unsigned long flags;
>>  
>> +	/*
>> +	 * This is an ordered I/O, clear the ordered range set in
>> +	 * ext4_block_zero_eof() and wake up all waiters that will update
>> +	 * the inode i_disksize.
>> +	 */
>> +	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
>> +		/*
>> +		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
>> +		 * Ensure i_ordered_len = 0 is visible before waking up
>> +		 * waiters.
>> +		 */
>> +		smp_store_release(&ei->i_ordered_len, 0);
>> +		wake_up_all(&ei->i_ordered_wq);
>> +		goto defer;
>> +	}
>> +
>>  	/* Needs to convert unwritten extents or update the i_disksize. */
>>  	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
>>  	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 62bfe05a64bc..9c0a00e716f3 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>>  	ext4_fc_init_inode(&ei->vfs_inode);
>>  	spin_lock_init(&ei->i_fc_lock);
>>  	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
>> +	ei->i_ordered_lblk = 0;
>> +	ei->i_ordered_len = 0;
>> +	init_waitqueue_head(&ei->i_ordered_wq);
>>  	return &ei->vfs_inode;
>>  }
>>  
>> @@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
>>  		dump_stack();
>>  	}
>>  
>> -	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
>> -	    WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
>> -		ext4_msg(inode->i_sb, KERN_ERR,
>> -			 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
>> -			 inode->i_ino, EXT4_I(inode),
>> -			 EXT4_I(inode)->i_reserved_data_blocks);
>> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
>> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
>> +			ext4_msg(inode->i_sb, KERN_ERR,
>> +				 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
>> +				 inode->i_ino, EXT4_I(inode),
>> +				 EXT4_I(inode)->i_reserved_data_blocks);
>> +
>> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
>> +			ext4_msg(inode->i_sb, KERN_ERR,
>> +				 "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
>> +				 inode->i_ino, EXT4_I(inode),
>> +				 EXT4_I(inode)->i_ordered_lblk,
>> +				 EXT4_I(inode)->i_ordered_len);
>> +	}
>>  }
>>  
>>  static void ext4_shutdown(struct super_block *sb)
>> -- 
>> 2.52.0
>>



* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-30  7:22     ` Zhang Yi
@ 2026-05-30  8:24       ` Zhang Yi
  2026-06-01 18:33         ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-30  8:24 UTC (permalink / raw)
  To: Zhang Yi, Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yizhang089, yangerkun,
	yukuai

On 5/30/2026 3:22 PM, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> For append writes, wait for ordered I/O to complete before updating
>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>> metadata update, preventing stale data from being exposed during
>>> unaligned post-EOF append writes.
>>>
>>> Suggested-by: Jan Kara <jack@suse.cz>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>>  fs/ext4/ext4.h    | 11 +++++++
>>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>  fs/ext4/super.c   | 23 ++++++++++----
>>>  4 files changed, 161 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>> index 078feda47e36..9ce2128eea3e 100644
>>> --- a/fs/ext4/ext4.h
>>> +++ b/fs/ext4/ext4.h
>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>>  #ifdef CONFIG_FS_ENCRYPTION
>>>  	struct fscrypt_inode_info *i_crypt_info;
>>>  #endif
>>> +
>>> +	/*
>>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
>>> +	 * and truncate-up operations. These parameters are used only in the
>>> +	 * iomap buffered I/O path.
>>> +	 */
>>> +	ext4_lblk_t i_ordered_lblk;
>>> +	ext4_lblk_t i_ordered_len;
>>> +	wait_queue_head_t i_ordered_wq;
>>>  };
>>>  
>>>  /*
>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>>  			     __u64 len, __u64 *moved_len);
>>>  
>>>  /* page-io.c */
>>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
>>> +
>>>  extern int __init ext4_init_pageio(void);
>>>  extern void ext4_exit_pageio(void);
>>>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index e013aeb03d7b..11fb369efeb1 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>  {
>>>  	struct iomap_ioend *ioend = wpc->wb_ctx;
>>>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>> +	ext4_lblk_t start, end, order_lblk, order_len;
>>>  
>>>  	/*
>>>  	 * After I/O completion, a worker needs to be scheduled when:
>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>  
>>> +	/*
>>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
>>> +	 * handling and must not be merged with regular I/O operations.
>>> +	 */
>>> +	order_len = READ_ONCE(ei->i_ordered_len);
>>> +	if (order_len) {
>>> +		/*
>>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
>>> +		 * Ensure we see the updated i_ordered_lblk that was written
>>> +		 * before the release store to i_ordered_len.
>>> +		 */
>>> +		smp_rmb();
>>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
>>> +				     ioend->io_offset + ioend->io_size);
>>> +
>>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
>>
>> Hi Zhang,
>>
>> I guess this check is enough because ordered_lblk and ordered_len will
>> always be contained in a single block.
> 
> Yeah.
> 
>>
>>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>
>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>> since we want to wake up the waiters. Is there any other reason?
> 
> My original intention was to avoid losing the
> EXT4_IOMAP_IOEND_ORDER_IO flag during the worker processing triggered by
> I/O completion, which could happen if an ordered ioend were merged with a
> normal ioend.  In patch 19, we need to check this flag to update
> i_disksize to the correct position.
> 
>>
>> Adding the boundary in ->writeback_submit() only affects
>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>> and deferred the IO to the wq. We ideally want it to affect
>> iomap_can_add_to_ioend(), i.e. we need to add IOMAP_F_BOUNDARY in
>> ->writeback_range().
> 
> IIUC, merging into the same ioend during the submission stage doesn't
> seem to cause any problems.
> 
>>
>> Secondly, I don't think boundary is the right flag here. It ensures
>> that everything before the ordered iomap gets submitted and the ordered
>> iomap starts a new ioend. This can still keep getting merged with the
>> newer ioends until we decide to submit the IO, which can delay waking
>> up the waiters. If we really want the "no merge" behavior, we'll have to
>> do something like [1] (Check the 2 NOMERGE flag patches).
> 
> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> still cannot prevent merging. I missed this, thank you for pointing this
> out. However, I think perhaps we should change iomap_ioend_can_merge()
> to check the iomap_ioend->io_private. Something like:
> 
> 	if (ioend->io_private || next->io_private)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            ioend->io_private != next->io_private


> 		return false;
> 
> What do you think?
> 
> Thanks,
> Yi.
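
To make the difference between the two proposed conditions concrete,
here is a small userspace model. It is only a sketch: struct model_ioend
and both function names are hypothetical, and the real
iomap_ioend_can_merge() checks more than io_private. Each function
returns true when the condition would reject the merge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as EXT4_IOMAP_IOEND_ORDER_IO in the patch, cast as it is there. */
#define MODEL_ORDER_IO ((void *)1UL)

/* Hypothetical model carrying only the field under discussion. */
struct model_ioend {
	void *io_private;
};

/* First proposal: refuse to merge if either ioend carries private data. */
static bool rejects_if_either_set(const struct model_ioend *ioend,
				  const struct model_ioend *next)
{
	assert(ioend && next);
	return ioend->io_private || next->io_private;
}

/* Corrected proposal: refuse to merge only if the private data differs. */
static bool rejects_if_different(const struct model_ioend *ioend,
				 const struct model_ioend *next)
{
	assert(ioend && next);
	return ioend->io_private != next->io_private;
}
```

Both variants keep an ordered ioend apart from a normal one. They
differ only for two ioends with identical non-NULL io_private, which
the corrected form allows to merge.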



* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-28 13:34     ` Ojaswin Mujoo
@ 2026-05-30  9:32       ` Zhang Yi
  2026-06-02  5:56         ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-05-30  9:32 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 5/28/2026 9:34 PM, Ojaswin Mujoo wrote:
> On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> For append writes, wait for ordered I/O to complete before updating
>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>> metadata update, preventing stale data from being exposed during
>>> unaligned post-EOF append writes.
>>>
>>> Suggested-by: Jan Kara <jack@suse.cz>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>>  fs/ext4/ext4.h    | 11 +++++++
>>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>  fs/ext4/super.c   | 23 ++++++++++----
>>>  4 files changed, 161 insertions(+), 13 deletions(-)
>>>
[...]
>>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>  					loff_t from, loff_t end)
>>>  {
>>>  	struct address_space *mapping = inode->i_mapping;
>>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>>>  	struct folio *folio;
>>>  	bool do_submit = false;
>>> +	int ret;
>>>  
>>>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>  	if (IS_ERR(folio))
>>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>  	folio_wait_writeback(folio);
>>>  	WARN_ON_ONCE(folio_test_writeback(folio));
>>>  
>>> -	if (likely(folio_test_dirty(folio)))
>>> +	/*
>>> +	 * Mark the ordered range. It will be cleared upon I/O completion
>>> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
>>> +	 * (including append write end io past the zeroed boundary,
>>> +	 * truncate up and append fallocate) must wait for this I/O to
>>> +	 * complete before updating i_disksize.
>>> +	 *
>>> +	 * When multiple overlapping unaligned EOF writes are in flight, we
>>> +	 * only need to track and wait for the first one. Subsequent writes
>>> +	 * will zero the gap in memory and ensure that the zeroed data is
>>> +	 * written out along with the valid data in the same block before
>>> +	 * i_disksize is updated.
>>> +	 */
>>> +	if (likely(folio_test_dirty(folio) &&
>>> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
>>> +		WRITE_ONCE(ei->i_ordered_lblk,
>>> +			   from >> inode->i_blkbits);
>>> +		/*
>>> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
>>> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
>>> +		 * i_ordered_lblk is visible when i_ordered_len becomes
>>> +		 * non-zero.
>>> +		 */
>>> +		smp_store_release(&ei->i_ordered_len, 1);
>>>  		do_submit = true;
>>> +	}
>>>  	folio_unlock(folio);
>>>  	folio_put(folio);
>>>  
>>>  	/* Submit zeroed block. */
>>> -	if (do_submit)
>>> -		return filemap_fdatawrite_range(mapping, from, end - 1);
>>> +	if (do_submit) {
>>> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
>>> +		if (ret) {
>>> +			/*
>>> +			 * Pairs with wait_event() in
>>> +			 * ext4_iomap_wb_ordered_wait(). Ensure
>>> +			 * i_ordered_len = 0 is visible before waking up
>>> +			 * waiters.
>>> +			 */
>>> +			smp_store_release(&ei->i_ordered_len, 0);
>>> +			wake_up_all(&ei->i_ordered_wq);
>>> +			return ret;
> 
> Okay so even if the ordered IO fails we still let the i_disksize updates
> go ahead? 

Yes when data_err=ignore, no when data_err=abort.

> I think this is a deviation from the current behavior where we
> abort the journal. If this is acceptable we should at least add a comment
> on why it's okay.
> 

I think this behavior is consistent with the current data=ordered mode.
In data_err=ignore mode, if an I/O write fails, ext4_end_io_end()
does not abort the journal, so i_disksize is still updated normally.
Conversely, in data_err=abort mode, the journal is aborted, so
i_disksize is not updated and cannot be updated afterwards. Am I
missing something?

>>> +		}
>>> +	}
>>>  	return 0;
>>>  }
>>>  
>>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>  		 * data=ordered mode. We submit zeroed range directly here.
>>>  		 * Do not wait for I/O completion for performance.
>>>  		 *
>>> -		 * TODO: Any operation that extends i_disksize (including
>>> -		 * append write end io past the zeroed boundary, truncate up,
>>> -		 * and append fallocate) must wait for the relevant I/O to
>>> -		 * complete before updating i_disksize.
>>> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
>>> +		 * for I/O completion before updating i_disksize if the write
>>> +		 * extends beyond the zeroed boundary.
>>> +		 *
>>> +		 * TODO: Any other operation that extends i_disksize
>>> +		 * (including truncate up and append fallocate) must wait for
>>> +		 * the relevant I/O to complete before updating i_disksize.
>>>  		 */
>>>  		} else if (ext4_inode_buffered_iomap(inode)) {
>>>  			err = ext4_iomap_submit_zero_block(inode, from, end);
>>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>>> index 3050c887329f..ad05ebb49bf6 100644
>>> --- a/fs/ext4/page-io.c
>>> +++ b/fs/ext4/page-io.c
>>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
>>>  	return 0;
>>>  }
>>>  
>>> +/*
>>> + * If the old disk size is not block size aligned and the current
>>> + * writeback range is entirely beyond the old EOF block, we should
>>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
>>> + * written out, otherwise, it may expose stale data in that block.
>>> + */
>>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
>>> +				       loff_t pos, loff_t end)
>>> +{
>>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>>> +	unsigned int blocksize = i_blocksize(inode);
>>> +	loff_t disksize = READ_ONCE(ei->i_disksize);
>>> +	ext4_lblk_t order_lblk, order_len;
>>> +
>>> +	/*
>>> +	 * Waiting for ordered I/O is unnecessary when:
>>> +	 * - The on-disk size is block-aligned (no stale data exists).
>>> +	 * - The write start is within the block of the old EOF
>>> +	 *   (overwriting, or appending to a block that already contains
>>> +	 *   valid data).
>>> +	 */
>>> +	if (!(disksize & (blocksize - 1)) ||
>>> +	    pos < round_up(disksize, blocksize))
>>> +		return;
> 
> Okay these checks are pretty confusing. I was initially thinking that
> i_disksize's block would always be equal to i_ordered_lblk but seems
> like that is not true because ext4_block_zero_eof() uses from=i_size.

Yeah, this is the key point that I was a bit confused about as
well.

> 
> So we could have a sequence where
> 
> 1. truncate 4k (i_disksize = i_size = 4k)
> 2. write 8k,10k (i_disksize = 4k i_size = 10k, i_ordered_len = 0 (old i_size is block aligned))
> 3. write 16k,18k (i_disksize = 4k i_size = 10k, i_ordered_len = 1, lblk=4)
                                             18k                     lblk=2, (10k >> 12)
> 
> Here we issue ordered IO even though it's probably not needed. Now if
> write 3 finishes first we see disksize as 4k so we don't wait for
> ordered write. Which seems okay since we don't risk any stale data
> exposure. However, this flow is pretty confusing.

Indeed!

> 
> Can't we somehow avoid having to issue/set ordered len/lblk in case it
> is not really needed, like only issue it if i_disksize (and not i_size)
> is unaligned. That can simplify some of our checks and avoid extra IO
> overhead.
> 

I was also planning to explore optimizations on this point next.
However, since the original logic in buffer_head already works this way,
keeping the same logic in the iomap path will not introduce any
additional side effects. To avoid unnecessary waiting, I simply added
the disksize alignment check in ext4_iomap_wb_ordered_wait().
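
As a rough illustration of that early-return check, here is a userspace
sketch. The function name is hypothetical and it models only the first
test in ext4_iomap_wb_ordered_wait(); the real function goes on to check
i_ordered_len and i_ordered_lblk before it actually waits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return false when waiting for ordered I/O is certainly unnecessary:
 *  - the on-disk size is block aligned (no stale data in the EOF block), or
 *  - the write starts inside the block that holds the old on-disk EOF.
 * Return true when the later i_ordered_len/i_ordered_lblk checks would
 * still have to decide.
 */
static bool ordered_wait_may_be_needed(uint64_t disksize, uint64_t pos,
				       unsigned int blocksize)
{
	uint64_t mask = (uint64_t)blocksize - 1;
	uint64_t eof_block_end;

	/* Block size must be a non-zero power of two. */
	assert(blocksize && !(blocksize & mask));

	if (!(disksize & mask))
		return false;

	/* Equivalent to round_up(disksize, blocksize). */
	eof_block_end = (disksize + mask) & ~mask;
	if (pos < eof_block_end)
		return false;

	return true;
}
```

For the sequence quoted above (i_disksize = 4k, write 3 starting at
16k), the first condition already returns false, so no wait happens.
With an unaligned i_disksize of 10k and 4k blocks, a write starting at
16k would fall through to the later checks.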

Therefore, I do not plan to implement this optimization in this series.
I can open a separate series later to address this optimization — perhaps
by checking i_disksize in ext4_block_zero_eof() before issuing or adding
ordered I/O, and the buffer_head path might also benefit from optimization.
Meanwhile, to avoid confusion, I can add a TODO comment in this patch.

What do you think?

Cheers,
Yi.



* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-30  1:21         ` Zhang Yi
@ 2026-06-01  6:26           ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-01  6:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai

On Sat, May 30, 2026 at 09:21:31AM +0800, Zhang Yi wrote:
> On 5/29/2026 11:32 PM, Ojaswin Mujoo wrote:
> > On Fri, May 29, 2026 at 10:12:12PM +0800, Zhang Yi wrote:
> >> Hi, Ojaswin!
> >>
> >> On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
> >>> On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> >>>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>>
> >>>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> >>>>
> >>>>   - ext4_iomap_writepages(): the main writeback entry point.
> >>>>   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> >>>>     block mapping and I/O submission.
> >>>>   - A new end I/O worker for converting unwritten extents, updating file
> >>>>     size, and handling DATA_ERR_ABORT after I/O completion.
> >>>>
> >>>> Core implementation details:
> >>>>
> >>>>   - ->writeback_range() callback
> >>>>     Calls ext4_iomap_map_writeback_range() to query the longest range of
> >>>>     existing mapped extents. For performance, when a block range is not
> >>>>     yet allocated, it allocates based on the writeback length and delalloc
> >>>>     extent length, rather than allocating for a single folio at a time.
> >>>>     The folio is then added to an iomap_ioend instance.
> >>>>
> >>>>   - ->writeback_submit() callback
> >>>>     Registers ext4_iomap_end_bio() as the end bio callback. This callback
> >>>>     schedules a worker to handle:
> >>>>     - Unwritten extent conversion.
> >>>>     - i_disksize update after data is written back.
> >>>>     - Journal abort on writeback I/O failure.
> >>>
> >>> Hi Zhang, the changes look good. I have a few comments below:
> >>>>
> >>>> Key changes and considerations:
> >>>>
> >>>> - Append write and unwritten extents
> >>>>    Since data=ordered mode is not used to prevent stale data exposure
> >>>>    during append writebacks, new blocks are always allocated as unwritten
> >>>>    extents (i.e. always enable dioread_nolock), and i_disksize update is
> >>>>    postponed until I/O completion.
> >>>
> >>> Makes sense.
> >>>
> >>>>    Additionally, the deadlock that the
> >>>>    reserve handle was expected to resolve does not occur anymore.
> >>>
> >>> I guess this is since we don't use ordered data so we can't block on
> >>> starting a txn in end io.
> >>
> >> Yep.
> >>
> >>>
> >>>>    Therefore, the end I/O worker can start a normal journal handle
> >>>>    instead of a reserve handle when converting unwritten extents.
> >>>>
> >>>> - Lock ordering
> >>>>    The ->writeback_range() callback runs under the folio lock, requiring
> >>>>    the journal handle to be started under that same lock. This reverses
> >>>>    the order compared to the buffer_head writeback path. The lock ordering
> >>>>    documentation in super.c has been updated accordingly.
> >>>>
> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>>> ---
> >>>>   fs/ext4/ext4.h        |   4 +
> >>>>   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
> >>>>   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
> >>>>   fs/ext4/super.c       |   7 +-
> >>>>   fs/iomap/ioend.c      |   3 +-
> >>>>   include/linux/iomap.h |   1 +
> >>>>   6 files changed, 346 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>> index 4832e7f7db82..078feda47e36 100644
> >>>> --- a/fs/ext4/ext4.h
> >>>> +++ b/fs/ext4/ext4.h
> >>>> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> >>>>   	 */
> >>>>   	struct list_head i_rsv_conversion_list;
> >>>>   	struct work_struct i_rsv_conversion_work;
> >>>> +	struct list_head i_iomap_ioend_list;
> >>>> +	struct work_struct i_iomap_ioend_work;
> >>>>   	/*
> >>>>   	 * Transactions that contain inode's metadata needed to complete
> >>>> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
> >>>>   		size_t len);
> >>>>   extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
> >>>>   extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> >>>> +extern void ext4_iomap_end_io(struct work_struct *work);
> >>>> +extern void ext4_iomap_end_bio(struct bio *bio);
> >>>>   /* mmp.c */
> >>>>   extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>>> index 1ae7d3f4a1c8..a80195bd6f20 100644
> >>>> --- a/fs/ext4/inode.c
> >>>> +++ b/fs/ext4/inode.c
> >>>> @@ -44,6 +44,7 @@
> >>>>   #include <linux/iversion.h>
> >>>>   #include "ext4_jbd2.h"
> >>>> +#include "ext4_extents.h"
> >>>>   #include "xattr.h"
> >>>>   #include "acl.h"
> >>>>   #include "truncate.h"
> >>>> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> >>>>   	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> >>>>   }
> >>>> +static int ext4_iomap_map_one_extent(struct inode *inode,
> >>>> +				     struct ext4_map_blocks *map)
> >>>> +{
> >>>> +	struct extent_status es;
> >>>> +	handle_t *handle = NULL;
> >>>> +	int credits, map_flags;
> >>>> +	int retval;
> >>>> +
> >>>> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >>>> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> >>>> +	if (IS_ERR(handle))
> >>>> +		return PTR_ERR(handle);
> >>>> +
> >>>> +	map->m_flags = 0;
> >>>> +	/*
> >>>> +	 * It is necessary to look up extent and map blocks under i_data_sem
> >>>> +	 * in write mode, otherwise, the delalloc extent may become stale
> >>>> +	 * during concurrent truncate operations.
> >>>> +	 */
> >>>> +	ext4_fc_track_inode(handle, inode);
> >>>> +	down_write(&EXT4_I(inode)->i_data_sem);
> >>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> >>>> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> >>>> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> >>>> +
> >>>> +		if (ext4_es_is_delayed(&es)) {
> >>>
> >>> I understand that it is okay for us to rely on extent status ==
> >>> delayed here because we never reclaim delayed es entries and hence we
> >>> are sure to not skip any delayed block allocations here.
> >>
> >> Yeah, right.
> >>
> >>>
> >>>> +			map->m_flags |= EXT4_MAP_DELAYED;
> >>>> +			trace_ext4_da_write_pages_extent(inode, map);
> >>>> +			/*
> >>>> +			 * Call ext4_map_create_blocks() to allocate any
> >>>> +			 * delayed allocation blocks. It is possible that
> >>>> +			 * we're going to need more metadata blocks, however
> >>>> +			 * we must not fail because we're in writeback and
> >>>> +			 * there is nothing we can do so it might result in
> >>>> +			 * data loss. So use reserved blocks to allocate
> >>>> +			 * metadata if possible.
> >>>> +			 */
> >>>> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> >>>> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> >>>> +				    EXT4_EX_NOCACHE;
> >>>> +
> >>>> +			retval = ext4_map_create_blocks(handle, inode, map,
> >>>> +							map_flags);
> >>>> +			if (retval > 0)
> >>>> +				ext4_fc_track_range(handle, inode, map->m_lblk,
> >>>> +						map->m_lblk + map->m_len - 1);
> >>>> +			goto out;
> >>>> +		} else if (unlikely(ext4_es_is_hole(&es)))
> >>>
> >>> Now that you've fixed the partial invalidate in iomap (patch 12/23)
> >>> can we still hit this hole case?
> >>
> >> Theoretically, no hole should be encountered; this is just defensive
> >> programming.
> >>
> >>>
> >>>> +			goto out;
> >>>> +
> >>>> +		/* Found written or unwritten extent. */
> >>>> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> >>>> +		map->m_flags = ext4_es_is_written(&es) ?
> >>>> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> >>>> +		goto out;
> >>>> +	}
> >>>> +
> >>>> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> >>>> +out:
> >>>> +	up_write(&EXT4_I(inode)->i_data_sem);
> >>>> +	ext4_journal_stop(handle);
> >>>> +	return retval < 0 ? retval : 0;
> >>>> +}
> >>>> +
> >>>> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> >>>> +					  loff_t offset, unsigned int dirty_len)
> >>>> +{
> >>>> +	struct inode *inode = wpc->inode;
> >>>> +	struct super_block *sb = inode->i_sb;
> >>>> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> >>>> +	struct ext4_map_blocks map;
> >>>> +	unsigned int blkbits = inode->i_blkbits;
> >>>> +	unsigned int index = offset >> blkbits;
> >>>> +	unsigned int blk_end, blk_len;
> >>>> +	int ret;
> >>>> +
> >>>> +	ret = ext4_emergency_state(sb);
> >>>> +	if (unlikely(ret))
> >>>> +		return ret;
> >>>> +
> >>>> +	/* Check validity of the cached writeback mapping. */
> >>>> +	if (offset >= wpc->iomap.offset &&
> >>>> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> >>>> +	    ext4_iomap_valid(inode, &wpc->iomap))
> >>>> +		return 0;
> >>>> +
> >>>> +	blk_len = dirty_len >> blkbits;
> >>>> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> >>>> +				      (UINT_MAX - 1));
> >>>
> >>> This is an interesting idea. I'm just a bit worried when we have
> >>> range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
> >>> MAX_WRITEPAGES; in case of a slightly fragmented FS, we might keep
> >>> falling into slower mballoc criteria and might waste a lot of time
> >>> scanning the groups.
> >>
> >> Actually, MAX_WRITEPAGES is not allocated every time; the allocated
> >> length also depends on the length of data that has already been delayed
> >> for writing, and the smaller value is taken. If the user has indeed
> > 
> > Hmm so we take the blk_end based on range_end (which is LLONG_MAX for bg
> > flusher) and then our blk_len will be set accordingly and would become a
> > large number as well. Then we will set map.m_len based on this blk_len
> > and MAX_WRITEPAGES. Am I missing something that clamps our m_len?
> > 
> Please take a look at ext4_iomap_map_one_extent():
> 
> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> 
> In this case, m_len is truncated to the length of the delalloc extent.
> 
> Thanks,
> Yi.

Oh yes of course, thanks for the clarification. Looks good then.
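
For anyone following along, the clamp Yi points to can be modeled in
user space. This is only a sketch: struct extent and clamp_to_extent()
are hypothetical stand-ins for the extent-status entry and the two
quoted lines from ext4_iomap_map_one_extent(), not kernel code.

```c
#include <assert.h>

/* Hypothetical stand-in for the es_lblk/es_len fields of an extent-status entry. */
struct extent {
	unsigned int lblk;	/* first logical block of the extent */
	unsigned int len;	/* number of blocks in the extent */
};

/*
 * Model of the two quoted lines: the mapped length is the smaller of the
 * requested length and what remains of the extent from m_lblk onwards.
 * The caller must pass an m_lblk that lies inside the extent.
 */
unsigned int clamp_to_extent(const struct extent *es,
			     unsigned int m_lblk, unsigned int m_len)
{
	unsigned int remaining;

	assert(m_lblk >= es->lblk && m_lblk - es->lblk < es->len);
	remaining = es->len - (m_lblk - es->lblk);
	return remaining < m_len ? remaining : m_len;
}
```

With a 16-block delalloc extent starting at block 100, a request for
UINT_MAX - 1 blocks starting at block 104 comes back as 12 blocks, so in
this model a huge blk_len from a background flush is cut down to the
delalloc extent before anything is allocated.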

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thanks,
Ojaswin
> 

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-05-30  2:53     ` Zhang Yi
@ 2026-06-01  9:08       ` Ojaswin Mujoo
  2026-06-01 12:22         ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-01  9:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> In the generic buffer_head I/O path, we rely on the data=ordered mode to
> >> ensure that the zeroed EOF block data is written before updating
> >> i_disksize, thus preventing stale data from being exposed.
> >>
> >> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> >> we issue the I/O immediately after performing the zero operation
> >> (without synchronous waiting for performance). This can reduce the risk
> >> of exposing stale data, but it does not guarantee that the zero data
> >> will be flushed to disk before the metadata of i_disksize is updated.
> >> The subsequent patches will wait for this I/O to complete before
> >> updating i_disksize.
> >>
> >> Suggested-by: Jan Kara <jack@suse.cz>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > I think we discussed that we may not need to do this [1] but I guess
> > you've decided to make the tradeoff of issuing the IO to avoid having to
> > wait for bg flush to complete the tail page zeroing 
> > 
> 
> Yes. For truncate up and append fallocate, originally i_disksize would
> be updated immediately, and the change would be persisted via the
> journal within default 5 seconds. But now, if the tail page is not
> committed immediately, the update to i_disksize will be delayed by about
> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
> not sure what impact this change might have — I just don't really want
> to introduce it.

> For normal append writes, the impact is minimal, unless we call
> sync_range to sync the portion of data that extends beyond EOF.

Hmm while trying to retain the behavior for falloc and truncate up,
we end up changing it for append writes :) But anyways, I understand
your reasoning and don't have any strong opinions against it. I'll let
Jan pitch in since he had some comments around this.

> 
> In addition, if the zeroed page is not issued here immediately, the
> logic will become more complex because we need to be more careful about the
> order of write-back IOs to prevent deadlock issues caused by mutual
> waiting.

You mean an endio completion waiting for ordered IO to complete but
ordered IO writeback is somehow waiting for this endio completion? Is that
actually possible?
> 
> > However, I think one side effect might be many threads calling the
> > writeback mechanism to issue zero IOs which might not scale well. I
> > don't know if it'll be a huge problem though, I guess it's a sort of
> > thing we will have to deal with in case we see it in real world
> > workloads.
> > 
> 
> I agree with you. However, I suspect that unless we run some specific
> benchmark tests, it should be difficult to encounter a large number of
> post-EOF append writes and truncate up operations in real-world usage
> scenarios — and I haven't come across such scenarios yet. For
> simplicity, I'd like to proceed with this implementation for now. If we
> do run into actual problems later, we can consider not issuing I/O
> directly here, but instead: 1) find the ordered block in
> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
> for normal background writeback as well — otherwise, there is a risk of
> deadlock (mutual waiting). What do you think?

Yes sounds good Yi, we can deal with performance tuning later.

Regards,
Ojaswin

> 
> Cheers,
> Yi.
> 
> > [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> > 
> >> ---
> >>  fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> >>  1 file changed, 55 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 239d387ffaf2..e013aeb03d7b 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> >>  					zero_written);
> >>  }
> >>  
> >> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> >> +					loff_t from, loff_t end)
> >> +{
> >> +	struct address_space *mapping = inode->i_mapping;
> >> +	struct folio *folio;
> >> +	bool do_submit = false;
> >> +
> >> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >> +	if (IS_ERR(folio))
> >> +		/* Already writeback and clear? */
> >> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> >> +
> >> +	folio_wait_writeback(folio);
> >> +	WARN_ON_ONCE(folio_test_writeback(folio));
> >> +
> >> +	if (likely(folio_test_dirty(folio)))
> >> +		do_submit = true;
> >> +	folio_unlock(folio);
> >> +	folio_put(folio);
> >> +
> >> +	/* Submit zeroed block. */
> >> +	if (do_submit)
> >> +		return filemap_fdatawrite_range(mapping, from, end - 1);
> >> +	return 0;
> >> +}
> >> +
> >>  /*
> >>   * Zero out a mapping from file offset 'from' up to the end of the block
> >>   * which corresponds to 'from' or to the given 'end' inside this block.
> >> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >>  	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> >>  		return 0;
> >>  
> >> -	if (length > blocksize - offset)
> >> +	if (length > blocksize - offset) {
> >>  		length = blocksize - offset;
> >> +		end = from + length;
> >> +	}
> >>  
> >>  	err = ext4_block_zero_range(inode, from, length,
> >>  				    &did_zero, &zero_written);
> >> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >>  	 * TODO: In the iomap path, handle this by updating i_disksize to
> >>  	 * i_size after the zeroed data has been written back.
> >>  	 */
> >> -	if (ext4_should_order_data(inode) &&
> >> -	    did_zero && zero_written && !IS_DAX(inode)) {
> >> -		handle_t *handle;
> >> +	if (did_zero && zero_written && !IS_DAX(inode)) {
> >> +		if (ext4_should_order_data(inode)) {
> >> +			handle_t *handle;
> >>  
> >> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> >> -		if (IS_ERR(handle))
> >> -			return PTR_ERR(handle);
> >> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> >> +			if (IS_ERR(handle))
> >> +				return PTR_ERR(handle);
> >>  
> >> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> >> -		ext4_journal_stop(handle);
> >> -		if (err)
> >> -			return err;
> >> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> >> +							length);
> >> +			ext4_journal_stop(handle);
> >> +			if (err)
> >> +				return err;
> >> +		/*
> >> +		 * inodes using the iomap buffered I/O path do not use the
> >> +		 * data=ordered mode. We submit zeroed range directly here.
> >> +		 * Do not wait for I/O completion for performance.
> >> +		 *
> >> +		 * TODO: Any operation that extends i_disksize (including
> >> +		 * append write end io past the zeroed boundary, truncate up,
> >> +		 * and append fallocate) must wait for the relevant I/O to
> >> +		 * complete before updating i_disksize.
> >> +		 */
> >> +		} else if (ext4_inode_buffered_iomap(inode)) {
> >> +			err = ext4_iomap_submit_zero_block(inode, from, end);
> >> +			if (err)
> >> +				return err;
> >> +		}
> >>  	}
> >>  
> >>  	return 0;
> >> -- 
> >> 2.52.0
> >>
> 

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-06-01  9:08       ` Ojaswin Mujoo
@ 2026-06-01 12:22         ` Zhang Yi
  2026-06-01 18:15           ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-06-01 12:22 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> In the generic buffer_head I/O path, we rely on the data=ordered mode to
>>>> ensure that the zeroed EOF block data is written before updating
>>>> i_disksize, thus preventing stale data from being exposed.
>>>>
>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>> we issue the I/O immediately after performing the zero operation
>>>> (without synchronous waiting for performance). This can reduce the risk
>>>> of exposing stale data, but it does not guarantee that the zero data
>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>> The subsequent patches will wait for this I/O to complete before
>>>> updating i_disksize.
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> I think we discussed that we may not need to do this [1] but I guess
>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>> wait for bg flush to complete the tail page zeroing
>>>

I'm glad to hear that, thanks.

>>
>> Yes. For truncate up and append fallocate, originally i_disksize would
>> be updated immediately, and the change would be persisted via the
>> journal within default 5 seconds. But now, if the tail page is not
>> committed immediately, the update to i_disksize will be delayed by about
>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>> not sure what impact this change might have — I just don't really want
>> to introduce it.
> 
>> For normal append writes, the impact is minimal, unless we call
>> sync_range to sync the portion of data that extends beyond EOF.
> 
> Hmm while trying to retain the behavior for falloc and truncate up,
> we end up changing it for append writes :) But anyways, I understand
> your reasoning and don't have any strong opinions against it. I'll let
> Jan pitch in since he had some comments around this.
> 
>>
>> In addition, if the zeroed page is not issued here immediately, the
>> logic will become more complex because we need to be more careful about the
>> order of write-back IOs to prevent deadlock issues caused by mutual
>> waiting.
> 
> You mean an endio completion waiting for ordered IO to complete but
> ordered IO writeback is somehow waiting for this endio completion? Is that
> actually possible?

Well, after thinking it over more carefully, it seems this is
impossible; I cannot think of a scenario that could trigger this kind of
issue. The generic writeback process always executes writeback in folio
index order, so there would be no situation where a later data I/O
depends on an earlier ordered I/O. Even if both kinds of IOs are placed
in the same ioend, there should be no problem. I was confused and
overthinking it.

From this perspective, if we can accept that truncate up and fallocate
will have a longer persistence time by default (I guess this is
acceptable), we can avoid writing back zeroed data immediately. To
achieve this, we only need to consider the case of sync file range. :-)

Regards,
Yi.
>>
>>> However, I think one side effect might be many threads calling the
>>> writeback mechanism to issue zero IOs which might not scale well. I
>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>> thing we will have to deal with in case we see it in real world
>>> workloads.
>>>
>>
>> I agree with you. However, I suspect that unless we run some specific
>> benchmark tests, it should be difficult to encounter a large number of
>> post-EOF append writes and truncate up operations in real-world usage
>> scenarios — and I haven't come across such scenarios yet. For
>> simplicity, I'd like to proceed with this implementation for now. If we
>> do run into actual problems later, we can consider not issuing I/O
>> directly here, but instead: 1) find the ordered block in
>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>> for normal background writeback as well — otherwise, there is a risk of
>> deadlock (mutual waiting). What do you think?
> 
> Yes sounds good Yi, we can deal with performance tuning later.
> 
> Regards,
> Ojaswin
> 
>>
>> Cheers,
>> Yi.
>>
>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>
>>>> ---
>>>>   fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>>   1 file changed, 55 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>>   					zero_written);
>>>>   }
>>>>   
>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>> +					loff_t from, loff_t end)
>>>> +{
>>>> +	struct address_space *mapping = inode->i_mapping;
>>>> +	struct folio *folio;
>>>> +	bool do_submit = false;
>>>> +
>>>> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>> +	if (IS_ERR(folio))
>>>> +		/* Already writeback and clear? */
>>>> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>> +
>>>> +	folio_wait_writeback(folio);
>>>> +	WARN_ON_ONCE(folio_test_writeback(folio));
>>>> +
>>>> +	if (likely(folio_test_dirty(folio)))
>>>> +		do_submit = true;
>>>> +	folio_unlock(folio);
>>>> +	folio_put(folio);
>>>> +
>>>> +	/* Submit zeroed block. */
>>>> +	if (do_submit)
>>>> +		return filemap_fdatawrite_range(mapping, from, end - 1);
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * Zero out a mapping from file offset 'from' up to the end of the block
>>>>    * which corresponds to 'from' or to the given 'end' inside this block.
>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>   	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>>   		return 0;
>>>>   
>>>> -	if (length > blocksize - offset)
>>>> +	if (length > blocksize - offset) {
>>>>   		length = blocksize - offset;
>>>> +		end = from + length;
>>>> +	}
>>>>   
>>>>   	err = ext4_block_zero_range(inode, from, length,
>>>>   				    &did_zero, &zero_written);
>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>   	 * TODO: In the iomap path, handle this by updating i_disksize to
>>>>   	 * i_size after the zeroed data has been written back.
>>>>   	 */
>>>> -	if (ext4_should_order_data(inode) &&
>>>> -	    did_zero && zero_written && !IS_DAX(inode)) {
>>>> -		handle_t *handle;
>>>> +	if (did_zero && zero_written && !IS_DAX(inode)) {
>>>> +		if (ext4_should_order_data(inode)) {
>>>> +			handle_t *handle;
>>>>   
>>>> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> -		if (IS_ERR(handle))
>>>> -			return PTR_ERR(handle);
>>>> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> +			if (IS_ERR(handle))
>>>> +				return PTR_ERR(handle);
>>>>   
>>>> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>> -		ext4_journal_stop(handle);
>>>> -		if (err)
>>>> -			return err;
>>>> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>> +							length);
>>>> +			ext4_journal_stop(handle);
>>>> +			if (err)
>>>> +				return err;
>>>> +		/*
>>>> +		 * inodes using the iomap buffered I/O path do not use the
>>>> +		 * data=ordered mode. We submit zeroed range directly here.
>>>> +		 * Do not wait for I/O completion for performance.
>>>> +		 *
>>>> +		 * TODO: Any operation that extends i_disksize (including
>>>> +		 * append write end io past the zeroed boundary, truncate up,
>>>> +		 * and append fallocate) must wait for the relevant I/O to
>>>> +		 * complete before updating i_disksize.
>>>> +		 */
>>>> +		} else if (ext4_inode_buffered_iomap(inode)) {
>>>> +			err = ext4_iomap_submit_zero_block(inode, from, end);
>>>> +			if (err)
>>>> +				return err;
>>>> +		}
>>>>   	}
>>>>   
>>>>   	return 0;
>>>> -- 
>>>> 2.52.0
>>>>
>>


* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-06-01 12:22         ` Zhang Yi
@ 2026-06-01 18:15           ` Ojaswin Mujoo
  2026-06-02  3:36             ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-01 18:15 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai

On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
> > > On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> > > > On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > > 
> > > > > > In the generic buffer_head I/O path, we rely on the data=ordered mode to
> > > > > ensure that the zeroed EOF block data is written before updating
> > > > > i_disksize, thus preventing stale data from being exposed.
> > > > > 
> > > > > However, the iomap buffered I/O path cannot use this mechanism. Instead,
> > > > > we issue the I/O immediately after performing the zero operation
> > > > > (without synchronous waiting for performance). This can reduce the risk
> > > > > of exposing stale data, but it does not guarantee that the zero data
> > > > > will be flushed to disk before the metadata of i_disksize is updated.
> > > > > The subsequent patches will wait for this I/O to complete before
> > > > > updating i_disksize.
> > > > > 
> > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > > 
> > > > I think we discussed that we may not need to do this [1] but I guess
> > > > you've decided to make the tradeoff of issuing the IO to avoid having to
> > > > wait for bg flush to complete the tail page zeroing
> > > > 
> 
> I'm glad to hear that, thanks.
> 
> > > 
> > > Yes. For truncate up and append fallocate, originally i_disksize would
> > > be updated immediately, and the change would be persisted via the
> > > journal within default 5 seconds. But now, if the tail page is not
> > > committed immediately, the update to i_disksize will be delayed by about
> > > 30 seconds, and persistence will be postponed to around 35 seconds. I'm
> > > not sure what impact this change might have — I just don't really want
> > > to introduce it.
> > 
> > > For normal append writes, the impact is minimal, unless we call
> > > sync_range to sync the portion of data that extends beyond EOF.
> > 
> > Hmm while trying to retain the behavior for falloc and truncate up,
> > we end up changing it for append writes :) But anyways, I understand
> > your reasoning and don't have any strong opinions against it. I'll let
> > Jan pitch in since he had some comments around this.
> > 
> > > 
> > > In addition, if the zeroed page is not issued here immediately, the
> > > logic will become more complex because we need to be more careful about the
> > > order of write-back IOs to prevent deadlock issues caused by mutual
> > > waiting.
> > 
> > You mean an endio completion waiting for ordered IO to complete but
> > ordered IO writeback is somehow waiting for this endio completion? Is that
> > actually possible?
> 
> Well, after thinking it over more carefully, it seems this is
> impossible; I cannot think of a scenario that could trigger this kind of
> issue. The generic writeback process always executes writeback in folio
> index order, so there would be no situation where a later data I/O
> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
> in the same ioend, there should be no problem. I was confused and
> overthinking it.
> 
> From this perspective, if we can accept that truncate up and fallocate
> will have a longer persistence time by default (I guess this is
> acceptable), we can avoid writing back zeroed data immediately. To
> achieve this, we only need to consider the case of sync file range. :-)

Yeah, I think during writeback we will have to submit the ordered data
if we are writing back data beyond the i_disksize.

If this is straightforward enough to implement, I think this approach
would be the safer choice because we will avoid the overhead of small,
random ordered IOs overworking the writeback layer.
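
As a rough illustration of the condition I mean (this is not something
the series implements; must_flush_ordered() and struct inode_state are
hypothetical names): writeback would need to push the tracked zeroed
range out first whenever such a range is pending and the data being
written back ends beyond i_disksize.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the per-inode state discussed in this thread. */
struct inode_state {
	uint64_t disksize;	/* on-disk size in bytes (i_disksize) */
	uint32_t ordered_len;	/* blocks of zeroed data still tracked, 0 if none */
};

/*
 * wb_end is the exclusive end, in bytes, of the range being written back.
 * Ordered data only matters when the writeback would extend the on-disk
 * size and there is a pending ordered range to protect.
 */
bool must_flush_ordered(const struct inode_state *st, uint64_t wb_end)
{
	return st->ordered_len != 0 && wb_end > st->disksize;
}
```

Writeback that stays within i_disksize, or that runs when no ordered
range is tracked, would take the normal path in this sketch.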

> 
> Regards,
> Yi.
> > > 
> > > > However, I think one side effect might be many threads calling the
> > > > writeback mechanism to issue zero IOs which might not scale well. I
> > > > don't know if it'll be a huge problem though, I guess it's a sort of
> > > > thing we will have to deal with in case we see it in real world
> > > > workloads.
> > > > 
> > > 
> > > I agree with you. However, I suspect that unless we run some specific
> > > benchmark tests, it should be difficult to encounter a large number of
> > > post-EOF append writes and truncate up operations in real-world usage
> > > scenarios — and I haven't come across such scenarios yet. For
> > > simplicity, I'd like to proceed with this implementation for now. If we
> > > do run into actual problems later, we can consider not issuing I/O
> > > directly here, but instead: 1) find the ordered block in
> > > ext4_sync_file() and perform writeback; 2) ensure writeback ordering
> > > for normal background writeback as well — otherwise, there is a risk of
> > > deadlock (mutual waiting). What do you think?
> > 
> > Yes sounds good Yi, we can deal with performance tuning later.
> > 
> > Regards,
> > Ojaswin
> > 
> > > 
> > > Cheers,
> > > Yi.
> > > 
> > > > [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> > > > 
> > > > > ---
> > > > >   fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> > > > >   1 file changed, 55 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 239d387ffaf2..e013aeb03d7b 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> > > > >   					zero_written);
> > > > >   }
> > > > > +static int ext4_iomap_submit_zero_block(struct inode *inode,
> > > > > +					loff_t from, loff_t end)
> > > > > +{
> > > > > +	struct address_space *mapping = inode->i_mapping;
> > > > > +	struct folio *folio;
> > > > > +	bool do_submit = false;
> > > > > +
> > > > > +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> > > > > +	if (IS_ERR(folio))
> > > > > +		/* Already writeback and clear? */
> > > > > +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> > > > > +
> > > > > +	folio_wait_writeback(folio);
> > > > > +	WARN_ON_ONCE(folio_test_writeback(folio));
> > > > > +
> > > > > +	if (likely(folio_test_dirty(folio)))
> > > > > +		do_submit = true;
> > > > > +	folio_unlock(folio);
> > > > > +	folio_put(folio);
> > > > > +
> > > > > +	/* Submit zeroed block. */
> > > > > +	if (do_submit)
> > > > > +		return filemap_fdatawrite_range(mapping, from, end - 1);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > >   /*
> > > > >    * Zero out a mapping from file offset 'from' up to the end of the block
> > > > >    * which corresponds to 'from' or to the given 'end' inside this block.
> > > > > @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > >   	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> > > > >   		return 0;
> > > > > -	if (length > blocksize - offset)
> > > > > +	if (length > blocksize - offset) {
> > > > >   		length = blocksize - offset;
> > > > > +		end = from + length;
> > > > > +	}
> > > > >   	err = ext4_block_zero_range(inode, from, length,
> > > > >   				    &did_zero, &zero_written);
> > > > > @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > >   	 * TODO: In the iomap path, handle this by updating i_disksize to
> > > > >   	 * i_size after the zeroed data has been written back.
> > > > >   	 */
> > > > > -	if (ext4_should_order_data(inode) &&
> > > > > -	    did_zero && zero_written && !IS_DAX(inode)) {
> > > > > -		handle_t *handle;
> > > > > +	if (did_zero && zero_written && !IS_DAX(inode)) {
> > > > > +		if (ext4_should_order_data(inode)) {
> > > > > +			handle_t *handle;
> > > > > -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > -		if (IS_ERR(handle))
> > > > > -			return PTR_ERR(handle);
> > > > > +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > +			if (IS_ERR(handle))
> > > > > +				return PTR_ERR(handle);
> > > > > -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> > > > > -		ext4_journal_stop(handle);
> > > > > -		if (err)
> > > > > -			return err;
> > > > > +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> > > > > +							length);
> > > > > +			ext4_journal_stop(handle);
> > > > > +			if (err)
> > > > > +				return err;
> > > > > +		/*
> > > > > +		 * inodes using the iomap buffered I/O path do not use the
> > > > > +		 * data=ordered mode. We submit zeroed range directly here.
> > > > > +		 * Do not wait for I/O completion for performance.
> > > > > +		 *
> > > > > +		 * TODO: Any operation that extends i_disksize (including
> > > > > +		 * append write end io past the zeroed boundary, truncate up,
> > > > > +		 * and append fallocate) must wait for the relevant I/O to
> > > > > +		 * complete before updating i_disksize.
> > > > > +		 */
> > > > > +		} else if (ext4_inode_buffered_iomap(inode)) {
> > > > > +			err = ext4_iomap_submit_zero_block(inode, from, end);
> > > > > +			if (err)
> > > > > +				return err;
> > > > > +		}
> > > > >   	}
> > > > >   	return 0;
> > > > > -- 
> > > > > 2.52.0
> > > > > 
> > > 
> 

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-30  8:24       ` Zhang Yi
@ 2026-06-01 18:33         ` Ojaswin Mujoo
  2026-06-02  3:22           ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-01 18:33 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yizhang089, yangerkun,
	yukuai

On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > Hi, Ojaswin!
> > 
> > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>>  fs/ext4/ext4.h    | 11 +++++++
> >>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>>  fs/ext4/super.c   | 23 ++++++++++----
> >>>  4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>> index 078feda47e36..9ce2128eea3e 100644
> >>> --- a/fs/ext4/ext4.h
> >>> +++ b/fs/ext4/ext4.h
> >>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> >>>  #ifdef CONFIG_FS_ENCRYPTION
> >>>  	struct fscrypt_inode_info *i_crypt_info;
> >>>  #endif
> >>> +
> >>> +	/*
> >>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> >>> +	 * and truncate-up operations. These parameters are used only in the
> >>> +	 * iomap buffered I/O path.
> >>> +	 */
> >>> +	ext4_lblk_t i_ordered_lblk;
> >>> +	ext4_lblk_t i_ordered_len;
> >>> +	wait_queue_head_t i_ordered_wq;
> >>>  };
> >>>  
> >>>  /*
> >>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >>>  			     __u64 len, __u64 *moved_len);
> >>>  
> >>>  /* page-io.c */
> >>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> >>> +
> >>>  extern int __init ext4_init_pageio(void);
> >>>  extern void ext4_exit_pageio(void);
> >>>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>> index e013aeb03d7b..11fb369efeb1 100644
> >>> --- a/fs/ext4/inode.c
> >>> +++ b/fs/ext4/inode.c
> >>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>>  {
> >>>  	struct iomap_ioend *ioend = wpc->wb_ctx;
> >>>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> >>> +	ext4_lblk_t start, end, order_lblk, order_len;
> >>>  
> >>>  	/*
> >>>  	 * After I/O completion, a worker needs to be scheduled when:
> >>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> >>>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>>  
> >>> +	/*
> >>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> >>> +	 * handling and must not be merged with regular I/O operations.
> >>> +	 */
> >>> +	order_len = READ_ONCE(ei->i_ordered_len);
> >>> +	if (order_len) {
> >>> +		/*
> >>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> >>> +		 * Ensure we see the updated i_ordered_lblk that was written
> >>> +		 * before the release store to i_ordered_len.
> >>> +		 */
> >>> +		smp_rmb();
> >>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> >>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> >>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> >>> +				     ioend->io_offset + ioend->io_size);
> >>> +
> >>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
> >>
> >> Hi Zhang,
> >>
> >> I guess this check is enough cause ordered_lblk and ordered_len will
> >> always be  contained in a single block.
> > 
> > Yeah.
> > 
> >>
> >>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> >>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> >>
> >> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> >> since we want to wake up the waiters. Is there any other reason?
> > 
> > My original intention was to prevent the loss of the
> > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > completion, which could be caused by merging an ordered ioend with a
> > normal ioend.  In patch 19, we need to determine the flag to update
> > i_disksize to the correct position.

Ahh okay, we don't want the flag to be lost.

> > 
> >>
> >> Adding the boundary in ->writeback_submit() only affects
> >> iomap_ioend_can_merge() which happens after we have woken up the waiters
> >> and deferred the IO to the wq. We ideally want it affect
> >> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> >> ->writeback_range().
> > 
> > IIUC, merging into the same ioend during the submission stage doesn't
> > seem to cause any problems.

Got it, since the flag is set later. I was thinking we want to issue the
ordered IO quickly to wake up the waiters, rather than spend time trying
to merge it, and that this was why we wanted the flag.

> > 
> >>
> >> Secondly, I don't think boundary is the right flag here. It ensures
> >> that everything before the ordered iomap gets submitted and the ordered
> >> iomap starts a new ioend. This can still keep getting merged with the
> >> newer ioends until we decide to submit the IO, which can delay waking
> >> up the waiters. If we really want the "no merge" behavior, we'll have to
> >> do something like [1] (Check the 2 NOMERGE flag patches).
> > 
> > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > still cannot prevent merging. I missed this, thank you for pointing this
> > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > to check the iomap_ioend->io_private. Something like:
> > 
> > 	if (ioend->io_private || next->io_private)
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>             ioend->io_private != next->io_private

I guess if the purpose is just to not lose the flag, then boundary seems
to work, because we only lose the flag if the ordered ioend backward
merges into a previous one. The flag is retained if we forward merge.
Boundary seems to take care of the backward case.

However, if we want to avoid merges so we can quickly issue the IO and
wake up the waiters, then the above change looks good. Also, if this is
the reason, we'd want to have this during the submission stage as well,
so the flag setting will probably have to move to ->writeback_range().

Regards,
Ojaswin


> 
> 
> > 		return false;
> > 
> > What do you think?
> > 
> > Thanks,
> > Yi.
> 
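
The coverage test quoted in the patch above (start <= order_lblk &&
end >= order_lblk + order_len) can be modelled in userspace as a minimal
sketch. The type lblk_t and both helper names are hypothetical, and
EXT4_B_TO_LBLK() is assumed here to round the byte offset up to the next
block boundary; this is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ext4_lblk_t (a 32-bit logical block number). */
typedef uint32_t lblk_t;

/* Round a byte offset up to a block number (assumed EXT4_B_TO_LBLK model). */
static lblk_t bytes_to_lblk_round_up(uint64_t bytes, unsigned int blkbits)
{
	assert(blkbits < 32);
	return (lblk_t)((bytes + (1ULL << blkbits) - 1) >> blkbits);
}

/*
 * Return true when the ioend byte range [io_offset, io_offset + io_size)
 * fully covers the ordered block range [order_lblk, order_lblk + order_len).
 */
static bool ioend_covers_ordered(uint64_t io_offset, uint64_t io_size,
				 unsigned int blkbits,
				 lblk_t order_lblk, lblk_t order_len)
{
	/* First block touched by the ioend (rounded down). */
	lblk_t start = (lblk_t)(io_offset >> blkbits);
	/* One past the last block touched by the ioend (rounded up). */
	lblk_t end = bytes_to_lblk_round_up(io_offset + io_size, blkbits);

	return start <= order_lblk && end >= order_lblk + order_len;
}
```

Since the ordered range is a single block, as confirmed in the exchange
above, the test reduces to asking whether that one block lies inside the
ioend.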

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-06-01 18:33         ` Ojaswin Mujoo
@ 2026-06-02  3:22           ` Zhang Yi
  2026-06-02  5:35             ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-06-02  3:22 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yangerkun, yukuai

On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
>> On 5/30/2026 3:22 PM, Zhang Yi wrote:
>>> Hi, Ojaswin!
>>>
>>> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>>>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> For append writes, wait for ordered I/O to complete before updating
>>>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>>>> metadata update, preventing stale data from being exposed during
>>>>> unaligned post-EOF append writes.
>>>>>
>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>>   fs/ext4/ext4.h    | 11 +++++++
>>>>>   fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>>>   fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>>>   fs/ext4/super.c   | 23 ++++++++++----
>>>>>   4 files changed, 161 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 078feda47e36..9ce2128eea3e 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>>>>   #ifdef CONFIG_FS_ENCRYPTION
>>>>>   	struct fscrypt_inode_info *i_crypt_info;
>>>>>   #endif
>>>>> +
>>>>> +	/*
>>>>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
>>>>> +	 * and truncate-up operations. These parameters are used only in the
>>>>> +	 * iomap buffered I/O path.
>>>>> +	 */
>>>>> +	ext4_lblk_t i_ordered_lblk;
>>>>> +	ext4_lblk_t i_ordered_len;
>>>>> +	wait_queue_head_t i_ordered_wq;
>>>>>   };
>>>>>   
>>>>>   /*
>>>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>>>>   			     __u64 len, __u64 *moved_len);
>>>>>   
>>>>>   /* page-io.c */
>>>>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
>>>>> +
>>>>>   extern int __init ext4_init_pageio(void);
>>>>>   extern void ext4_exit_pageio(void);
>>>>>   extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index e013aeb03d7b..11fb369efeb1 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   {
>>>>>   	struct iomap_ioend *ioend = wpc->wb_ctx;
>>>>>   	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>>>> +	ext4_lblk_t start, end, order_lblk, order_len;
>>>>>   
>>>>>   	/*
>>>>>   	 * After I/O completion, a worker needs to be scheduled when:
>>>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>>>>   		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>>   
>>>>> +	/*
>>>>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
>>>>> +	 * handling and must not be merged with regular I/O operations.
>>>>> +	 */
>>>>> +	order_len = READ_ONCE(ei->i_ordered_len);
>>>>> +	if (order_len) {
>>>>> +		/*
>>>>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
>>>>> +		 * Ensure we see the updated i_ordered_lblk that was written
>>>>> +		 * before the release store to i_ordered_len.
>>>>> +		 */
>>>>> +		smp_rmb();
>>>>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>>>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>>>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
>>>>> +				     ioend->io_offset + ioend->io_size);
>>>>> +
>>>>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
>>>>
>>>> Hi Zhang,
>>>>
>>>> I guess this check is enough cause ordered_lblk and ordered_len will
>>>> always be  contained in a single block.
>>>
>>> Yeah.
>>>
>>>>
>>>>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>>>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>>>
>>>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>>>> since we want to wake up the waiters. Is there any other reason?
>>>
>>> My original intention was to prevent the loss of the
>>> EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
>>> completion, which could be caused by merging an ordered ioend with a
>>> normal ioend.  In patch 19, we need to determine the flag to update
>>> i_disksize to the correct position.
> 
> Ahh okay, we don't want the flag to be lost.
> 
>>>
>>>>
>>>> Adding the boundary in ->writeback_submit() only affects
>>>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>>>> and deferred the IO to the wq. We ideally want it affect
>>>> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
>>>> ->writeback_range().
>>>
>>> IIUC, merging into the same ioend during the submission stage doesn't
>>> seem to cause any problems.
> 
> Got it since the flag is set later. I was thinking we want to quickly
> issue the ordered IO to wake up the waiters and not waste time trying to
> merge it and hence we wanted to use that flag.
> 
>>>
>>>>
>>>> Secondly, I don't think boundary is the right flag here. It ensures
>>>> that everything before the ordered iomap gets submitted and the ordered
>>>> iomap starts a new ioend. This can still keep getting merged with the
>>>> newer ioends until we decide to submit the IO, which can delay waking
>>>> up the waiters. If we really want the "no merge" behavior, we'll have to
>>>> do something like [1] (Check the 2 NOMERGE flag patches).
>>>
>>> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
>>> still cannot prevent merging. I missed this, thank you for pointing this
>>> out. However, I think perhaps we should change iomap_ioend_can_merge()
>>> to check the iomap_ioend->io_private. Something like:
>>>
>>> 	if (ioend->io_private || next->io_private)
>>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>              ioend->io_private != next->io_private
> 
> I guess if the purpose is to just not lose the flag, then boundary seems
> to work for because we only lose the flag if ordered ioend backward
> merges to a prev one. Flag is retained if we forward merge. Which
> boundary seems to take care of.
> 
Yes, IOMAP_IOEND_BOUNDARY does work currently, as it prevents flag
loss. However, from the perspective of the iomap infrastructure, I
believe it is still necessary to add the ioend->io_private !=
next->io_private check, because ioends with different io_private values
should not be merged: merging them risks losing io_private or even
leaking memory. With this check in iomap, we would no longer need
IOMAP_IOEND_BOUNDARY.
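
To make the proposal concrete, here is a minimal userspace sketch of a
merge predicate with the io_private comparison added. struct model_ioend,
MODEL_IOEND_BOUNDARY and model_ioend_can_merge() are hypothetical
simplifications for illustration only; the real iomap_ioend_can_merge()
checks more conditions than are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag standing in for IOMAP_IOEND_BOUNDARY. */
#define MODEL_IOEND_BOUNDARY 0x1u

/* Hypothetical, heavily reduced model of an ioend. */
struct model_ioend {
	uint64_t io_offset;	/* file offset of the first byte */
	uint64_t io_size;	/* length in bytes */
	unsigned int io_flags;
	void *io_private;	/* filesystem-private cookie */
};

/* Decide whether 'next' may be merged into 'ioend'. */
static bool model_ioend_can_merge(const struct model_ioend *ioend,
				  const struct model_ioend *next)
{
	assert(ioend && next);

	/* A boundary on 'next' only stops it from joining a previous ioend. */
	if (next->io_flags & MODEL_IOEND_BOUNDARY)
		return false;
	/* Only byte-contiguous ioends are candidates. */
	if (ioend->io_offset + ioend->io_size != next->io_offset)
		return false;
	/* Proposed check: never merge ioends whose private data differs. */
	if (ioend->io_private != next->io_private)
		return false;
	return true;
}
```

In this model an ordered ioend (io_private set to the ORDER_IO value)
merges with neither the normal ioend before it nor the one after it,
which is the two-way behaviour the boundary flag alone does not give.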

> However, if we want to avoid merges so we can quickly issue IO and wake
> up the waiters then the above change looks good. Also, if this is the
> reason we'd also want to have this during submission stage so the flag
> setting will probs have to move to ->writeback_range()

Yes. Issuing ordered I/O as soon as possible is beneficial because it
reduces the latency of sync file range. Suppose that, while we are
syncing data beyond the ordered range, the background writeback process
has already started committing and has bundled the ordered range into a
large ioend (up to IOEND_BATCH_SIZE folios); this sync operation will
then experience significant latency. For other, non-sync scenarios,
there should be little benefit.

But I'm not sure this is strictly necessary, because in the existing
implementation, issuing ordered I/O via data=ordered mode works the same
way: it also doesn't issue ordered I/O as soon as possible, and it still
has to wait when it encounters concurrent background writeback. So I
think we can keep the current implementation for now and wait for user
feedback before deciding whether further optimization is needed.

Cheers,
Yi.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
  2026-06-01 18:15           ` Ojaswin Mujoo
@ 2026-06-02  3:36             ` Zhang Yi
  0 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-06-02  3:36 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai

On 6/2/2026 2:15 AM, Ojaswin Mujoo wrote:
> On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
>> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
>>> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>>>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> In the generic buffered_head I/O path, we rely on the data=order mode to
>>>>>> ensure that the zeroed EOF block data is written before updating
>>>>>> i_disksize, thus preventing stale data from being exposed.
>>>>>>
>>>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>>>> we issue the I/O immediately after performing the zero operation
>>>>>> (without synchronous waiting for performance). This can reduce the risk
>>>>>> of exposing stale data, but it does not guarantee that the zero data
>>>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>>>> The subsequent patches will wait for this I/O to complete before
>>>>>> updating i_disksize.
>>>>>>
>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> I think we discussed that we may not need to do this [1] but I guess
>>>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>>>> wait for bg flush to complete the tail page zeroing
>>>>>
>>
>> I'm glad to hear that, thanks.
>>
>>>>
>>>> Yes. For truncate up and append fallocate, originally i_disksize would
>>>> be updated immediately, and the change would be persisted via the
>>>> journal within default 5 seconds. But now, if the tail page is not
>>>> committed immediately, the update to i_disksize will be delayed by about
>>>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>>>> not sure what impact this change might have — I just don't really want
>>>> to introduce it.
>>>
>>>> For normal append writes, the impact is minimal, unless we call
>>>> sync_range to sync the portion of data that extends beyond EOF.
>>>
>>> Hmm while trying to retain the behavior for falloc and truncate up,
>>> we end up changing it for append writes :) But anyways, I understand
>>> your reasoning and don't have any strong opinions against it. I'll let
>>> Jan pitch in since he had some comments around this.
>>>
>>>>
>>>> In addition, if the zeroed page is not issued here immediately, the
>>>> logic will become more complex because we need to more careful about the
>>>> order of write-back IOs to prevent deadlock issues caused by mutual
>>>> waiting.
>>>
>>> You mean an endio completion waiting for ordered IO to complete but
>>> ordered IO writeback is somehow waiting for this endio completion? Is that
>>> actually possible?
>>
>> Well, after thinking it over more carefully, it seems this is
>> impossible, I cannot think of a scenario that could trigger this kind of
>> issue. The generic writeback process always executes writeback in folio
>> index order, so there would be no situation where a later data I/O
>> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
>> in the same ioend, there should be no problem. I was confused and
>> overthinking it.
>>
>>  From this perspective, if we can accept that truncate up and fallocate
>> will have longer persistence time by default(I guess this is
>> acceptable), we can avoid writing back zeroed data immediately. To
>> achieve this, we only need to consider the case of sync file range. :-)
> 
> Yeah, I think during writeback we will have to submit the ordered data
> if we are writing back data beyond the i_disksize.
> 
> If this is straightforward enough to implement, I think this approach
> would be a safer choice cause we will avoid overheads due to small,
> random ordered IOs overworking the writeback layer.

Indeed. Let me investigate further to make sure there are no other side
effects.

Thanks,
Yi.

> 
>>
>> Regards,
>> Yi.
>>>>
>>>>> However, I think one side effect might be many threads calling the
>>>>> writeback mechanism to issue zero IOs which might not scale well. I
>>>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>>>> thing we will have to deal with in case we see it in real world
>>>>> workloads.
>>>>>
>>>>
>>>> I agree with you. However, I suspect that unless we run some specific
>>>> benchmark tests, it should be difficult to encounter a large number of
>>>> post-EOF append writes and truncate up operations in real-world usage
>>>> scenarios — and I haven't come across such scenarios yet. For
>>>> simplicity, I'd like to proceed with this implementation for now. If we
>>>> do run into actual problems later, we can consider not issuing I/O
>>>> directly here, but instead: 1) find the ordered block in
>>>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>>>> for normal background writeback as well — otherwise, there is a risk of
>>>> deadlock (mutual waiting). What do you think?
>>>
>>> Yes sounds good Yi, we can deal with performance tuning later.
>>>
>>> Regards,
>>> Ojaswin
>>>
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>>>
>>>>>> ---
>>>>>>    fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>>>>    1 file changed, 55 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>>>>    					zero_written);
>>>>>>    }
>>>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>>>> +					loff_t from, loff_t end)
>>>>>> +{
>>>>>> +	struct address_space *mapping = inode->i_mapping;
>>>>>> +	struct folio *folio;
>>>>>> +	bool do_submit = false;
>>>>>> +
>>>>>> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>>>> +	if (IS_ERR(folio))
>>>>>> +		/* Already writeback and clear? */
>>>>>> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>>>> +
>>>>>> +	folio_wait_writeback(folio);
>>>>>> +	WARN_ON_ONCE(folio_test_writeback(folio));
>>>>>> +
>>>>>> +	if (likely(folio_test_dirty(folio)))
>>>>>> +		do_submit = true;
>>>>>> +	folio_unlock(folio);
>>>>>> +	folio_put(folio);
>>>>>> +
>>>>>> +	/* Submit zeroed block. */
>>>>>> +	if (do_submit)
>>>>>> +		return filemap_fdatawrite_range(mapping, from, end - 1);
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>>    /*
>>>>>>     * Zero out a mapping from file offset 'from' up to the end of the block
>>>>>>     * which corresponds to 'from' or to the given 'end' inside this block.
>>>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>>    	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>>>>    		return 0;
>>>>>> -	if (length > blocksize - offset)
>>>>>> +	if (length > blocksize - offset) {
>>>>>>    		length = blocksize - offset;
>>>>>> +		end = from + length;
>>>>>> +	}
>>>>>>    	err = ext4_block_zero_range(inode, from, length,
>>>>>>    				    &did_zero, &zero_written);
>>>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>>    	 * TODO: In the iomap path, handle this by updating i_disksize to
>>>>>>    	 * i_size after the zeroed data has been written back.
>>>>>>    	 */
>>>>>> -	if (ext4_should_order_data(inode) &&
>>>>>> -	    did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> -		handle_t *handle;
>>>>>> +	if (did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> +		if (ext4_should_order_data(inode)) {
>>>>>> +			handle_t *handle;
>>>>>> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> -		if (IS_ERR(handle))
>>>>>> -			return PTR_ERR(handle);
>>>>>> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> +			if (IS_ERR(handle))
>>>>>> +				return PTR_ERR(handle);
>>>>>> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>>>> -		ext4_journal_stop(handle);
>>>>>> -		if (err)
>>>>>> -			return err;
>>>>>> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>>>> +							length);
>>>>>> +			ext4_journal_stop(handle);
>>>>>> +			if (err)
>>>>>> +				return err;
>>>>>> +		/*
>>>>>> +		 * inodes using the iomap buffered I/O path do not use the
>>>>>> +		 * data=ordered mode. We submit zeroed range directly here.
>>>>>> +		 * Do not wait for I/O completion for performance.
>>>>>> +		 *
>>>>>> +		 * TODO: Any operation that extends i_disksize (including
>>>>>> +		 * append write end io past the zeroed boundary, truncate up,
>>>>>> +		 * and append fallocate) must wait for the relevant I/O to
>>>>>> +		 * complete before updating i_disksize.
>>>>>> +		 */
>>>>>> +		} else if (ext4_inode_buffered_iomap(inode)) {
>>>>>> +			err = ext4_iomap_submit_zero_block(inode, from, end);
>>>>>> +			if (err)
>>>>>> +				return err;
>>>>>> +		}
>>>>>>    	}
>>>>>>    	return 0;
>>>>>> -- 
>>>>>> 2.52.0
>>>>>>
>>>>
>>
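
The end-of-block clamp that the quoted patch adds to
ext4_block_zero_eof() can be illustrated with a small userspace sketch.
The helper name clamp_zero_end() is hypothetical, and the way offset and
length are derived from 'from' and 'end' is an assumption, since those
lines are not part of the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp the zeroing range [from, end) so that it never extends past the
 * end of the block containing 'from', and return the clamped end.
 */
static uint64_t clamp_zero_end(uint64_t from, uint64_t end,
			       unsigned int blocksize)
{
	uint64_t offset, length;

	/* The block size must be a non-zero power of two. */
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);
	assert(end >= from);

	offset = from & (blocksize - 1);	/* assumed: offset within block */
	length = end - from;			/* assumed: requested length */
	if (length > blocksize - offset)
		length = blocksize - offset;	/* stop at the block boundary */
	return from + length;
}
```

The patch then passes end - 1 to filemap_fdatawrite_range(), whose end
argument is inclusive, so without the clamp the submitted range could
reach beyond the single zeroed block.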


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-06-02  3:22           ` Zhang Yi
@ 2026-06-02  5:35             ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-02  5:35 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yangerkun, yukuai

On Tue, Jun 02, 2026 at 11:22:12AM +0800, Zhang Yi wrote:
> On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> > > On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > > > Hi, Ojaswin!
> > > > 
> > > > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> > > > > On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> > > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > > > 
> > > > > > For append writes, wait for ordered I/O to complete before updating
> > > > > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > > > > metadata update, preventing stale data from being exposed during
> > > > > > unaligned post-EOF append writes.
> > > > > > 
> > > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > > > > ---
> > > > > >   fs/ext4/ext4.h    | 11 +++++++
> > > > > >   fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> > > > > >   fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> > > > > >   fs/ext4/super.c   | 23 ++++++++++----
> > > > > >   4 files changed, 161 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > index 078feda47e36..9ce2128eea3e 100644
> > > > > > --- a/fs/ext4/ext4.h
> > > > > > +++ b/fs/ext4/ext4.h
> > > > > > @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> > > > > >   #ifdef CONFIG_FS_ENCRYPTION
> > > > > >   	struct fscrypt_inode_info *i_crypt_info;
> > > > > >   #endif
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> > > > > > +	 * and truncate-up operations. These parameters are used only in the
> > > > > > +	 * iomap buffered I/O path.
> > > > > > +	 */
> > > > > > +	ext4_lblk_t i_ordered_lblk;
> > > > > > +	ext4_lblk_t i_ordered_len;
> > > > > > +	wait_queue_head_t i_ordered_wq;
> > > > > >   };
> > > > > >   /*
> > > > > > @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> > > > > >   			     __u64 len, __u64 *moved_len);
> > > > > >   /* page-io.c */
> > > > > > +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> > > > > > +
> > > > > >   extern int __init ext4_init_pageio(void);
> > > > > >   extern void ext4_exit_pageio(void);
> > > > > >   extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > > index e013aeb03d7b..11fb369efeb1 100644
> > > > > > --- a/fs/ext4/inode.c
> > > > > > +++ b/fs/ext4/inode.c
> > > > > > @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > >   {
> > > > > >   	struct iomap_ioend *ioend = wpc->wb_ctx;
> > > > > >   	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> > > > > > +	ext4_lblk_t start, end, order_lblk, order_len;
> > > > > >   	/*
> > > > > >   	 * After I/O completion, a worker needs to be scheduled when:
> > > > > > @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > >   	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> > > > > >   		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > +	/*
> > > > > > +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> > > > > > +	 * handling and must not be merged with regular I/O operations.
> > > > > > +	 */
> > > > > > +	order_len = READ_ONCE(ei->i_ordered_len);
> > > > > > +	if (order_len) {
> > > > > > +		/*
> > > > > > +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> > > > > > +		 * Ensure we see the updated i_ordered_lblk that was written
> > > > > > +		 * before the release store to i_ordered_len.
> > > > > > +		 */
> > > > > > +		smp_rmb();
> > > > > > +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > > > > > +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> > > > > > +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> > > > > > +				     ioend->io_offset + ioend->io_size);
> > > > > > +
> > > > > > +		if (start <= order_lblk && end >= order_lblk + order_len) {
> > > > > 
> > > > > Hi Zhang,
> > > > > 
> > > > > I guess this check is enough cause ordered_lblk and ordered_len will
> > > > > always be  contained in a single block.
> > > > 
> > > > Yeah.
> > > > 
> > > > > 
> > > > > > +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> > > > > > +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> > > > > 
> > > > > FWIU, we are wanting the ordered IO to not be merged and submitted asap
> > > > > since we want to wake up the waiters. Is there any other reason?
> > > > 
> > > > My original intention was to prevent the loss of the
> > > > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > > > completion, which could be caused by merging an ordered ioend with a
> > > > normal ioend.  In patch 19, we need to determine the flag to update
> > > > i_disksize to the correct position.
> > 
> > Ahh okay, we don't want the flag to be lost.
> > 
> > > > 
> > > > > 
> > > > > Adding the boundary in ->writeback_submit() only affects
> > > > > iomap_ioend_can_merge() which happens after we have woken up the waiters
> > > > > and deferred the IO to the wq. We ideally want it affect
> > > > > iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> > > > > ->writeback_range().
> > > > 
> > > > IIUC, merging into the same ioend during the submission stage doesn't
> > > > seem to cause any problems.
> > 
> > Got it since the flag is set later. I was thinking we want to quickly
> > issue the ordered IO to wake up the waiters and not waste time trying to
> > merge it and hence we wanted to use that flag.
> > 
> > > > 
> > > > > 
> > > > > Secondly, I don't think boundary is the right flag here. It ensures
> > > > > that everything before the ordered iomap gets submitted and the ordered
> > > > > iomap starts a new ioend. This can still keep getting merged with the
> > > > > newer ioends until we decide to submit the IO, which can delay waking
> > > > > up the waiters. If we really want the "no merge" behavior, we'll have to
> > > > > do something like [1] (Check the 2 NOMERGE flag patches).
> > > > 
> > > > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > > > still cannot prevent merging. I missed this, thank you for pointing this
> > > > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > > > to check the iomap_ioend->io_private. Something like:
> > > > 
> > > > 	if (ioend->io_private || next->io_private)
> > >              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >              ioend->io_private != next->io_private
> > 
> > I guess if the purpose is to just not lose the flag, then boundary seems
> > to work for because we only lose the flag if ordered ioend backward
> > merges to a prev one. Flag is retained if we forward merge. Which
> > boundary seems to take care of.
> > 
> Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
> loss. However, from the perspective of the iomap infrastructure, I
> believe it is still necessary to add the ioend->io_private !=
> next->io_private check. Because ioends with different io_private values
> should not be merged, as this carries the risk of potentially losing
> io_private or even memory leaks. With this check in iomap, we would no
> longer need IOMAP_IOEND_BOUNDARY.

I agree that even outside this patchset it seems like a sane thing to
do.
> 
> > However, if we want to avoid merges so we can quickly issue IO and wake
> > up the waiters then the above change looks good. Also, if this is the
> > reason we'd also want to have this during submission stage so the flag
> > setting will probs have to move to ->writeback_range()
> 
> Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
> the latency of sync file range. Suppose when we are syncing data beyond
> the ordered range, the background writeback process has already started
> committing and bundled the ordered range into a large ioend (up to
> IOEND_BATCH_SIZE folios), then this sync operation will indeed
> experience significant latency. However, for other non-sync scenarios,
> there should be little benefit.
 
Yes, that's true.

> 
> But I'm not sure if this is strictly necessary, because in the existing
> implementation, issuing ordered I/O via data=ordered mode works the same
> way — it also doesn't issue ordered I/O as soon as possible, and still
> has to wait when encountering concurrent background writeback. So I
> think we can keep the current implementation for now and see user
> feedback to decide whether further optimization is needed.

I agree!

Thanks,
ojaswin
> 
> Cheers,
> Yi.
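
The smp_store_release() / smp_rmb() pairing described in the comments of
the quoted patch can be approximated in userspace with C11 atomics. This
is only a sketch under assumptions: struct ordered_range and both
function names are hypothetical, a release store paired with an acquire
fence stands in for the kernel primitives, and publishers are assumed to
be serialized externally (the patch does this step with the folio
locked).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of i_ordered_lblk / i_ordered_len. */
struct ordered_range {
	_Atomic uint32_t lblk;
	_Atomic uint32_t len;	/* non-zero means an ordered range is live */
};

/* Publish a one-block ordered range unless one is already tracked. */
static bool ordered_publish(struct ordered_range *r, uint32_t lblk)
{
	assert(r);
	if (atomic_load_explicit(&r->len, memory_order_relaxed) != 0)
		return false;	/* only the first in-flight range is tracked */
	atomic_store_explicit(&r->lblk, lblk, memory_order_relaxed);
	/* Release: the lblk store above is visible before len is non-zero. */
	atomic_store_explicit(&r->len, 1, memory_order_release);
	return true;
}

/* Read a consistent (lblk, len) pair; false if nothing is published. */
static bool ordered_snapshot(struct ordered_range *r,
			     uint32_t *lblk, uint32_t *len)
{
	uint32_t l;

	assert(r && lblk && len);
	l = atomic_load_explicit(&r->len, memory_order_relaxed);
	if (l == 0)
		return false;
	/* Acquire fence: plays the role of smp_rmb() on the reader side. */
	atomic_thread_fence(memory_order_acquire);
	*lblk = atomic_load_explicit(&r->lblk, memory_order_relaxed);
	*len = l;
	return true;
}
```

A reader that observes a non-zero len is then guaranteed to read the
lblk value written before it, which is the property the patch comments
rely on.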

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
  2026-05-30  9:32       ` Zhang Yi
@ 2026-06-02  5:56         ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-02  5:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Sat, May 30, 2026 at 05:32:54PM +0800, Zhang Yi wrote:
> On 5/28/2026 9:34 PM, Ojaswin Mujoo wrote:
> > On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>>  fs/ext4/ext4.h    | 11 +++++++
> >>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>>  fs/ext4/super.c   | 23 ++++++++++----
> >>>  4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> [...]
> >>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>>  					loff_t from, loff_t end)
> >>>  {
> >>>  	struct address_space *mapping = inode->i_mapping;
> >>> +	struct ext4_inode_info *ei = EXT4_I(inode);
> >>>  	struct folio *folio;
> >>>  	bool do_submit = false;
> >>> +	int ret;
> >>>  
> >>>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >>>  	if (IS_ERR(folio))
> >>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>>  	folio_wait_writeback(folio);
> >>>  	WARN_ON_ONCE(folio_test_writeback(folio));
> >>>  
> >>> -	if (likely(folio_test_dirty(folio)))
> >>> +	/*
> >>> +	 * Mark the ordered range. It will be cleared upon I/O completion
> >>> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> >>> +	 * (including append write end io past the zeroed boundary,
> >>> +	 * truncate up and append fallocate) must wait for this I/O to
> >>> +	 * complete before updating i_disksize.
> >>> +	 *
> >>> +	 * When multiple overlapping unaligned EOF writes are in flight, we
> >>> +	 * only need to track and wait for the first one. Subsequent writes
> >>> +	 * will zero the gap in memory and ensure that the zeroed data is
> >>> +	 * written out along with the valid data in the same block before
> >>> +	 * i_disksize is updated.
> >>> +	 */
> >>> +	if (likely(folio_test_dirty(folio) &&
> >>> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
> >>> +		WRITE_ONCE(ei->i_ordered_lblk,
> >>> +			   from >> inode->i_blkbits);
> >>> +		/*
> >>> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> >>> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> >>> +		 * i_ordered_lblk is visible when i_ordered_len becomes
> >>> +		 * non-zero.
> >>> +		 */
> >>> +		smp_store_release(&ei->i_ordered_len, 1);
> >>>  		do_submit = true;
> >>> +	}
> >>>  	folio_unlock(folio);
> >>>  	folio_put(folio);
> >>>  
> >>>  	/* Submit zeroed block. */
> >>> -	if (do_submit)
> >>> -		return filemap_fdatawrite_range(mapping, from, end - 1);
> >>> +	if (do_submit) {
> >>> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
> >>> +		if (ret) {
> >>> +			/*
> >>> +			 * Pairs with wait_event() in
> >>> +			 * ext4_iomap_wb_ordered_wait(). Ensure
> >>> +			 * i_ordered_len = 0 is visible before waking up
> >>> +			 * waiters.
> >>> +			 */
> >>> +			smp_store_release(&ei->i_ordered_len, 0);
> >>> +			wake_up_all(&ei->i_ordered_wq);
> >>> +			return ret;
> > 
> > Okay so even if the ordered IO fails we still let the i_disksize updates
> > go ahead? 
> 
> Yes when data_err=ignore, no when data_err=abort.
> 
> > I think this is a deviation from the current behavior where we
> > abort the journal. If this is acceptable we should at least add a comment
> > on why it's okay.
> > 
> 
> I think this behavior is consistent with the current data=ordered mode.
> In the data_err=ignore mode, if an I/O write fails, ext4_end_io_end()
> does not abort the journal, so i_disksize is still updated normally.
> Conversely, in the data_err=abort mode, the journal is aborted, and
> since i_disksize is not updated, it cannot be updated afterwards. Am I
> missing something?

So I was thinking about various scenarios where
filemap_fdatawrite_range() might return an ERROR and yes it seems like
we do end up aborting the journal for almost all paths and ENOMEM is
already taken care of. So I think it should be okay.
> 
> >>> +		}
> >>> +	}
> >>>  	return 0;
> >>>  }
> >>>  
> >>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >>>  		 * data=ordered mode. We submit zeroed range directly here.
> >>>  		 * Do not wait for I/O completion for performance.
> >>>  		 *
> >>> -		 * TODO: Any operation that extends i_disksize (including
> >>> -		 * append write end io past the zeroed boundary, truncate up,
> >>> -		 * and append fallocate) must wait for the relevant I/O to
> >>> -		 * complete before updating i_disksize.
> >>> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> >>> +		 * for I/O completion before updating i_disksize if the write
> >>> +		 * extends beyond the zeroed boundary.
> >>> +		 *
> >>> +		 * TODO: Any other operation that extends i_disksize
> >>> +		 * (including truncate up and append fallocate) must wait for
> >>> +		 * the relevant I/O to complete before updating i_disksize.
> >>>  		 */
> >>>  		} else if (ext4_inode_buffered_iomap(inode)) {
> >>>  			err = ext4_iomap_submit_zero_block(inode, from, end);
> >>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> >>> index 3050c887329f..ad05ebb49bf6 100644
> >>> --- a/fs/ext4/page-io.c
> >>> +++ b/fs/ext4/page-io.c
> >>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
> >>>  	return 0;
> >>>  }
> >>>  
> >>> +/*
> >>> + * If the old disk size is not block size aligned and the current
> >>> + * writeback range is entirely beyond the old EOF block, we should
> >>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> >>> + * written out, otherwise, it may expose stale data in that block.
> >>> + */
> >>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> >>> +				       loff_t pos, loff_t end)
> >>> +{
> >>> +	struct ext4_inode_info *ei = EXT4_I(inode);
> >>> +	unsigned int blocksize = i_blocksize(inode);
> >>> +	loff_t disksize = READ_ONCE(ei->i_disksize);
> >>> +	ext4_lblk_t order_lblk, order_len;
> >>> +
> >>> +	/*
> >>> +	 * Waiting for ordered I/O is unnecessary when:
> >>> +	 * - The on-disk size is block-aligned (no stale data exists).
> >>> +	 * - The write start is within the block of the old EOF
> >>> +	 *   (overwriting, or appending to a block that already contains
> >>> +	 *   valid data).
> >>> +	 */
> >>> +	if (!(disksize & (blocksize - 1)) ||
> >>> +	    pos < round_up(disksize, blocksize))
> >>> +		return;
> > 
> > Okay these checks are pretty confusing. I was initially thinking that
> > i_disksize's block would always be equal to i_ordered_lblk but it seems
> > that is not true because ext4_block_zero_eof() uses from=i_size.
> 
> Yeah, this is the key point that I was a bit confused about as
> well.
> 
> > 
> > So we could have a sequence where
> > 
> > 1. truncate 4k (i_disksize = i_size = 4k)
> > 2. write 8k,10k (i_disksize = 4k i_size = 10k, i_ordered_len = 0 (old i_size is block aligned))
> > 3. write 16k,18k (i_disksize = 4k i_size = 10k, i_ordered_len = 1, lblk=4)
>                                              18k                     lblk=2, (10k >> 12)
                                               ^^^ Yes, correct, my bad.
> > 
> > Here we issue ordered IO even though it's probably not needed. Now if
> > write 3 finishes first we see disksize as 4k so we don't wait for
> > ordered write. Which seems okay since we don't risk any stale data
> > exposure. However, this flow is pretty confusing.
> 
> Indeed!
> 
> > 
> > Can't we somehow avoid having to issue/set ordered len/lblk in case it
> > is not really needed, like only issue it if i_disksize (and not i_size)
> > is unaligned? That can simplify some of our checks and avoid extra IO
> > overhead.
> > 
> 
> I was also planning to explore optimizations on this point next.
> However, since the original logic in buffer_head already works this way,
> keeping the same logic in the iomap path will not introduce any
> additional side effects. To avoid unnecessary waiting, I simply added
> the disksize alignment check in ext4_iomap_wb_ordered_wait().
> 
> Therefore, I do not plan to implement this optimization in this series.
> I can open a separate series later to address this optimization — perhaps
> by checking i_disksize in ext4_block_zero_eof() before issuing or adding
> ordered I/O, and the buffer_head path might also benefit from optimization.
> Meanwhile, to avoid confusion, I can add a TODO comment in this patch.
> 
> What do you think?

Sure Zhang, such an optimization would make the code simpler but I'm
okay to do this in a different series.
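
For reference, the three-step sequence discussed above can be replayed in
a small userspace model. This is only a sketch under stated assumptions:
skip_ordered_wait() and ordered_lblk() are hypothetical helper names, the
4k block size is assumed, and nothing here is kernel code. The condition
itself mirrors the early-return check quoted from
ext4_iomap_wb_ordered_wait().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of blocksize, which must be a power of two. */
static inline uint64_t round_up_pow2(uint64_t x, uint64_t blocksize)
{
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);
	return (x + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Hypothetical model of the early-return test in the quoted
 * ext4_iomap_wb_ordered_wait(): returns true when waiting for the
 * ordered (zeroed EOF block) I/O is unnecessary.
 */
static inline bool skip_ordered_wait(uint64_t disksize, uint64_t pos,
				     uint64_t blocksize)
{
	/* On-disk size is block-aligned: no partial block to expose. */
	if (!(disksize & (blocksize - 1)))
		return true;
	/* Write starts within the block holding the old on-disk EOF. */
	return pos < round_up_pow2(disksize, blocksize);
}

/*
 * Hypothetical model of the block recorded in i_ordered_lblk:
 * 'from' is the old i_size passed to ext4_block_zero_eof().
 */
static inline uint64_t ordered_lblk(uint64_t from, unsigned int blkbits)
{
	return from >> blkbits;
}
```

With i_disksize = 4k, the writeback of write 3 (pos = 16k) skips the wait
because the on-disk size is aligned, and the block recorded for that write
is 10k >> 12 = 2, which matches the correction above. If i_disksize had
instead been 10k, the same write would have to wait, since 16k lies beyond
round_up(10k, 4k) = 12k.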

Regards,
ojaswin

> 
> Cheers,
> Yi.
> 


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-28 15:40     ` Darrick J. Wong
@ 2026-06-02  7:02       ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-02  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Thu, May 28, 2026 at 08:40:59AM -0700, Darrick J. Wong wrote:
> On Tue, May 26, 2026 at 10:40:30PM +0530, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > Introduce two new iomap_ops instances for ext4 buffered writes:
> > > 
> > >  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> > >    ext4_da_map_blocks() to map delalloc extents.
> > >  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> > >    ext4_iomap_get_blocks() to directly allocate blocks.
> > > 
> > > Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> > > validity.
> > > 
> > > Key changes and considerations:
> > > 
> > >  - Unwritten extents for new blocks (dioread_nolock always on)
> > >    Since data=ordered mode is not used to prevent stale data exposure in
> > >    the non-delayed allocation path, new blocks are always allocated as
> > >    unwritten extents.
> > 
> > Okay makes sense.
> > 
> > > 
> > >  - Short write and write failure handling
> > >    a. Delalloc path: On short write or failure, the stale delalloc range
> > >       must be dropped and its space reservation released. Otherwise, a
> > >       clean folio may cover leftover delalloc extents, causing
> > >       inaccurate space reservation accounting.
> > 
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> > 
> > However, in iomap we don't mark the range that we couldn't write as
> > dirty, so we need to make sure we clear up the stale delalloc mappings.
> > Is this correct?
> 
> Yes, that's true of iomap's pagecache handling.

Thanks for confirming Darrick.
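
As a rough illustration of the short-write cleanup confirmed here, the byte
range that the delalloc iomap_end path has to punch can be modelled in
userspace. The sketch mirrors the start_byte/end_byte computation in the
quoted ext4_iomap_buffered_da_write_end(). last_written_block() is a
stand-in written from my reading of iomap_last_written_block() and should
be treated as an assumption, as should the other helper names and the 4k
block size used in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

static inline uint64_t rdown(uint64_t x, uint64_t bs)
{
	return x & ~(bs - 1);
}

static inline uint64_t rup(uint64_t x, uint64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/*
 * Assumed behaviour: if nothing was written, punching starts at the
 * block containing pos; otherwise it starts after the last block that
 * received any data.
 */
static inline uint64_t last_written_block(uint64_t pos, uint64_t written,
					  uint64_t bs)
{
	if (!written)
		return rdown(pos, bs);
	return rup(pos + written, bs);
}

/*
 * Hypothetical helper: range of a newly reserved delalloc mapping that
 * was not written and therefore has to be released.
 */
static inline struct byte_range stale_delalloc_range(uint64_t offset,
						     uint64_t length,
						     uint64_t written,
						     uint64_t bs)
{
	struct byte_range r;

	assert(bs && (bs & (bs - 1)) == 0);
	r.start = last_written_block(offset, written, bs);
	r.end = rup(offset + length, bs);
	if (r.start >= r.end)
		r.start = r.end;	/* fully written: empty range */
	return r;
}
```

For a 16k mapping at offset 0 where only 5000 bytes were copied, the model
yields [8k, 16k): the two blocks that received data stay reserved and the
remaining two are released. A complete write yields an empty range.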

Regards,
Ojaswin

> 
> --D
> 
> > Regards,
> > Ojaswin
> > 
> > >    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> > >       short write.
> > > 
> > >  - Lock ordering reversal
> > >    The folio lock and transaction start ordering is reversed compared to
> > >    the buffer_head buffered write path. To handle this, the journal
> > >    handle must be stopped in iomap_begin() callbacks. The lock ordering
> > >    documentation in super.c has been updated accordingly.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > >  fs/ext4/ext4.h  |   4 ++
> > >  fs/ext4/file.c  |  20 +++++-
> > >  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> > >  fs/ext4/super.c |  10 ++-
> > >  4 files changed, 192 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 1e27d73d7427..4832e7f7db82 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> > >  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> > >  				struct buffer_head *bh);
> > >  void ext4_set_inode_mapping_order(struct inode *inode);
> > > +int ext4_nonda_switch(struct super_block *sb);
> > >  #define FALL_BACK_TO_NONDELALLOC 1
> > >  #define CONVERT_INLINE_DATA	 2
> > >  
> > > @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> > >  
> > >  extern const struct iomap_ops ext4_iomap_ops;
> > >  extern const struct iomap_ops ext4_iomap_report_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> > > +extern const struct iomap_write_ops ext4_iomap_write_ops;
> > >  
> > >  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> > >  {
> > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > > index eb1a323962b1..7f9bfbbc4a4e 100644
> > > --- a/fs/ext4/file.c
> > > +++ b/fs/ext4/file.c
> > > @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	return count;
> > >  }
> > >  
> > > +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> > > +					 struct iov_iter *from)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	const struct iomap_ops *iomap_ops;
> > > +
> > > +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> > > +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> > > +	else
> > > +		iomap_ops = &ext4_iomap_buffered_write_ops;
> > > +
> > > +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> > > +					 &ext4_iomap_write_ops, NULL);
> > > +}
> > > +
> > >  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > >  					struct iov_iter *from)
> > >  {
> > > @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > >  	if (ret <= 0)
> > >  		goto out;
> > >  
> > > -	ret = generic_perform_write(iocb, from);
> > > +	if (ext4_inode_buffered_iomap(inode))
> > > +		ret = ext4_iomap_buffered_write(iocb, from);
> > > +	else
> > > +		ret = generic_perform_write(iocb, from);
> > >  
> > >  out:
> > >  	inode_unlock(inode);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 39577a6b65b9..1ae7d3f4a1c8 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> > >  	return ret;
> > >  }
> > >  
> > > -static int ext4_nonda_switch(struct super_block *sb)
> > > +int ext4_nonda_switch(struct super_block *sb)
> > >  {
> > >  	s64 free_clusters, dirty_clusters;
> > >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> > >  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> > >  }
> > >  
> > > +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> > > +{
> > > +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> > > +}
> > > +
> > > +const struct iomap_write_ops ext4_iomap_write_ops = {
> > > +	.iomap_valid = ext4_iomap_valid,
> > > +};
> > > +
> > >  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > >  			   struct ext4_map_blocks *map, loff_t offset,
> > >  			   loff_t length, unsigned int flags)
> > > @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > >  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > >  		iomap->flags |= IOMAP_F_MERGED;
> > >  
> > > +	iomap->validity_cookie = map->m_seq;
> > > +
> > >  	/*
> > >  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
> > >  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> > > @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> > >  	.iomap_begin = ext4_iomap_begin_report,
> > >  };
> > >  
> > > +/* Map blocks */
> > > +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> > > +
> > >  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > > -		loff_t length, struct ext4_map_blocks *map)
> > > +		loff_t length, ext4_get_blocks_t get_blocks,
> > > +		struct ext4_map_blocks *map)
> > >  {
> > >  	u8 blkbits = inode->i_blkbits;
> > >  
> > > @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > >  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> > >  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> > >  
> > > +	if (get_blocks)
> > > +		return get_blocks(inode, map);
> > > +
> > >  	return ext4_map_blocks(NULL, inode, map, 0);
> > >  }
> > >  
> > > @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > >  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > >  		return -ERANGE;
> > >  
> > > -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> > > +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> > >  	if (ret < 0)
> > >  		return ret;
> > >  
> > > @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > >  	return 0;
> > >  }
> > >  
> > > +static int ext4_iomap_get_blocks(struct inode *inode,
> > > +				 struct ext4_map_blocks *map)
> > > +{
> > > +	loff_t i_size = i_size_read(inode);
> > > +	handle_t *handle;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Check if the blocks have already been allocated, this could
> > > +	 * avoid initiating a new journal transaction and return the
> > > +	 * mapping information directly.
> > > +	 */
> > > +	if ((map->m_lblk + map->m_len) <=
> > > +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> > > +		ret = ext4_map_blocks(NULL, inode, map, 0);
> > > +		if (ret < 0)
> > > +			return ret;
> > > +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> > > +				    EXT4_MAP_DELAYED))
> > > +			return 0;
> > > +	}
> > > +
> > > +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> > > +			ext4_chunk_trans_blocks(inode, map->m_len));
> > > +	if (IS_ERR(handle))
> > > +		return PTR_ERR(handle);
> > > +
> > > +	ret = ext4_map_blocks(handle, inode, map,
> > > +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> > > +	/*
> > > +	 * Stop handle here following the lock ordering of the folio lock
> > > +	 * and the transaction start.
> > > +	 */
> > > +	ext4_journal_stop(handle);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> > > +{
> > > +	int ret, retries = 0;
> > > +	struct ext4_map_blocks map;
> > > +	ext4_get_blocks_t *get_blocks;
> > > +
> > > +	ret = ext4_emergency_state(inode->i_sb);
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	/* Inline data and non-extent are not supported. */
> > > +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > > +		return -ERANGE;
> > > +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> > > +		return -EINVAL;
> > > +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> > > +		return -EINVAL;
> > > +
> > > +	if (delalloc)
> > > +		get_blocks = ext4_da_map_blocks;
> > > +	else
> > > +		get_blocks = ext4_iomap_get_blocks;
> > > +retry:
> > > +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> > > +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > +		goto retry;
> > > +	if (ret < 0)
> > > +		return ret;
> > > +
> > > +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> > > +	return 0;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > +						  iomap, srcmap, false);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > +						  iomap, srcmap, true);
> > > +}
> > > +
> > > +/*
> > > + * On write failure, drop the stale delayed allocation range and release
> > > + * its reserved space for both start and end blocks. Otherwise, we may
> > > + * leave a range of delayed extents covered by a clean folio, which can
> > > + * result in inaccurate space reservation accounting.
> > > + */
> > > +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> > > +				     loff_t length, struct iomap *iomap)
> > > +{
> > > +	down_write(&EXT4_I(inode)->i_data_sem);
> > > +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> > > +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> > > +	up_write(&EXT4_I(inode)->i_data_sem);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> > > +					    loff_t length, ssize_t written,
> > > +					    unsigned int flags,
> > > +					    struct iomap *iomap)
> > > +{
> > > +	loff_t start_byte, end_byte;
> > > +
> > > +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> > > +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > > +		return 0;
> > > +
> > > +	/* Nothing to do if we've written the entire delalloc extent */
> > > +	start_byte = iomap_last_written_block(inode, offset, written);
> > > +	end_byte = round_up(offset + length, i_blocksize(inode));
> > > +	if (start_byte >= end_byte)
> > > +		return 0;
> > > +
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> > > +				     iomap, ext4_iomap_punch_delalloc);
> > > +	filemap_invalidate_unlock(inode->i_mapping);
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Since we always allocate unwritten extents, there is no need for
> > > + * iomap_end to clean up allocated blocks on a short write.
> > > + */
> > > +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> > > +	.iomap_begin = ext4_iomap_buffered_write_begin,
> > > +};
> > > +
> > > +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> > > +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> > > +	.iomap_end = ext4_iomap_buffered_da_write_end,
> > > +};
> > > +
> > >  const struct iomap_ops ext4_iomap_buffered_read_ops = {
> > >  	.iomap_begin = ext4_iomap_buffered_read_begin,
> > >  };
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 6a77db4d3124..9bc294b769db 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> > >   *   -> page lock -> i_data_sem (rw)
> > >   *
> > >   * buffered write path:
> > > - * sb_start_write -> i_mutex -> mmap_lock
> > > - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> > > - *   i_data_sem (rw)
> > > + * sb_start_write -> i_rwsem (w) -> mmap_lock
> > > + * - buffer_head path:
> > > + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> > > + *     i_data_sem (rw)
> > > + * - iomap path:
> > > + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> > > + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> > >   *
> > >   * truncate:
> > >   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> > > -- 
> > > 2.52.0
> > > 
> > 


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-29  9:13     ` Zhang Yi
@ 2026-06-02 10:05       ` Ojaswin Mujoo
  2026-06-03  1:44         ` Zhang Yi
  0 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-02 10:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >>    ext4_da_map_blocks() to map delalloc extents.
> >>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >>    ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >>  - Unwritten extents for new blocks (dioread_nolock always on)
> >>    Since data=ordered mode is not used to prevent stale data exposure in
> >>    the non-delayed allocation path, new blocks are always allocated as
> >>    unwritten extents.
> > 
> > Okay makes sense.
> > 
> >>
> >>  - Short write and write failure handling
> >>    a. Delalloc path: On short write or failure, the stale delalloc range
> >>       must be dropped and its space reservation released. Otherwise, a
> >>       clean folio may cover leftover delalloc extents, causing
> >>       inaccurate space reservation accounting.
> > 
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> > 
> 
> In fact, in the normal buffer head path, writeback does not consume
> delalloc reservations. Instead, the reservations are retained until the
> inode is released or the area is written again using delalloc. This is
> because i_size is not updated during short writes. Therefore, when a
> zeroed dirty folio is written back, no block mapping is created for it.
> For details, please see the lblk >= blocks judgment in
> mpage_process_page_bufs().

Oh okay, I see. I'm not very clear on the code path, but what about a case
where i_size is beyond the short write range?

> 
> This will not lead to duplicate space statistics, because
> ext4_da_map_blocks() only reserves space when inserting a new delalloc
> extent. Therefore, this does not pose a serious issue. However, it may
> cause some temporary and minor space leaks. Nevertheless, I think it
> would be better if delalloc extents could be released for the buffer
> head path when short writes occur.

Yes true, ideally it would be more intuitive if we cancelled the
reservations in short write.
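
To make the accounting argument above concrete, here is a toy userspace
model. It is not ext4 code: struct toy_inode, toy_da_map() and
toy_da_punch() are hypothetical names, and one reserved unit per block is
an assumption. It only illustrates the stated rule that space is reserved
when a new delalloc extent is inserted, so a leftover extent from a short
write is not counted twice when the range is written again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 16

/* Toy stand-in for per-inode delalloc state. */
struct toy_inode {
	bool delayed[TOY_NBLOCKS];	/* block has a delalloc extent */
	unsigned int reserved;		/* blocks currently reserved */
};

/* Reserve space only for blocks that are not already delayed. */
static inline void toy_da_map(struct toy_inode *ti, size_t lblk, size_t len)
{
	assert(lblk + len <= TOY_NBLOCKS);
	for (size_t i = lblk; i < lblk + len; i++) {
		if (!ti->delayed[i]) {
			ti->delayed[i] = true;
			ti->reserved++;
		}
	}
}

/* Drop delalloc extents in the range and release their reservation. */
static inline void toy_da_punch(struct toy_inode *ti, size_t lblk, size_t len)
{
	assert(lblk + len <= TOY_NBLOCKS);
	for (size_t i = lblk; i < lblk + len; i++) {
		if (ti->delayed[i]) {
			ti->delayed[i] = false;
			ti->reserved--;
		}
	}
}
```

Mapping blocks 0-3 and then 2-5 leaves six blocks reserved rather than
eight, which is the "no duplicate statistics" case. Punching blocks 2-5
afterwards returns four of them, which is what cancelling the reservation
on a short write would look like in this model.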

Regards,
ojaswin

> 
> > However, in iomap we don't mark the range that we couldn't write as
> > dirty, so we need to make sure we clear up the stale delalloc mappings.
> > Is this correct?
> > 
> Yeah.
> 
> Thanks,
> Yi.
> 
> > Regards,
> > Ojaswin
> > 
> >>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >>       short write.
> >>
> >>  - Lock ordering reversal
> >>    The folio lock and transaction start ordering is reversed compared to
> >>    the buffer_head buffered write path. To handle this, the journal
> >>    handle must be stopped in iomap_begin() callbacks. The lock ordering
> >>    documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/ext4.h  |   4 ++
> >>  fs/ext4/file.c  |  20 +++++-
> >>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  fs/ext4/super.c |  10 ++-
> >>  4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >>  				struct buffer_head *bh);
> >>  void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >>  #define FALL_BACK_TO_NONDELALLOC 1
> >>  #define CONVERT_INLINE_DATA	 2
> >>  
> >> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> >>  
> >>  extern const struct iomap_ops ext4_iomap_ops;
> >>  extern const struct iomap_ops ext4_iomap_report_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> >> +extern const struct iomap_write_ops ext4_iomap_write_ops;
> >>  
> >>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> >>  {
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index eb1a323962b1..7f9bfbbc4a4e 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >>  	return count;
> >>  }
> >>  
> >> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> >> +					 struct iov_iter *from)
> >> +{
> >> +	struct inode *inode = file_inode(iocb->ki_filp);
> >> +	const struct iomap_ops *iomap_ops;
> >> +
> >> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> >> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> >> +	else
> >> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> >> +
> >> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> >> +					 &ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >>  					struct iov_iter *from)
> >>  {
> >> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >>  	if (ret <= 0)
> >>  		goto out;
> >>  
> >> -	ret = generic_perform_write(iocb, from);
> >> +	if (ext4_inode_buffered_iomap(inode))
> >> +		ret = ext4_iomap_buffered_write(iocb, from);
> >> +	else
> >> +		ret = generic_perform_write(iocb, from);
> >>  
> >>  out:
> >>  	inode_unlock(inode);
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 39577a6b65b9..1ae7d3f4a1c8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> >>  	return ret;
> >>  }
> >>  
> >> -static int ext4_nonda_switch(struct super_block *sb)
> >> +int ext4_nonda_switch(struct super_block *sb)
> >>  {
> >>  	s64 free_clusters, dirty_clusters;
> >>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> >> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> >>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> >>  }
> >>  
> >> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> >> +{
> >> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> >> +}
> >> +
> >> +const struct iomap_write_ops ext4_iomap_write_ops = {
> >> +	.iomap_valid = ext4_iomap_valid,
> >> +};
> >> +
> >>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  			   struct ext4_map_blocks *map, loff_t offset,
> >>  			   loff_t length, unsigned int flags)
> >> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >>  		iomap->flags |= IOMAP_F_MERGED;
> >>  
> >> +	iomap->validity_cookie = map->m_seq;
> >> +
> >>  	/*
> >>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
> >>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> >> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >>  	.iomap_begin = ext4_iomap_begin_report,
> >>  };
> >>  
> >> +/* Map blocks */
> >> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> >> +
> >>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> -		loff_t length, struct ext4_map_blocks *map)
> >> +		loff_t length, ext4_get_blocks_t get_blocks,
> >> +		struct ext4_map_blocks *map)
> >>  {
> >>  	u8 blkbits = inode->i_blkbits;
> >>  
> >> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> >>  
> >> +	if (get_blocks)
> >> +		return get_blocks(inode, map);
> >> +
> >>  	return ext4_map_blocks(NULL, inode, map, 0);
> >>  }
> >>  
> >> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >>  		return -ERANGE;
> >>  
> >> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >>  	if (ret < 0)
> >>  		return ret;
> >>  
> >> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>  
> >> +static int ext4_iomap_get_blocks(struct inode *inode,
> >> +				 struct ext4_map_blocks *map)
> >> +{
> >> +	loff_t i_size = i_size_read(inode);
> >> +	handle_t *handle;
> >> +	int ret;
> >> +
> >> +	/*
> >> +	 * Check if the blocks have already been allocated, this could
> >> +	 * avoid initiating a new journal transaction and return the
> >> +	 * mapping information directly.
> >> +	 */
> >> +	if ((map->m_lblk + map->m_len) <=
> >> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> >> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> >> +				    EXT4_MAP_DELAYED))
> >> +			return 0;
> >> +	}
> >> +
> >> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> >> +			ext4_chunk_trans_blocks(inode, map->m_len));
> >> +	if (IS_ERR(handle))
> >> +		return PTR_ERR(handle);
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map,
> >> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> >> +	/*
> >> +	 * Stop handle here following the lock ordering of the folio lock
> >> +	 * and the transaction start.
> >> +	 */
> >> +	ext4_journal_stop(handle);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> >> +{
> >> +	int ret, retries = 0;
> >> +	struct ext4_map_blocks map;
> >> +	ext4_get_blocks_t *get_blocks;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	/* Inline data and non-extent are not supported. */
> >> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> +		return -ERANGE;
> >> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> >> +		return -EINVAL;
> >> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> >> +		return -EINVAL;
> >> +
> >> +	if (delalloc)
> >> +		get_blocks = ext4_da_map_blocks;
> >> +	else
> >> +		get_blocks = ext4_iomap_get_blocks;
> >> +retry:
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> >> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >> +		goto retry;
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> +				     loff_t length, struct iomap *iomap)
> >> +{
> >> +	down_write(&EXT4_I(inode)->i_data_sem);
> >> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> +	up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> +					    loff_t length, ssize_t written,
> >> +					    unsigned int flags,
> >> +					    struct iomap *iomap)
> >> +{
> >> +	loff_t start_byte, end_byte;
> >> +
> >> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> >> +		return 0;
> >> +
> >> +	/* Nothing to do if we've written the entire delalloc extent */
> >> +	start_byte = iomap_last_written_block(inode, offset, written);
> >> +	end_byte = round_up(offset + length, i_blocksize(inode));
> >> +	if (start_byte >= end_byte)
> >> +		return 0;
> >> +
> >> +	filemap_invalidate_lock(inode->i_mapping);
> >> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> >> +				     iomap, ext4_iomap_punch_delalloc);
> >> +	filemap_invalidate_unlock(inode->i_mapping);
> >> +	return 0;
> >> +}
> >> +
> >> +/*
> >> + * Since we always allocate unwritten extents, there is no need for
> >> + * iomap_end to clean up allocated blocks on a short write.
> >> + */
> >> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_write_begin,
> >> +};
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> >> +	.iomap_end = ext4_iomap_buffered_da_write_end,
> >> +};
> >> +
> >>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
> >>  	.iomap_begin = ext4_iomap_buffered_read_begin,
> >>  };
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index 6a77db4d3124..9bc294b769db 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> >>   *   -> page lock -> i_data_sem (rw)
> >>   *
> >>   * buffered write path:
> >> - * sb_start_write -> i_mutex -> mmap_lock
> >> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> >> - *   i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> >> + * - buffer_head path:
> >> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> >> + *     i_data_sem (rw)
> >> + * - iomap path:
> >> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> >> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> >>   *
> >>   * truncate:
> >>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> >> -- 
> >> 2.52.0
> >>
> 


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
  2026-05-26 17:10   ` Ojaswin Mujoo
@ 2026-06-02 10:26   ` Ojaswin Mujoo
  2026-06-03  2:56     ` Zhang Yi
  2026-06-16 10:45   ` Jan Kara
  2 siblings, 1 reply; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-02 10:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce two new iomap_ops instances for ext4 buffered writes:
> 
>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>    ext4_da_map_blocks() to map delalloc extents.
>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>    ext4_iomap_get_blocks() to directly allocate blocks.
> 
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
> 
> Key changes and considerations:
> 
>  - Unwritten extents for new blocks (dioread_nolock always on)
>    Since data=ordered mode is not used to prevent stale data exposure in
>    the non-delayed allocation path, new blocks are always allocated as
>    unwritten extents.
> 
>  - Short write and write failure handling
>    a. Delalloc path: On short write or failure, the stale delalloc range
>       must be dropped and its space reservation released. Otherwise, a
>       clean folio may cover leftover delalloc extents, causing
>       inaccurate space reservation accounting.
>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>       short write.
> 
>  - Lock ordering reversal
>    The folio lock and transaction start ordering is reversed compared to
>    the buffer_head buffered write path. To handle this, the journal
>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>    documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

I went through this again, and after our discussion the changes look
okay. Just a small question below, but otherwise feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> ---
>  fs/ext4/ext4.h  |   4 ++
>  fs/ext4/file.c  |  20 +++++-
>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  10 ++-
>  4 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
>  void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2

<snip>

> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length, struct iomap *iomap)
> +{
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> +					    loff_t length, ssize_t written,
> +					    unsigned int flags,
> +					    struct iomap *iomap)
> +{
> +	loff_t start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))

Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
either creates a new IOMAP_DELALLOC extent or finds older ones which
won't have EXT4_MAP_NEW set

> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> +				     iomap, ext4_iomap_punch_delalloc);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return 0;
> +}


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-06-02 10:05       ` Ojaswin Mujoo
@ 2026-06-03  1:44         ` Zhang Yi
  0 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-06-03  1:44 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 6/2/2026 6:05 PM, Ojaswin Mujoo wrote:
> On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>>>
>>>>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>>>    ext4_da_map_blocks() to map delalloc extents.
>>>>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>>>    ext4_iomap_get_blocks() to directly allocate blocks.
>>>>
>>>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>>>> validity.
>>>>
>>>> Key changes and considerations:
>>>>
>>>>  - Unwritten extents for new blocks (dioread_nolock always on)
>>>>    Since data=ordered mode is not used to prevent stale data exposure in
>>>>    the non-delayed allocation path, new blocks are always allocated as
>>>>    unwritten extents.
>>>
>>> Okay makes sense.
>>>
>>>>
>>>>  - Short write and write failure handling
>>>>    a. Delalloc path: On short write or failure, the stale delalloc range
>>>>       must be dropped and its space reservation released. Otherwise, a
>>>>       clean folio may cover leftover delalloc extents, causing
>>>>       inaccurate space reservation accounting.
>>>
>>> Hmm, okay so in the usual buffer head path, seems like during a short
>>> write we still zero the new buffers we couldn't write and keep it dirty
>>> (folio_zero_new_buffers()). This way they are still written back and
>>> the delalloc reservations are used up.
>>>
>>
>> In fact, in the normal buffer head path, writeback does not consume
>> delalloc reservations. Instead, the reservations are retained until the
>> inode is released or the area is written again using delalloc. This is
>> because i_size is not updated during short writes. Therefore, when a
>> zeroed dirty folio is written back, no block mapping is created for it.
>> For details, please see the lblk >= blocks judgment in
>> mpage_process_page_bufs().
> 
> Oh okay, I see. I'm not very clear on the code path, but what about a case
> where i_size is beyond the short write range?
> 

Yeah, you're right. When i_size extends beyond the short write range,
the delalloc reservation will be consumed during writeback.
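
To make the two cases concrete, here is a minimal user-space sketch of
the block-range check discussed above (the lblk >= blocks test in
mpage_process_page_bufs()). It is a simplified model, not the kernel
code: the helper names model_blocks_within_isize() and
model_writeback_maps_block() are made up for illustration, and it
assumes the limit is i_size rounded up to a whole block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model (hypothetical helper, not kernel code): number of
 * logical blocks covered by i_size, rounded up to a whole block.
 */
static uint64_t model_blocks_within_isize(uint64_t i_size, unsigned int blkbits)
{
	uint64_t blocksize;

	assert(blkbits < 32);	/* keep the shift well defined */
	blocksize = (uint64_t)1 << blkbits;
	return (i_size + blocksize - 1) >> blkbits;
}

/*
 * Returns true if writeback would map logical block 'lblk', i.e. the
 * block lies below the i_size-derived limit. A block at or beyond the
 * limit is skipped, so its delalloc reservation is not consumed.
 */
static bool model_writeback_maps_block(uint64_t i_size, unsigned int blkbits,
				       uint64_t lblk)
{
	return lblk < model_blocks_within_isize(i_size, blkbits);
}
```

With 4 KiB blocks (blkbits = 12) and i_size = 4096, block 1 is at the
limit and is skipped, so its reservation is retained. With
i_size = 16384, block 1 is below the limit and gets mapped, which
consumes the reservation.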

Thanks,
Yi.





* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-06-02 10:26   ` Ojaswin Mujoo
@ 2026-06-03  2:56     ` Zhang Yi
  2026-06-03 11:08       ` Ojaswin Mujoo
  0 siblings, 1 reply; 85+ messages in thread
From: Zhang Yi @ 2026-06-03  2:56 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>
>>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>    ext4_da_map_blocks() to map delalloc extents.
>>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>    ext4_iomap_get_blocks() to directly allocate blocks.
>>
>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>> validity.
>>
>> Key changes and considerations:
>>
>>  - Unwritten extents for new blocks (dioread_nolock always on)
>>    Since data=ordered mode is not used to prevent stale data exposure in
>>    the non-delayed allocation path, new blocks are always allocated as
>>    unwritten extents.
>>
>>  - Short write and write failure handling
>>    a. Delalloc path: On short write or failure, the stale delalloc range
>>       must be dropped and its space reservation released. Otherwise, a
>>       clean folio may cover leftover delalloc extents, causing
>>       inaccurate space reservation accounting.
>>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>>       short write.
>>
>>  - Lock ordering reversal
>>    The folio lock and transaction start ordering is reversed compared to
>>    the buffer_head buffered write path. To handle this, the journal
>>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>>    documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> I went through this again, and after our discussion the changes look
> okay. Just a small question below, but otherwise feel free to add:
> 
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thank you a lot for your careful review!

> 
>> ---
>>  fs/ext4/ext4.h  |   4 ++
>>  fs/ext4/file.c  |  20 +++++-
>>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  fs/ext4/super.c |  10 ++-
>>  4 files changed, 192 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1e27d73d7427..4832e7f7db82 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>>  				struct buffer_head *bh);
>>  void ext4_set_inode_mapping_order(struct inode *inode);
>> +int ext4_nonda_switch(struct super_block *sb);
>>  #define FALL_BACK_TO_NONDELALLOC 1
>>  #define CONVERT_INLINE_DATA	 2
> 
> <snip>
> 
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, false);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, true);
>> +}
>> +
>> +/*
>> + * On write failure, drop the stale delayed allocation range and release
>> + * its reserved space for both start and end blocks. Otherwise, we may
>> + * leave a range of delayed extents covered by a clean folio, which can
>> + * result in inaccurate space reservation accounting.
>> + */
>> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>> +				     loff_t length, struct iomap *iomap)
>> +{
>> +	down_write(&EXT4_I(inode)->i_data_sem);
>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>> +	up_write(&EXT4_I(inode)->i_data_sem);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>> +					    loff_t length, ssize_t written,
>> +					    unsigned int flags,
>> +					    struct iomap *iomap)
>> +{
>> +	loff_t start_byte, end_byte;
>> +
>> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
>> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> 
> Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> either creates a new IOMAP_DELALLOC extent or finds older ones which
> won't have EXT4_MAP_NEW set
> 

Oops. This is a bug! ext4_da_map_blocks() should set the EXT4_MAP_NEW
flag when it allocates a new delalloc extent. Without this flag, when a
short write occurs we cannot tell whether an extent is a pre-existing
delalloc extent or a newly allocated one. As a result, the subsequent
truncate operation is never executed, and the newly allocated delalloc
extent is left behind. I will fix this in the next iteration.
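
For illustration, the early-exit condition can be modelled in user space
as below. This is only a sketch under stated assumptions: the
model_iomap struct, the MODEL_* constants and model_should_punch() are
hypothetical stand-ins for struct iomap, IOMAP_DELALLOC and IOMAP_F_NEW,
and it assumes IOMAP_F_NEW is only set on the iomap when the mapping
carries EXT4_MAP_NEW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's iomap types and flags. */
enum model_iomap_type {
	MODEL_IOMAP_HOLE,
	MODEL_IOMAP_DELALLOC,
	MODEL_IOMAP_MAPPED,
};

#define MODEL_IOMAP_F_NEW 0x1u

struct model_iomap {
	enum model_iomap_type type;
	unsigned int flags;
};

/*
 * Mirrors the check at the top of the da write_end callback: only a
 * delalloc extent that was newly reserved by this write may be punched.
 * Returns false whenever the callback would return early.
 */
static bool model_should_punch(const struct model_iomap *iomap)
{
	assert(iomap != NULL);

	if (iomap->type != MODEL_IOMAP_DELALLOC ||
	    !(iomap->flags & MODEL_IOMAP_F_NEW))
		return false;
	return true;
}
```

If the NEW flag is never set for delalloc extents, model_should_punch()
returns false for every input, which corresponds to the callback always
returning early and never releasing the reservation.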

Thanks,
Yi.



* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-06-03  2:56     ` Zhang Yi
@ 2026-06-03 11:08       ` Ojaswin Mujoo
  0 siblings, 0 replies; 85+ messages in thread
From: Ojaswin Mujoo @ 2026-06-03 11:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai

On Wed, Jun 03, 2026 at 10:56:34AM +0800, Zhang Yi wrote:
> On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >>    ext4_da_map_blocks() to map delalloc extents.
> >>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >>    ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >>  - Unwritten extents for new blocks (dioread_nolock always on)
> >>    Since data=ordered mode is not used to prevent stale data exposure in
> >>    the non-delayed allocation path, new blocks are always allocated as
> >>    unwritten extents.
> >>
> >>  - Short write and write failure handling
> >>    a. Delalloc path: On short write or failure, the stale delalloc range
> >>       must be dropped and its space reservation released. Otherwise, a
> >>       clean folio may cover leftover delalloc extents, causing
> >>       inaccurate space reservation accounting.
> >>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >>       short write.
> >>
> >>  - Lock ordering reversal
> >>    The folio lock and transaction start ordering is reversed compared to
> >>    the buffer_head buffered write path. To handle this, the journal
> >>    handle must be stopped in iomap_begin() callbacks. The lock ordering
> >>    documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > I went through this again, and after our discussion the changes look
> > okay. Just a small question below, but otherwise feel free to add:
> > 
> > Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Thank you a lot for your careful review!
> 
> > 
> >> ---
> >>  fs/ext4/ext4.h  |   4 ++
> >>  fs/ext4/file.c  |  20 +++++-
> >>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  fs/ext4/super.c |  10 ++-
> >>  4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >>  				struct buffer_head *bh);
> >>  void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >>  #define FALL_BACK_TO_NONDELALLOC 1
> >>  #define CONVERT_INLINE_DATA	 2
> > 
> > <snip>
> > 
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> +				     loff_t length, struct iomap *iomap)
> >> +{
> >> +	down_write(&EXT4_I(inode)->i_data_sem);
> >> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> +	up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> +					    loff_t length, ssize_t written,
> >> +					    unsigned int flags,
> >> +					    struct iomap *iomap)
> >> +{
> >> +	loff_t start_byte, end_byte;
> >> +
> >> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > 
> > Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> > either creates a new IOMAP_DELALLOC extent or finds older ones which
> > won't have EXT4_MAP_NEW set
> > 
> 
> Oops. This is a bug! ext4_da_map_blocks() should set the EXT4_MAP_NEW
> flag when it allocates a new delalloc extent. Without this flag, when a
> short write occurs we cannot tell whether an extent is a pre-existing
> delalloc extent or a newly allocated one. As a result, the subsequent
> truncate operation is never executed, and the newly allocated delalloc
> extent is left behind. I will fix this in the next iteration.

Yes, that's true. I misread the condition and missed that we will always
exit early here :/

Regards,
ojaswin

> 
> Thanks,
> Yi.
> 


* Re: [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]()
  2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
  2026-05-19  6:05   ` Ojaswin Mujoo
@ 2026-06-16  9:31   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16  9:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:22, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor ext4_setattr() by introducing two helper functions,
> ext4_truncate_up() and ext4_truncate_down(), to handle size changes. The
> current ATTR_SIZE processing consolidates checks for both shrinking and
> non-shrinking cases, leading to cluttered code. Separating the
> truncation paths improves readability.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Nice! Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 199 +++++++++++++++++++++++++++---------------------
>  1 file changed, 112 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0751dc55e94f..35e958f89bd5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5855,6 +5855,112 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
>  	}
>  }
>  
> +/*
> + * Set i_size and i_disksize to 'newsize'.
> + *
> + * Both i_rwsem and i_data_sem are required here to avoid races between
> + * generic append writeback and concurrent truncate that also modify
> + * i_size and i_disksize.
> + */
> +static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
> +{
> +	WARN_ON_ONCE(S_ISREG(inode->i_mode) && !inode_is_locked(inode));
> +
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	i_size_write(inode, newsize);
> +	EXT4_I(inode)->i_disksize = newsize;
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_truncate_up(struct inode *inode, loff_t oldsize, loff_t newsize)
> +{
> +	ext4_lblk_t old_lblk, new_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	if (!IS_ALIGNED(oldsize | newsize, i_blocksize(inode))) {
> +		ret = ext4_inode_attach_jinode(inode);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	if (!IS_ALIGNED(oldsize, i_blocksize(inode))) {
> +		ret = ext4_block_zero_eof(inode, oldsize, LLONG_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	old_lblk = oldsize > 0 ? (oldsize - 1) >> inode->i_blkbits : 0;
> +	new_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, old_lblk, new_lblk);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * isize extend must be called outside an active handle due to
> +	 * the lock ordering of transaction start and folio lock in the
> +	 * iomap buffered I/O path (folio lock -> transaction start).
> +	 */
> +	pagecache_isize_extended(inode, oldsize, newsize);
> +	return 0;
> +}
> +
> +static int ext4_truncate_down(struct inode *inode, loff_t oldsize,
> +			      loff_t newsize, int *orphan)
> +{
> +	ext4_lblk_t start_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	/* Do not change i_size. */
> +	if (newsize == oldsize)
> +		goto truncate;
> +
> +	/* Shrink. */
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	if (ext4_handle_valid(handle)) {
> +		ret = ext4_orphan_add(handle, inode);
> +		*orphan = 1;
> +		if (ret) {
> +			ext4_journal_stop(handle);
> +			return ret;
> +		}
> +	}
> +
> +	start_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, start_lblk, EXT_MAX_BLOCKS - 1);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +
> +	if (ext4_should_journal_data(inode))
> +		ext4_wait_for_tail_page_commit(inode);
> +truncate:
> +	/*
> +	 * Truncate pagecache after we've waited for commit in data=journal
> +	 * mode to make pages freeable.  Call ext4_truncate() even if
> +	 * i_size didn't change to truncate possible preallocated blocks.
> +	 */
> +	truncate_pagecache(inode, newsize);
> +	return ext4_truncate(inode);
> +}
> +
>  /*
>   * ext4_setattr()
>   *
> @@ -5951,7 +6057,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	}
>  
>  	if (attr->ia_valid & ATTR_SIZE) {
> -		handle_t *handle;
>  		loff_t oldsize = inode->i_size;
>  		int shrink = (attr->ia_size < inode->i_size);
>  
> @@ -6003,94 +6108,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  			goto err_out;
>  		}
>  
> -		if (attr->ia_size != inode->i_size) {
> -			/* attach jbd2 jinode for EOF folio tail zeroing */
> -			if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
> -			    oldsize & (inode->i_sb->s_blocksize - 1)) {
> -				error = ext4_inode_attach_jinode(inode);
> -				if (error)
> -					goto out_mmap_sem;
> -			}
> -
> -			/*
> -			 * Update c/mtime and tail zero the EOF folio on
> -			 * truncate up. ext4_truncate() handles the shrink case
> -			 * below.
> -			 */
> -			if (!shrink) {
> -				inode_set_mtime_to_ts(inode,
> -						      inode_set_ctime_current(inode));
> -				if (oldsize & (inode->i_sb->s_blocksize - 1)) {
> -					error = ext4_block_zero_eof(inode,
> -							oldsize, LLONG_MAX);
> -					if (error)
> -						goto out_mmap_sem;
> -				}
> -			}
> -
> -			handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> -			if (IS_ERR(handle)) {
> -				error = PTR_ERR(handle);
> -				goto out_mmap_sem;
> -			}
> -			if (ext4_handle_valid(handle) && shrink) {
> -				error = ext4_orphan_add(handle, inode);
> -				orphan = 1;
> -				if (error)
> -					goto out_handle;
> -			}
> -
> -			if (shrink)
> -				ext4_fc_track_range(handle, inode,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits,
> -					EXT_MAX_BLOCKS - 1);
> -			else
> -				ext4_fc_track_range(
> -					handle, inode,
> -					(oldsize > 0 ? oldsize - 1 : oldsize) >>
> -					inode->i_sb->s_blocksize_bits,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits);
> -
> -			/*
> -			 * We have to update i_size under i_data_sem together
> -			 * with i_disksize to avoid races with writeback code
> -			 * updating disksize in mpage_map_and_submit_extent().
> -			 */
> -			down_write(&EXT4_I(inode)->i_data_sem);
> -			i_size_write(inode, attr->ia_size);
> -			EXT4_I(inode)->i_disksize = attr->ia_size;
> -			up_write(&EXT4_I(inode)->i_data_sem);
> -
> -			error = ext4_mark_inode_dirty(handle, inode);
> -out_handle:
> -			ext4_journal_stop(handle);
> -			if (error)
> -				goto out_mmap_sem;
> -			if (!shrink) {
> -				pagecache_isize_extended(inode, oldsize,
> -							 inode->i_size);
> -			} else if (ext4_should_journal_data(inode)) {
> -				ext4_wait_for_tail_page_commit(inode);
> -			}
> +		if (attr->ia_size > oldsize)
> +			error = ext4_truncate_up(inode, oldsize, attr->ia_size);
> +		else {
> +			/* Shrink or do not change i_size. */
> +			error = ext4_truncate_down(inode, oldsize,
> +						   attr->ia_size, &orphan);
>  		}
>  
> -		/*
> -		 * Truncate pagecache after we've waited for commit
> -		 * in data=journal mode to make pages freeable.
> -		 */
> -		truncate_pagecache(inode, inode->i_size);
> -		/*
> -		 * Call ext4_truncate() even if i_size didn't change to
> -		 * truncate possible preallocated blocks.
> -		 */
> -		if (attr->ia_size <= oldsize) {
> -			rc = ext4_truncate(inode);
> -			if (rc)
> -				error = rc;
> -		}
> -out_mmap_sem:
>  		filemap_invalidate_unlock(inode->i_mapping);
>  	}
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr()
  2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
  2026-05-19  6:08   ` Ojaswin Mujoo
@ 2026-06-16  9:36   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16  9:36 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:23, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor the error handling in ext4_setattr() for better clarity:
> 
>  - Return directly on ext4_break_layouts() failure.
>  - Propagate ext4_truncate() errors using the existing error variable
>    and jump to the common 'err_out' label.
>  - Propagate posix_acl_chmod() errors also through the error variable,
>    as it theoretically does not return a non-fatal error.
> 
> With these changes, every error path either returns immediately or jumps
> to err_out. Consequently, the "if (!error)" check guarding setattr_copy()
> and mark_inode_dirty() is never reached with an error set. Remove this
> redundant check, along with the now-unused rc variable.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
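
For readers following along, the control flow the commit message describes
can be modelled in userspace roughly as below. This is only a sketch: the
struct and function names (fake_inode, model_setattr) are hypothetical
stand-ins, and the three *_ret fields merely simulate the return values of
ext4_break_layouts(), the truncate helpers and posix_acl_chmod().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state ext4_setattr() works with. */
struct fake_inode {
	int break_layouts_ret;	/* simulates ext4_break_layouts() */
	int truncate_ret;	/* simulates ext4_truncate_up()/_down() */
	int acl_chmod_ret;	/* simulates posix_acl_chmod() */
	bool attrs_copied;	/* set where setattr_copy() would run */
	bool cleanup_ran;	/* set where the err_out cleanup would run */
};

static int model_setattr(struct fake_inode *inode, bool size_change,
			 bool mode_change)
{
	int error = 0;

	if (size_change) {
		error = inode->break_layouts_ret;
		if (error)
			return error;	/* nothing to clean up yet */

		error = inode->truncate_ret;
		if (error)
			goto err_out;
	}

	/* Only reached with error == 0, so no "if (!error)" guard is needed. */
	inode->attrs_copied = true;

	if (mode_change)
		error = inode->acl_chmod_ret;

err_out:
	inode->cleanup_ran = true;
	return error;
}
```

With a single error variable, each failure either returns before any cleanup
is needed or lands on the one err_out label, which is what makes the second
rc variable redundant.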

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 35e958f89bd5..b1ef706987c3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5989,7 +5989,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		 struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> -	int error, rc = 0;
> +	int error;
>  	int orphan = 0;
>  	const unsigned int ia_valid = attr->ia_valid;
>  	bool inc_ivers = true;
> @@ -6102,10 +6102,10 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  
>  		filemap_invalidate_lock(inode->i_mapping);
>  
> -		rc = ext4_break_layouts(inode);
> -		if (rc) {
> +		error = ext4_break_layouts(inode);
> +		if (error) {
>  			filemap_invalidate_unlock(inode->i_mapping);
> -			goto err_out;
> +			return error;
>  		}
>  
>  		if (attr->ia_size > oldsize)
> @@ -6117,15 +6117,19 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		}
>  
>  		filemap_invalidate_unlock(inode->i_mapping);
> +		if (error)
> +			goto err_out;
>  	}
>  
> -	if (!error) {
> -		if (inc_ivers)
> -			inode_inc_iversion(inode);
> -		setattr_copy(idmap, inode, attr);
> -		mark_inode_dirty(inode);
> -	}
> +	if (inc_ivers)
> +		inode_inc_iversion(inode);
> +	setattr_copy(idmap, inode, attr);
> +	mark_inode_dirty(inode);
>  
> +	if (ia_valid & ATTR_MODE)
> +		error = posix_acl_chmod(idmap, dentry, inode->i_mode);
> +
> +err_out:
>  	/*
>  	 * If the call to ext4_truncate failed to get a transaction handle at
>  	 * all, we need to clean up the in-core orphan list manually.
> @@ -6133,14 +6137,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (orphan && inode->i_nlink)
>  		ext4_orphan_del(NULL, inode);
>  
> -	if (!error && (ia_valid & ATTR_MODE))
> -		rc = posix_acl_chmod(idmap, dentry, inode->i_mode);
> -
> -err_out:
> -	if  (error)
> +	if (error)
>  		ext4_std_error(inode->i_sb, error);
> -	if (!error)
> -		error = rc;
>  	return error;
>  }
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
  2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
  2026-05-19 10:41   ` Ojaswin Mujoo
@ 2026-06-16 10:01   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 10:01 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:27, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The data=ordered mode introduces two fundamental conflicts with the
> iomap buffered write path, leading to potential deadlocks.
> 
> 1) Lock ordering conflict
>    In the iomap writeback path, each folio is processed sequentially:
>    the folio lock is acquired first, followed by starting a transaction
>    to create block mappings. In data=ordered mode, writeback triggered
>    by the journal commit process may attempt to acquire a folio lock
>    that is already held by iomap. Meanwhile, iomap, under that same
>    folio lock, may start a new transaction and wait for the currently
>    committing transaction to finish, resulting in a deadlock.
> 
> 2) Partial folio submission not supported
>    When block size is smaller than folio size, a folio may contain both
>    mapped and unmapped blocks. In data=ordered mode, if the journal
>    waits for such a folio to be written back while the regular writeback
>    process has already started committing it (with the writeback flag
>    set), mapping the remaining unmapped blocks can deadlock. This is
>    because the writeback flag is cleared only after the entire folio is
>    processed and committed.
> 
> To support data=ordered mode, the iomap core would need two invasive
> changes:
>  - Acquire the transaction handle before locking any folio for
>    writeback.
>  - Support partial folio submission.
> 
> Both changes are complicated and risk performance regressions.
> Therefore, we must avoid using data=ordered mode when converting to the
> iomap path.
> 
> Currently, data=ordered mode is used in three scenarios:
>  - Append write
>  - Post-EOF partial block truncate-up followed by append write
>  - Online defragmentation
> 
> We can address the first two without data=ordered mode:
>  - For append write: always allocate unwritten blocks (i.e. always
>    enable dioread_nolock), preserving the behavior of current
>    extent-type inodes.
>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>    until after the zeroed partial block has been written back.
> 
> Online defragmentation does not yet support iomap; this can be resolved
> separately in the future.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
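
The lock ordering conflict in point 1) of the commit message can be pictured
as a cycle in a waits-for graph. The sketch below is a userspace toy model,
not kernel code: the two task names and the model_deadlocks() helper are
hypothetical, and it assumes the worst-case interleaving the commit message
describes (writeback already holds the folio lock when the commit starts).

```c
#include <assert.h>
#include <stdbool.h>

enum { WRITEBACK, COMMIT, NR_TASKS };

/*
 * waits_for[a] == b means task a is blocked until task b makes progress;
 * -1 means the task is not blocked. A cycle means nobody can progress.
 */
static bool has_deadlock(const int waits_for[NR_TASKS])
{
	for (int start = 0; start < NR_TASKS; start++) {
		int cur = start;

		for (int hops = 0; hops < NR_TASKS && cur != -1; hops++) {
			cur = waits_for[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}

static bool model_deadlocks(bool ordered_mode)
{
	int waits_for[NR_TASKS];

	/*
	 * iomap writeback holds the folio lock, starts a handle, and
	 * waits for the committing transaction to finish.
	 */
	waits_for[WRITEBACK] = COMMIT;

	/*
	 * With data=ordered, the commit first writes back the inode's
	 * dirty data, so it blocks on the folio lock held above.
	 * Without it, the commit does not touch data folios.
	 */
	waits_for[COMMIT] = ordered_mode ? WRITEBACK : -1;

	return has_deadlock(waits_for);
}
```

In this model the cycle disappears as soon as the commit no longer depends on
data folios, which is the effect of making ext4_should_order_data() return
false for inodes on the buffered iomap path.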

> ---
>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..26999f173870 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>  
>  static inline int ext4_should_order_data(struct inode *inode)
>  {
> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> +	/*
> +	 * inodes using the iomap buffered I/O path do not use the
> +	 * data=ordered mode.
> +	 */
> +	return !ext4_inode_buffered_iomap(inode) &&
> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>  }
>  
>  static inline int ext4_should_writeback_data(struct inode *inode)
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
  2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
  2026-05-19 10:02   ` Ojaswin Mujoo
@ 2026-06-16 10:04   ` Jan Kara
  2026-06-16 12:37     ` Zhang Yi
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Kara @ 2026-06-16 10:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:26, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The iomap buffered write path does not hold any locks between querying
> inode extent mapping information and performing buffered writes. It
> relies on the sequence counter saved in the inode to detect stale
> mappings.

Now that I'm looking at it again, I'm a bit confused here. The buffered
write path holds i_rwsem between mapping blocks and using them, so there
shouldn't be races. Perhaps you mean the buffered *writeback* path? But
then ext4_da_map_blocks() should never get called in the writeback path
because it allocates delayed blocks... So this change looks unnecessary
to me now. Am I missing something?

								Honza
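
As background for the question above, the stale-mapping detection that the
commit message refers to can be modelled in userspace as follows. This is a
sketch under assumptions: the model_* names are hypothetical, es_seq stands in
for the inode's extent sequence counter (i_es_seq in the series), and the
cookie plays the role of iomap->validity_cookie filled from m_seq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_inode {
	uint64_t es_seq;	/* bumped on every extent tree change */
};

struct model_iomap {
	uint64_t offset;
	uint64_t length;
	uint64_t validity_cookie;	/* es_seq sampled at mapping time */
};

/* Map a range and remember which generation of the extent tree it saw. */
static void model_map(const struct model_inode *inode, uint64_t offset,
		      uint64_t length, struct model_iomap *iomap)
{
	iomap->offset = offset;
	iomap->length = length;
	iomap->validity_cookie = inode->es_seq;
}

/* Any change to the extent tree (truncate, allocation, ...) bumps es_seq. */
static void model_extent_change(struct model_inode *inode)
{
	inode->es_seq++;
}

/* A mapping is usable only if the tree has not changed since it was made. */
static bool model_iomap_valid(const struct model_inode *inode,
			      const struct model_iomap *iomap)
{
	return iomap->validity_cookie == inode->es_seq;
}
```

If a lookup path leaves the cookie unset, the comparison can only succeed by
accident, which is why the callsites that feed a validated mapping need to
pass the counter out. Whether the delalloc callsites actually feed such a
mapping is the open question in this reply.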

> 
> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> blocks") added the m_seq field to ext4_map_blocks to pass out extent
> sequence numbers, but it missed two callsites within
> ext4_da_map_blocks(). These callsites are on the delayed allocation
> path, which is also used in the iomap buffered write path. Pass out the
> sequence counter to ensure stale mappings can be detected.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c4d9137b279..39577a6b65b9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	ext4_check_map_extents_env(inode);
>  
>  	/* Lookup extent status tree firstly */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	 * is held in write mode, before inserting a new da entry in
>  	 * the extent status tree.
>  	 */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
  2026-05-26 17:10   ` Ojaswin Mujoo
  2026-06-02 10:26   ` Ojaswin Mujoo
@ 2026-06-16 10:45   ` Jan Kara
  2026-06-16 14:42     ` Zhang Yi
  2 siblings, 1 reply; 85+ messages in thread
From: Jan Kara @ 2026-06-16 10:45 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:28, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce two new iomap_ops instances for ext4 buffered writes:
> 
>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>    ext4_da_map_blocks() to map delalloc extents.
>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>    ext4_iomap_get_blocks() to directly allocate blocks.
> 
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
> 
> Key changes and considerations:
> 
>  - Unwritten extents for new blocks (dioread_nolock always on)
>    Since data=ordered mode is not used to prevent stale data exposure in
>    the non-delayed allocation path, new blocks are always allocated as
>    unwritten extents.
> 
>  - Short write and write failure handling
>    a. Delalloc path: On short write or failure, the stale delalloc range
>       must be dropped and its space reservation released. Otherwise, a
>       clean folio may cover leftover delalloc extents, causing
>       inaccurate space reservation accounting.
>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>       short write.
> 
>  - Lock ordering reversal
>    The folio lock and transaction start ordering is reversed compared to
>    the buffer_head buffered write path. To handle this, the journal
>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>    documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good to me - besides the IOMAP_F_NEW bugs Ojaswin found. One
observation I have here is that since the old indirect block based on-disk
format doesn't support unwritten extents we can never transition it to the
iomap scheme used here. So we'll have to figure out some way to avoid
maintaining two (actually three if we count data=journal) buffered write /
writeback paths in the long term. But let's address that once things settle
for the common paths.

								Honza
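
To make the short-write handling in the delalloc path concrete, here is a
userspace sketch of the byte range that ext4_iomap_buffered_da_write_end()
would hand to the punch helper. The model_* names are hypothetical, written
is assumed to be non-negative, and model_last_written_block() is assumed to
behave like iomap_last_written_block(): round down when nothing was written,
otherwise round up past the last written byte.

```c
#include <assert.h>
#include <stdint.h>

static int64_t round_down_blk(int64_t x, int64_t bs)
{
	return x - (x % bs);
}

static int64_t round_up_blk(int64_t x, int64_t bs)
{
	return round_down_blk(x + bs - 1, bs);
}

/* Assumed model of iomap_last_written_block(); see the note above. */
static int64_t model_last_written_block(int64_t pos, int64_t written,
					int64_t bs)
{
	if (written == 0)
		return round_down_blk(pos, bs);
	return round_up_blk(pos + written, bs);
}

struct punch_range {
	int64_t start;	/* first byte whose reservation may be dropped */
	int64_t end;	/* one past the last such byte */
};

/* The range is empty (nothing to punch) when start >= end. */
static struct punch_range model_punch_range(int64_t offset, int64_t length,
					    int64_t written, int64_t bs)
{
	struct punch_range r = {
		.start = model_last_written_block(offset, written, bs),
		.end = round_up_blk(offset + length, bs),
	};
	return r;
}
```

For example, with 4096-byte blocks, a 16384-byte mapping at offset 0 that only
received 5000 bytes keeps blocks 0 and 1 and leaves [8192, 16384) as the
candidate range. In the patch, only the parts of that range not covered by
dirty folios are then punched.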

> ---
>  fs/ext4/ext4.h  |   4 ++
>  fs/ext4/file.c  |  20 +++++-
>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  10 ++-
>  4 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
>  void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2
>  
> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>  
>  extern const struct iomap_ops ext4_iomap_ops;
>  extern const struct iomap_ops ext4_iomap_report_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> +extern const struct iomap_write_ops ext4_iomap_write_ops;
>  
>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>  {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..7f9bfbbc4a4e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return count;
>  }
>  
> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> +					 struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	const struct iomap_ops *iomap_ops;
> +
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> +					 &ext4_iomap_write_ops, NULL);
> +}
> +
>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  					struct iov_iter *from)
>  {
> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  	if (ret <= 0)
>  		goto out;
>  
> -	ret = generic_perform_write(iocb, from);
> +	if (ext4_inode_buffered_iomap(inode))
> +		ret = ext4_iomap_buffered_write(iocb, from);
> +	else
> +		ret = generic_perform_write(iocb, from);
>  
>  out:
>  	inode_unlock(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 39577a6b65b9..1ae7d3f4a1c8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> -static int ext4_nonda_switch(struct super_block *sb)
> +int ext4_nonda_switch(struct super_block *sb)
>  {
>  	s64 free_clusters, dirty_clusters;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
>  }
>  
> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> +{
> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> +}
> +
> +const struct iomap_write_ops ext4_iomap_write_ops = {
> +	.iomap_valid = ext4_iomap_valid,
> +};
> +
>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  			   struct ext4_map_blocks *map, loff_t offset,
>  			   loff_t length, unsigned int flags)
> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		iomap->flags |= IOMAP_F_MERGED;
>  
> +	iomap->validity_cookie = map->m_seq;
> +
>  	/*
>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> +/* Map blocks */
> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> +
>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> -		loff_t length, struct ext4_map_blocks *map)
> +		loff_t length, ext4_get_blocks_t get_blocks,
> +		struct ext4_map_blocks *map)
>  {
>  	u8 blkbits = inode->i_blkbits;
>  
> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
>  
> +	if (get_blocks)
> +		return get_blocks(inode, map);
> +
>  	return ext4_map_blocks(NULL, inode, map, 0);
>  }
>  
> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>  		return -ERANGE;
>  
> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>  	if (ret < 0)
>  		return ret;
>  
> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_get_blocks(struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	loff_t i_size = i_size_read(inode);
> +	handle_t *handle;
> +	int ret;
> +
> +	/*
> +	 * Check if the blocks have already been allocated, this could
> +	 * avoid initiating a new journal transaction and return the
> +	 * mapping information directly.
> +	 */
> +	if ((map->m_lblk + map->m_len) <=
> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> +		if (ret < 0)
> +			return ret;
> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> +				    EXT4_MAP_DELAYED))
> +			return 0;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> +			ext4_chunk_trans_blocks(inode, map->m_len));
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	ret = ext4_map_blocks(handle, inode, map,
> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> +	/*
> +	 * Stop handle here following the lock ordering of the folio lock
> +	 * and the transaction start.
> +	 */
> +	ext4_journal_stop(handle);
> +
> +	return ret;
> +}
> +
> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> +{
> +	int ret, retries = 0;
> +	struct ext4_map_blocks map;
> +	ext4_get_blocks_t *get_blocks;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Inline data and non-extent are not supported. */
> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> +		return -ERANGE;
> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> +		return -EINVAL;
> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +		return -EINVAL;
> +
> +	if (delalloc)
> +		get_blocks = ext4_da_map_blocks;
> +	else
> +		get_blocks = ext4_iomap_get_blocks;
> +retry:
> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +	if (ret < 0)
> +		return ret;
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length, struct iomap *iomap)
> +{
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> +					    loff_t length, ssize_t written,
> +					    unsigned int flags,
> +					    struct iomap *iomap)
> +{
> +	loff_t start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> +				     iomap, ext4_iomap_punch_delalloc);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return 0;
> +}
> +
> +/*
> + * Since we always allocate unwritten extents, there is no need for
> + * iomap_end to clean up allocated blocks on a short write.
> + */
> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_write_begin,
> +};
> +
> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> +	.iomap_end = ext4_iomap_buffered_da_write_end,
> +};
> +
>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
>  	.iomap_begin = ext4_iomap_buffered_read_begin,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..9bc294b769db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   *   -> page lock -> i_data_sem (rw)
>   *
>   * buffered write path:
> - * sb_start_write -> i_mutex -> mmap_lock
> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> - *   i_data_sem (rw)
> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> + * - buffer_head path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> + *     i_data_sem (rw)
> + * - iomap path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
>   *
>   * truncate:
>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
  2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
  2026-05-27 12:49   ` Ojaswin Mujoo
@ 2026-06-16 11:47   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 11:47 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:29, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> 
>  - ext4_iomap_writepages(): the main writeback entry point.
>  - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>    block mapping and I/O submission.
>  - A new end I/O worker for converting unwritten extents, updating file
>    size, and handling DATA_ERR_ABORT after I/O completion.
> 
> Core implementation details:
> 
>  - ->writeback_range() callback
>    Calls ext4_iomap_map_writeback_range() to query the longest range of
>    existing mapped extents. For performance, when a block range is not
>    yet allocated, it allocates based on the writeback length and delalloc
>    extent length, rather than allocating for a single folio at a time.
>    The folio is then added to an iomap_ioend instance.
> 
>  - ->writeback_submit() callback
>    Registers ext4_iomap_end_bio() as the end bio callback. This callback
>    schedules a worker to handle:
>    - Unwritten extent conversion.
>    - i_disksize update after data is written back.
>    - Journal abort on writeback I/O failure.
> 
> Key changes and considerations:
> 
> - Append write and unwritten extents
>   Since data=ordered mode is not used to prevent stale data exposure
>   during append writebacks, new blocks are always allocated as unwritten
>   extents (i.e. always enable dioread_nolock), and i_disksize update is
>   postponed until I/O completion. Additionally, the deadlock that the
>   reserve handle was expected to resolve does not occur anymore.
>   Therefore, the end I/O worker can start a normal journal handle
>   instead of a reserve handle when converting unwritten extents.
> 
> - Lock ordering
>   The ->writeback_range() callback runs under the folio lock, requiring
>   the journal handle to be started under that same lock. This reverses
>   the order compared to the buffer_head writeback path. The lock ordering
>   documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h        |   4 +
>  fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>  fs/ext4/super.c       |   7 +-
>  fs/iomap/ioend.c      |   3 +-
>  include/linux/iomap.h |   1 +
>  6 files changed, 346 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4832e7f7db82..078feda47e36 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>  	 */
>  	struct list_head i_rsv_conversion_list;
>  	struct work_struct i_rsv_conversion_work;
> +	struct list_head i_iomap_ioend_list;
> +	struct work_struct i_iomap_ioend_work;

Ugh, this adds 48 bytes to the ext4 inode. That's pretty heavy. Can't we
reuse i_rsv_conversion_list / work for this? For each inode only one of
them should be used AFAICS.
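
The 48-byte figure can be reproduced with a userspace approximation of the
two new fields. The model_* structs below are hypothetical mirrors of
struct list_head and struct work_struct, assuming a 64-bit build without
lockdep (which would add a lockdep_map to work_struct).

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors struct list_head: two pointers. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Mirrors struct work_struct without CONFIG_LOCKDEP. */
struct model_work_struct {
	long data;			/* atomic_long_t in the kernel */
	struct model_list_head entry;
	void (*func)(struct model_work_struct *work);
};

/* The two fields the patch adds to struct ext4_inode_info. */
struct model_added_fields {
	struct model_list_head i_iomap_ioend_list;
	struct model_work_struct i_iomap_ioend_work;
};

static size_t model_added_bytes(void)
{
	return sizeof(struct model_added_fields);
}
```

On an LP64 target this comes to 16 + 32 = 48 bytes per in-memory inode, which
is the cost that sharing the existing i_rsv_conversion_list / work pair would
avoid.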

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ae7d3f4a1c8..a80195bd6f20 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -44,6 +44,7 @@
>  #include <linux/iversion.h>
>  
>  #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
>  #include "xattr.h"
>  #include "acl.h"
>  #include "truncate.h"
> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>  	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>  }
>  
> +static int ext4_iomap_map_one_extent(struct inode *inode,
> +				     struct ext4_map_blocks *map)
> +{
> +	struct extent_status es;
> +	handle_t *handle = NULL;
> +	int credits, map_flags;
> +	int retval;
> +
> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	map->m_flags = 0;
> +	/*
> +	 * It is necessary to look up extent and map blocks under i_data_sem
> +	 * in write mode, otherwise, the delalloc extent may become stale
> +	 * during concurrent truncate operations.
> +	 */
> +	ext4_fc_track_inode(handle, inode);
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> +
> +		if (ext4_es_is_delayed(&es)) {
> +			map->m_flags |= EXT4_MAP_DELAYED;
> +			trace_ext4_da_write_pages_extent(inode, map);
> +			/*
> +			 * Call ext4_map_create_blocks() to allocate any
> +			 * delayed allocation blocks. It is possible that
> +			 * we're going to need more metadata blocks, however
> +			 * we must not fail because we're in writeback and
> +			 * there is nothing we can do so it might result in
> +			 * data loss. So use reserved blocks to allocate
> +			 * metadata if possible.
> +			 */
> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> +				    EXT4_EX_NOCACHE;
> +
> +			retval = ext4_map_create_blocks(handle, inode, map,
> +							map_flags);
> +			if (retval > 0)
> +				ext4_fc_track_range(handle, inode, map->m_lblk,
> +						map->m_lblk + map->m_len - 1);
> +			goto out;
> +		} else if (unlikely(ext4_es_is_hole(&es)))
> +			goto out;
> +
> +		/* Found written or unwritten extent. */
> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> +		map->m_flags = ext4_es_is_written(&es) ?
> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> +		goto out;
> +	}
> +
> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> +out:
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_journal_stop(handle);
> +	return retval < 0 ? retval : 0;
> +}
> +
> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> +					  loff_t offset, unsigned int dirty_len)
> +{
> +	struct inode *inode = wpc->inode;
> +	struct super_block *sb = inode->i_sb;
> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> +	struct ext4_map_blocks map;
> +	unsigned int blkbits = inode->i_blkbits;
> +	unsigned int index = offset >> blkbits;
> +	unsigned int blk_end, blk_len;
> +	int ret;
> +
> +	ret = ext4_emergency_state(sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Check validity of the cached writeback mapping. */
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> +	    ext4_iomap_valid(inode, &wpc->iomap))
> +		return 0;
> +
> +	blk_len = dirty_len >> blkbits;
> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> +				      (UINT_MAX - 1));
> +	if (blk_end > index + blk_len)
> +		blk_len = blk_end - index + 1;
> +
> +retry:
> +	map.m_lblk = index;
> +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> +	ret = ext4_map_blocks(NULL, inode, &map,
> +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * The map is not a delalloc extent, it must either be a hole
> +	 * or an extent which has already been allocated.
> +	 */
> +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> +		goto out;
> +
> +	/* Map one delalloc extent. */
> +	ret = ext4_iomap_map_one_extent(inode, &map);

So it looks somewhat strange that here we call ext4_map_blocks() (which
consults extent status tree and then possibly on-disk extent tree) and then
we call ext4_iomap_map_one_extent() which manipulates the extent
status tree and possibly extent tree as well. Is all this complexity to
avoid starting a jbd2 handle unless really needed? If yes, is that really
worth it? Given iomap code caches the extent we'd start the transaction
only once per mapped extent which shouldn't be that bad?

If you have some benchmark showing this is really worth it, then I'd
probably prefer coming up with an ext4_get_blocks flag which tells it to
start a transaction on its own if we need to allocate blocks... That would
be much simpler than opencoding all this.
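
To illustrate the idea only, here is a minimal user-space sketch of a
mapping routine that starts a handle by itself, and only when it really
has to allocate. Everything in it is an assumption made for the sake of
the example: the MODEL_* flags, struct model_map and struct model_journal
are hypothetical and do not correspond to existing ext4 names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags; neither exists in ext4 under these names. */
#define MODEL_GET_BLOCKS_CREATE		0x1
#define MODEL_GET_BLOCKS_AUTO_HANDLE	0x2

struct model_map {
	unsigned int len;
	bool delayed;	/* delalloc extent awaiting allocation */
	bool mapped;	/* blocks already allocated */
};

struct model_journal {
	unsigned int handles_started;
};

/*
 * Returns the number of blocks mapped, 0 for a hole or an unresolved
 * delalloc extent, or -1 if allocation was requested without a way to
 * obtain a handle.
 */
static int model_map_blocks(struct model_journal *journal,
			    struct model_map *map, unsigned int flags)
{
	assert(journal && map);

	/* Lookup-only cases never touch the journal. */
	if (map->mapped)
		return (int)map->len;
	if (!map->delayed || !(flags & MODEL_GET_BLOCKS_CREATE))
		return 0;

	/* Allocation needed: start the handle here instead of in the caller. */
	if (!(flags & MODEL_GET_BLOCKS_AUTO_HANDLE))
		return -1;
	journal->handles_started++;
	map->delayed = false;
	map->mapped = true;
	return (int)map->len;
}
```

With such a flag the writeback callback would make a single mapping call,
and the handle count would grow only for extents that were still delayed.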

> +	if (ret < 0) {
> +		if (ext4_emergency_state(sb))
> +			return ret;
> +
> +		/*
> +		 * Retry transient ENOSPC errors, if
> +		 * ext4_count_free_blocks() is non-zero, a commit
> +		 * should free up blocks.
> +		 */
> +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> +			jbd2_journal_force_commit_nested(journal);
> +			goto retry;
> +		}
> +
> +		ext4_msg(sb, KERN_CRIT,
> +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> +			 inode->i_ino, (unsigned long long)map.m_lblk,
> +			 (unsigned int)map.m_len, -ret);
> +		ext4_msg(sb, KERN_CRIT,
> +			 "This should not happen!! Data will be lost\n");
> +		if (ret == -ENOSPC)
> +			ext4_print_free_blocks(inode);
> +		return ret;
> +	}
> +out:
> +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> +	return 0;
> +}
> +

...

> +void ext4_iomap_end_bio(struct bio *bio)
> +{
> +	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> +	struct inode *inode = ioend->io_inode;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	unsigned long flags;
> +
> +	/* Needs to convert unwritten extents or update the i_disksize. */
> +	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
> +	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> +		goto defer;
> +
> +	/* Needs to abort the journal on data_err=abort.  */
> +	if (unlikely(ioend->io_bio.bi_status))
> +		goto defer;
> +
> +	iomap_finish_ioend(ioend, 0);
> +	return;
> +defer:
> +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> +	if (list_empty(&ei->i_iomap_ioend_list))
> +		queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
> +	list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
> +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> +}

For now, I'd prefer to do what XFS does and offload everything. Then you
don't have to export iomap_finish_ioend() (which would need to be in a
separate patch and acked by iomap maintainers) and the code is more
standard. There's a patchset in the works which adds general ioend offloading
infrastructure into iomap [1] and when that lands we should get all these
bells and whistles (even better ones with percpu work queues, batching,
etc.) for free.

[1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/
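
A rough user-space model of "offload everything" could look like the
sketch below: the completion side has no fast path and only appends to a
per-inode list, and the worker drains the list. The model_* names are
hypothetical, and the locking that the quoted code does under
i_completed_io_lock is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of "defer every completion to a worker". */
struct model_ioend {
	struct model_ioend *next;
	int error;
	bool finished;
};

struct model_inode {
	struct model_ioend *head;
	struct model_ioend **tail;	/* points at the last next pointer */
	unsigned int work_queued;	/* times the worker was scheduled */
};

static void model_inode_init(struct model_inode *inode)
{
	inode->head = NULL;
	inode->tail = &inode->head;
	inode->work_queued = 0;
}

/* Completion side: no fast path, every ioend goes onto the list. */
static void model_end_bio(struct model_inode *inode, struct model_ioend *ioend)
{
	assert(inode && ioend);
	if (!inode->head)
		inode->work_queued++;	/* schedule once per non-empty batch */
	ioend->next = NULL;
	*inode->tail = ioend;
	inode->tail = &ioend->next;
}

/* Worker side: detach the whole list, then finish each ioend in order. */
static unsigned int model_worker(struct model_inode *inode)
{
	struct model_ioend *ioend = inode->head;
	unsigned int done = 0;

	inode->head = NULL;
	inode->tail = &inode->head;
	while (ioend) {
		struct model_ioend *next = ioend->next;

		ioend->finished = true;
		done++;
		ioend = next;
	}
	return done;
}
```

The worker is scheduled once per batch, when the list goes from empty to
non-empty, which is the same condition the quoted defer: path checks
before calling queue_work().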

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 10/23] ext4: implement mmap path using iomap
  2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
  2026-05-27 12:56   ` Ojaswin Mujoo
@ 2026-06-16 11:56   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 11:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:30, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path
> for ext4. The heavy lifting is delegated to iomap_page_mkwrite(), which
> only requires ext4_iomap_buffered_write_ops and
> ext4_iomap_buffered_da_write_ops to allocate and map blocks.
> 
> Note that the lock ordering between folio lock and transaction start in
> this path is reversed compared to the buffer_head buffered write path.
> The lock ordering documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  8 ++++++--
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a80195bd6f20..c6fe42d012fc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4020,7 +4020,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  		return -ERANGE;
>  	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>  		return -EINVAL;
> -	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +	if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
>  		return -EINVAL;
>  
>  	if (delalloc)
> @@ -4080,6 +4080,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>  		return 0;
>  
> +	/*
> +	 * iomap_page_mkwrite() will never fail in a way that requires delalloc
> +	 * extents that it allocated to be revoked.  Hence never try to release
> +	 * them here.
> +	 */
> +	if (flags & IOMAP_FAULT)
> +		return 0;
> +
>  	/* Nothing to do if we've written the entire delalloc extent */
>  	start_byte = iomap_last_written_block(inode, offset, written);
>  	end_byte = round_up(offset + length, i_blocksize(inode));
> @@ -7191,6 +7199,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>  	return ret;
>  }
>  
> +static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	const struct iomap_ops *iomap_ops;
> +
> +	/*
> +	 * ext4_nonda_switch() could write back this folio, so it has to be
> +	 * called before locking the folio.
> +	 */
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_page_mkwrite(vmf, iomap_ops, NULL);
> +}
> +
>  vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -7213,6 +7238,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(mapping);
>  
> +	if (ext4_inode_buffered_iomap(inode)) {
> +		ret = ext4_iomap_page_mkwrite(vmf);
> +		goto out;
> +	}
> +
>  	err = ext4_convert_inline_data(inode);
>  	if (err)
>  		goto out_ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 51d87db53543..62bfe05a64bc 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   * Lock ordering
>   *
>   * page fault path:
> - * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
> - *   -> page lock -> i_data_sem (rw)
> + * - buffer_head path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     transaction start -> folio lock -> i_data_sem (rw)
> + * - iomap path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     folio lock -> transaction start -> i_data_sem (rw)
>   *
>   * buffered write path:
>   * sb_start_write -> i_rwsem (w) -> mmap_lock
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
  2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
  2026-05-27 13:13   ` Ojaswin Mujoo
@ 2026-06-16 12:28   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 12:28 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> infrastructure for ext4.
> 
> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> out either a mapped partial block or a dirty, unwritten partial block.
> 
> Important constraints:
> 
> Zeroing out under an active journal handle can cause deadlock, because
> the order of acquiring the folio lock and starting a handle is
> inconsistent with the iomap writeback path.
> 
> Therefore, ext4_iomap_block_zero_range():
> - Must NOT be called under an active handle.
> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>   before updating i_disksize (for the cases of post-EOF append write,
>   post-EOF fallocate, and truncate up). In subsequent patches, we will
>   address this by synchronizing commit I/O but not waiting for
>   completion, and updating i_disksize to i_size only after the zeroed
>   data has been written back.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6fe42d012fc..e0dae2501292 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_zero_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);

This looks like a layering violation to me. I don't think you can safely
assume the iomap you're passed is a part of iomap_iter...
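
The concern can be shown with a small user-space sketch. It assumes
nothing about the real iomap structures: model_iter, model_ctx and
model_container_of() are hypothetical stand-ins, and the layouts are
chosen only so that the pointer arithmetic stays inside one object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the layouts are illustrative, not the kernel's. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_iomap {
	long long offset;
	long long length;
};

/* An iterator that embeds the mapping, as the callback assumes. */
struct model_iter {
	void *inode;
	long long pos;
	struct model_iomap iomap;
};

/* Some other caller-owned context that also embeds a mapping. */
struct model_ctx {
	char state[64];
	struct model_iomap iomap;
};

/* What the callback does: assume @iomap lives inside a model_iter. */
static struct model_iter *model_iter_from_iomap(struct model_iomap *iomap)
{
	assert(iomap);
	return model_container_of(iomap, struct model_iter, iomap);
}

/* True only if the recovered pointer is the real parent iterator. */
static bool model_recovers_iter(struct model_iomap *iomap,
				const struct model_iter *expected)
{
	return model_iter_from_iomap(iomap) == expected;
}
```

When the mapping comes from a model_ctx, the recovered "iterator" points
into the middle of unrelated state, and nothing in the callback's
signature lets it tell the two cases apart.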

> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	unsigned int iomap_flags = 0;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> +		return -EINVAL;
> +
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> +
> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (start >> blkbits) - map.m_lblk;
> +	}

... and you need access to iter only for this which seems to be really a
hack that's trying to outsmart the iomap code. I have to admit I don't
fully understand what you are trying to achieve here. Are you trying to
avoid flushing of the range that will be zeroed out?

> +	ret = iomap_zero_range(inode, from, length, did_zero,
> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> +			       NULL);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * TODO: The iomap does not distinguish between different types of
> +	 * zeroing and always sets zero_written if a zeroing operation is
> +	 * performed, which may result in unnecessary order operations.
> +	 */

Is this still true after your fix to did_zero handling?

> +	if (did_zero && zero_written)
> +		*zero_written = *did_zero;
> +
> +	return 0;
> +}
> +
>  /*
>   * Zeros out a mapping of length 'length' starting from file offset
>   * 'from'.  The range to be zero'd must be contained with in one block.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
  2026-05-27 13:14   ` Ojaswin Mujoo
@ 2026-06-16 12:29   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 12:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:35, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add tracepoints for iomap buffered read, write, partial block zeroing,
> and writeback operations to help debug the iomap buffered I/O path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c             |  6 +++++
>  include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e0dae2501292..239d387ffaf2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3961,6 +3961,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
> +					     flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4034,6 +4036,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
> +					      flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4136,6 +4140,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
>  			map.m_len = (start >> blkbits) - map.m_lblk;
>  	}
>  
> +	trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	iomap->flags |= iomap_flags;
>  
> @@ -4308,6 +4313,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>  		return ret;
>  	}
>  out:
> +	trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
>  	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
>  	return 0;
>  }
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f493642cf121..ebafa06cd191 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
>  		  __entry->ret)
>  );
>  
> +DECLARE_EVENT_CLASS(ext4_set_iomap_class,
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
> +		 loff_t offset, loff_t length, unsigned int flags),
> +	TP_ARGS(inode, map, offset, length, flags),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(u64, ino)
> +		__field(ext4_lblk_t, m_lblk)
> +		__field(unsigned int, m_len)
> +		__field(unsigned int, m_flags)
> +		__field(u64, m_seq)
> +		__field(loff_t, offset)
> +		__field(loff_t, length)
> +		__field(unsigned int, iomap_flags)
> +	),
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->m_lblk		= map->m_lblk;
> +		__entry->m_len		= map->m_len;
> +		__entry->m_flags	= map->m_flags;
> +		__entry->m_seq		= map->m_seq;
> +		__entry->offset		= offset;
> +		__entry->length		= length;
> +		__entry->iomap_flags	= flags;
> +
> +	),
> +	TP_printk("dev %d:%d ino %llu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ino, __entry->m_lblk, __entry->m_len,
> +		  show_mflags(__entry->m_flags), __entry->m_seq,
> +		  __entry->offset, __entry->length, __entry->iomap_flags)
> +)
> +
> +#define DEFINE_SET_IOMAP_EVENT(name) \
> +DEFINE_EVENT(ext4_set_iomap_class, name, \
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
> +		 loff_t offset, loff_t length, unsigned int flags), \
> +	TP_ARGS(inode, map, offset, length, flags))
> +
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
> +
>  #endif /* _TRACE_EXT4_H */
>  
>  /* This part must be outside protection */
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 16/23] ext4: disable online defrag when inode using iomap buffered I/O path
  2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
  2026-05-27 13:14   ` Ojaswin Mujoo
@ 2026-06-16 12:30   ` Jan Kara
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Kara @ 2026-06-16 12:30 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai

On Mon 11-05-26 15:23:36, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Online defragmentation does not currently support inodes using the
> iomap buffered I/O path. The existing implementation relies on
> buffer_head for sub-folio block management and data=ordered mode for
> data consistency, both of which are incompatible with the iomap path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Sure. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/move_extent.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 3329b7ad5dbd..f707a1096544 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
>  		return -EOPNOTSUPP;
>  	}
>  
> +	/*
> +	 * TODO: support online defrag for inodes that use the buffered
> +	 * I/O iomap path.
> +	 */
> +	if (ext4_inode_buffered_iomap(orig_inode) ||
> +	    ext4_inode_buffered_iomap(donor_inode)) {
> +		ext4_msg(sb, KERN_ERR,
> +			 "Online defrag not supported for inode with iomap buffered IO path");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
>  		ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %llu, donor %llu]\n",
>  			   orig_inode->i_ino, donor_inode->i_ino);
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
  2026-06-16 10:04   ` Jan Kara
@ 2026-06-16 12:37     ` Zhang Yi
  0 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-06-16 12:37 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

On 6/16/2026 6:04 PM, Jan Kara wrote:
> On Mon 11-05-26 15:23:26, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The iomap buffered write path does not hold any locks between querying
>> inode extent mapping information and performing buffered writes. It
>> relies on the sequence counter saved in the inode to detect stale
>> mappings.
> 
> Now that I'm looking at it again, I've got a bit confused here. Buffered
> write path is holding i_rwsem between mapping blocks and using them so
> there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
> then ext4_da_map_blocks() should not ever get called in the writeback path
> because it is allocating delayed blocks... So this change looks unnecessary
> to me now. Am I missing something?
> 
> 								Honza
> 

Hi Jan,

Thanks for coming back to this series. Sorry for the inaccurate
description in the commit message. However, this change is still needed.

As mentioned in the comment before the ->iomap_valid() callback in
iomap_write_begin(), consider a buffered write to a dirty unwritten
extent that races with writeback as follows:

   Buffered write (holds i_rwsem)  Writeback (no i_rwsem)
   ext4_da_map_blocks()
     // ext4_es_lookup_extent()
     // finds UNWRITTEN extent
   ext4_set_iomap()
     // type = IOMAP_UNWRITTEN
     // validity_cookie = m_seq
                                   ext4_iomap_writepages()
                                     // writeback for unwritten extent
                                     // ext4_convert_unwritten_extents()
                                     // extent tree: UNWRITTEN → WRITTEN
                                     // advancing i_es_seq
   __iomap_write_begin()
     // ext4_iomap_valid(): cookie != i_es_seq
     // iomap invalidated, re-maps
     // gets type = IOMAP_MAPPED (WRITTEN)
     // iomap_block_needs_zeroing(): FALSE

Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
out blocks that have already been written, corrupting data on partial
writes.
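
The sequence above can be modelled in user space as a minimal sketch.
The model_* types and helpers are hypothetical; only the cookie
comparison mirrors ext4_iomap_valid() from patch 08, and "needs zeroing"
is simplified to "the cached type is unwritten".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's types. */
enum model_type { MODEL_UNWRITTEN, MODEL_MAPPED };

struct model_inode {
	uint64_t es_seq;		/* stands in for i_es_seq */
	enum model_type extent;		/* current state of the extent */
};

struct model_iomap {
	enum model_type type;
	uint64_t validity_cookie;
};

/* Mapping step: record the extent type together with the current sequence. */
static void model_map(const struct model_inode *inode,
		      struct model_iomap *iomap)
{
	assert(inode && iomap);
	iomap->type = inode->extent;
	iomap->validity_cookie = inode->es_seq;
}

/* Same comparison as ext4_iomap_valid(): the cookie must match the counter. */
static bool model_valid(const struct model_inode *inode,
			const struct model_iomap *iomap)
{
	return iomap->validity_cookie == inode->es_seq;
}

/* Writeback converts UNWRITTEN to WRITTEN and advances the sequence. */
static void model_writeback_convert(struct model_inode *inode)
{
	inode->extent = MODEL_MAPPED;
	inode->es_seq++;
}

/*
 * Write-begin step: re-map if the cached mapping is stale, then report
 * whether a partial block would be zeroed (simplified to "unwritten").
 */
static bool model_write_begin_zeroes(const struct model_inode *inode,
				     struct model_iomap *iomap)
{
	if (!model_valid(inode, iomap))
		model_map(inode, iomap);
	return iomap->type == MODEL_UNWRITTEN;
}
```

In this model the conversion in writeback makes the cached mapping
invalid, the write side re-maps, and the already written block is not
zeroed. The check only works because the cookie was filled in at mapping
time.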

Thanks,
Yi

>>
>> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
>> blocks") added the m_seq field to ext4_map_blocks to pass out extent
>> sequence numbers, but it missed two callsites within
>> ext4_da_map_blocks(). These callsites are on the delayed allocation
>> path, which is also used in the iomap buffered write path. Pass out the
>> sequence counter to ensure stale mappings can be detected.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> ---
>>   fs/ext4/inode.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6c4d9137b279..39577a6b65b9 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>   	ext4_check_map_extents_env(inode);
>>   
>>   	/* Lookup extent status tree firstly */
>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>   		map->m_len = min_t(unsigned int, map->m_len,
>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>   
>> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>   	 * is held in write mode, before inserting a new da entry in
>>   	 * the extent status tree.
>>   	 */
>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>   		map->m_len = min_t(unsigned int, map->m_len,
>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>   
>> -- 
>> 2.52.0
>>



* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
  2026-06-16 10:45   ` Jan Kara
@ 2026-06-16 14:42     ` Zhang Yi
  0 siblings, 0 replies; 85+ messages in thread
From: Zhang Yi @ 2026-06-16 14:42 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai

On 6/16/2026 6:45 PM, Jan Kara wrote:
> On Mon 11-05-26 15:23:28, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>
>>   - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>     ext4_da_map_blocks() to map delalloc extents.
>>   - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>     ext4_iomap_get_blocks() to directly allocate blocks.
>>
>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>> validity.
>>
>> Key changes and considerations:
>>
>>   - Unwritten extents for new blocks (dioread_nolock always on)
>>     Since data=ordered mode is not used to prevent stale data exposure in
>>     the non-delayed allocation path, new blocks are always allocated as
>>     unwritten extents.
>>
>>   - Short write and write failure handling
>>     a. Delalloc path: On short write or failure, the stale delalloc range
>>        must be dropped and its space reservation released. Otherwise, a
>>        clean folio may cover leftover delalloc extents, causing
>>        inaccurate space reservation accounting.
>>     b. Non-delalloc path: No cleanup of allocated blocks is needed on
>>        short write.
>>
>>   - Lock ordering reversal
>>     The folio lock and transaction start ordering is reversed compared to
>>     the buffer_head buffered write path. To handle this, the journal
>>     handle must be stopped in iomap_begin() callbacks. The lock ordering
>>     documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Looks good to me - besides the IOMAP_F_NEW bugs Ojaswin found. One
> observation I have here is that since the old indirect block based on-disk
> format doesn't support unwritten extents we can never transition it to the
> iomap scheme used here. So we'll have to figure out some way to avoid
> maintaining two (actually three if we count data=journal) buffered write /
> writeback paths in the long term. But let's address that once things settle
> for the common paths.
> 
> 								Honza

Yes, I agree. In the future, we need to convert the buffered I/O path
to iomap as much as possible to reduce maintenance costs. For the ext3
filesystem, which does not support unwritten extents, I haven't given
much thought to a conversion plan at the moment. Perhaps we could
implement it through the "delay map" we discussed earlier in v2. After
this patch set is finished, we can take a closer look at the solution.
:-)

Best Regards,
Yi.

> 
>> ---
>>   fs/ext4/ext4.h  |   4 ++
>>   fs/ext4/file.c  |  20 +++++-
>>   fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/ext4/super.c |  10 ++-
>>   4 files changed, 192 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1e27d73d7427..4832e7f7db82 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>>   int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>>   				struct buffer_head *bh);
>>   void ext4_set_inode_mapping_order(struct inode *inode);
>> +int ext4_nonda_switch(struct super_block *sb);
>>   #define FALL_BACK_TO_NONDELALLOC 1
>>   #define CONVERT_INLINE_DATA	 2
>>   
>> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>>   
>>   extern const struct iomap_ops ext4_iomap_ops;
>>   extern const struct iomap_ops ext4_iomap_report_ops;
>> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
>> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
>> +extern const struct iomap_write_ops ext4_iomap_write_ops;
>>   
>>   static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>>   {
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index eb1a323962b1..7f9bfbbc4a4e 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>   	return count;
>>   }
>>   
>> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
>> +					 struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	const struct iomap_ops *iomap_ops;
>> +
>> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
>> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
>> +	else
>> +		iomap_ops = &ext4_iomap_buffered_write_ops;
>> +
>> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
>> +					 &ext4_iomap_write_ops, NULL);
>> +}
>> +
>>   static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>>   					struct iov_iter *from)
>>   {
>> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>>   	if (ret <= 0)
>>   		goto out;
>>   
>> -	ret = generic_perform_write(iocb, from);
>> +	if (ext4_inode_buffered_iomap(inode))
>> +		ret = ext4_iomap_buffered_write(iocb, from);
>> +	else
>> +		ret = generic_perform_write(iocb, from);
>>   
>>   out:
>>   	inode_unlock(inode);
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 39577a6b65b9..1ae7d3f4a1c8 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>>   	return ret;
>>   }
>>   
>> -static int ext4_nonda_switch(struct super_block *sb)
>> +int ext4_nonda_switch(struct super_block *sb)
>>   {
>>   	s64 free_clusters, dirty_clusters;
>>   	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>>   	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
>>   }
>>   
>> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
>> +{
>> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
>> +}
>> +
>> +const struct iomap_write_ops ext4_iomap_write_ops = {
>> +	.iomap_valid = ext4_iomap_valid,
>> +};
>> +
>>   static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>   			   struct ext4_map_blocks *map, loff_t offset,
>>   			   loff_t length, unsigned int flags)
>> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>   	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>   		iomap->flags |= IOMAP_F_MERGED;
>>   
>> +	iomap->validity_cookie = map->m_seq;
>> +
>>   	/*
>>   	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>>   	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
>> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>   	.iomap_begin = ext4_iomap_begin_report,
>>   };
>>   
>> +/* Map blocks */
>> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
>> +
>>   static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>> -		loff_t length, struct ext4_map_blocks *map)
>> +		loff_t length, ext4_get_blocks_t get_blocks,
>> +		struct ext4_map_blocks *map)
>>   {
>>   	u8 blkbits = inode->i_blkbits;
>>   
>> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>>   	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>   			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
>>   
>> +	if (get_blocks)
>> +		return get_blocks(inode, map);
>> +
>>   	return ext4_map_blocks(NULL, inode, map, 0);
>>   }
>>   
>> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>>   	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>>   		return -ERANGE;
>>   
>> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>   	if (ret < 0)
>>   		return ret;
>>   
>> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>>   	return 0;
>>   }
>>   
>> +static int ext4_iomap_get_blocks(struct inode *inode,
>> +				 struct ext4_map_blocks *map)
>> +{
>> +	loff_t i_size = i_size_read(inode);
>> +	handle_t *handle;
>> +	int ret;
>> +
>> +	/*
>> +	 * Check if the blocks have already been allocated, this could
>> +	 * avoid initiating a new journal transaction and return the
>> +	 * mapping information directly.
>> +	 */
>> +	if ((map->m_lblk + map->m_len) <=
>> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
>> +		ret = ext4_map_blocks(NULL, inode, map, 0);
>> +		if (ret < 0)
>> +			return ret;
>> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
>> +				    EXT4_MAP_DELAYED))
>> +			return 0;
>> +	}
>> +
>> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
>> +			ext4_chunk_trans_blocks(inode, map->m_len));
>> +	if (IS_ERR(handle))
>> +		return PTR_ERR(handle);
>> +
>> +	ret = ext4_map_blocks(handle, inode, map,
>> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
>> +	/*
>> +	 * Stop handle here following the lock ordering of the folio lock
>> +	 * and the transaction start.
>> +	 */
>> +	ext4_journal_stop(handle);
>> +
>> +	return ret;
>> +}
>> +
>> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
>> +{
>> +	int ret, retries = 0;
>> +	struct ext4_map_blocks map;
>> +	ext4_get_blocks_t *get_blocks;
>> +
>> +	ret = ext4_emergency_state(inode->i_sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	/* Inline data and non-extent are not supported. */
>> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>> +		return -ERANGE;
>> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>> +		return -EINVAL;
>> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
>> +		return -EINVAL;
>> +
>> +	if (delalloc)
>> +		get_blocks = ext4_da_map_blocks;
>> +	else
>> +		get_blocks = ext4_iomap_get_blocks;
>> +retry:
>> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
>> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> +		goto retry;
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, false);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, true);
>> +}
>> +
>> +/*
>> + * On write failure, drop the stale delayed allocation range and release
>> + * its reserved space for both start and end blocks. Otherwise, we may
>> + * leave a range of delayed extents covered by a clean folio, which can
>> + * result in inaccurate space reservation accounting.
>> + */
>> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>> +				     loff_t length, struct iomap *iomap)
>> +{
>> +	down_write(&EXT4_I(inode)->i_data_sem);
>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>> +	up_write(&EXT4_I(inode)->i_data_sem);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>> +					    loff_t length, ssize_t written,
>> +					    unsigned int flags,
>> +					    struct iomap *iomap)
>> +{
>> +	loff_t start_byte, end_byte;
>> +
>> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
>> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>> +		return 0;
>> +
>> +	/* Nothing to do if we've written the entire delalloc extent */
>> +	start_byte = iomap_last_written_block(inode, offset, written);
>> +	end_byte = round_up(offset + length, i_blocksize(inode));
>> +	if (start_byte >= end_byte)
>> +		return 0;
>> +
>> +	filemap_invalidate_lock(inode->i_mapping);
>> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
>> +				     iomap, ext4_iomap_punch_delalloc);
>> +	filemap_invalidate_unlock(inode->i_mapping);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Since we always allocate unwritten extents, there is no need for
>> + * iomap_end to clean up allocated blocks on a short write.
>> + */
>> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_write_begin,
>> +};
>> +
>> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
>> +	.iomap_end = ext4_iomap_buffered_da_write_end,
>> +};
>> +
>>   const struct iomap_ops ext4_iomap_buffered_read_ops = {
>>   	.iomap_begin = ext4_iomap_buffered_read_begin,
>>   };
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 6a77db4d3124..9bc294b769db 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
>>    *   -> page lock -> i_data_sem (rw)
>>    *
>>    * buffered write path:
>> - * sb_start_write -> i_mutex -> mmap_lock
>> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
>> - *   i_data_sem (rw)
>> + * sb_start_write -> i_rwsem (w) -> mmap_lock
>> + * - buffer_head path:
>> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
>> + *     i_data_sem (rw)
>> + * - iomap path:
>> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
>> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
>>    *
>>    * truncate:
>>    * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
>> -- 
>> 2.52.0
>>
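
A side note on the range arithmetic in ext4_iomap_map_blocks() quoted
above: the byte range [offset, offset + length) is turned into an
inclusive logical block range and clamped at EXT4_MAX_LOGICAL_BLOCK.
Below is a minimal user-space sketch of that conversion, not the kernel
code. It assumes m_lblk is computed as offset >> blkbits (that line is
outside the quoted hunk). struct blk_range, byte_range_to_blocks() and
the value of MAX_LOGICAL_BLOCK are illustrative stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for EXT4_MAX_LOGICAL_BLOCK; the value here is an assumption. */
#define MAX_LOGICAL_BLOCK 0xFFFFFFFEULL

/* Hypothetical result type: first logical block and block count. */
struct blk_range {
	uint64_t lblk;
	uint64_t len;
};

/*
 * Convert a byte range into the block range that covers it.
 * The last block is computed from the last byte (offset + length - 1),
 * so a range ending exactly on a block boundary does not spill into
 * the following block.
 */
static struct blk_range byte_range_to_blocks(uint64_t offset, uint64_t length,
					     unsigned int blkbits)
{
	struct blk_range r;
	uint64_t last;

	assert(length > 0);
	r.lblk = offset >> blkbits;
	assert(r.lblk <= MAX_LOGICAL_BLOCK);

	last = (offset + length - 1) >> blkbits;
	if (last > MAX_LOGICAL_BLOCK)
		last = MAX_LOGICAL_BLOCK;

	r.len = last - r.lblk + 1;
	return r;
}
```

With 4 KiB blocks (blkbits = 12), a 2-byte write at offset 4095 straddles
a block boundary and therefore covers two blocks, while a 4096-byte write
at offset 4096 covers exactly one.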

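To make the short-write handling in ext4_iomap_buffered_da_write_end()
easier to follow, here is a user-space sketch of how the range to punch
is derived. last_written_block() models iomap_last_written_block() as I
understand it (round down when nothing was written, otherwise round up
past the last written byte); treat that as an assumption. The helper
names are hypothetical, and the IOMAP_DELALLOC / IOMAP_F_NEW checks from
the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x down/up to a multiple of bs; bs must be a power of two. */
static uint64_t round_down_blk(uint64_t x, uint64_t bs)
{
	return x & ~(bs - 1);
}

static uint64_t round_up_blk(uint64_t x, uint64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/* Assumed model of iomap_last_written_block(). */
static uint64_t last_written_block(uint64_t pos, uint64_t written, uint64_t bs)
{
	if (written == 0)
		return round_down_blk(pos, bs);
	return round_up_blk(pos + written, bs);
}

/*
 * Compute the byte range [*start, *end) whose delalloc reservation
 * should be released after a short write. Returns false when the whole
 * extent was written and nothing needs punching.
 */
static bool short_write_punch_range(uint64_t offset, uint64_t length,
				    uint64_t written, uint64_t bs,
				    uint64_t *start, uint64_t *end)
{
	assert(bs != 0 && (bs & (bs - 1)) == 0);
	assert(written <= length);

	*start = last_written_block(offset, written, bs);
	*end = round_up_blk(offset + length, bs);
	return *start < *end;
}
```

Note that a partially written block is kept: with 4 KiB blocks, writing
100 bytes of an 8192-byte extent at offset 0 leaves [4096, 8192) to be
released, not [100, 8192).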

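For readers unfamiliar with the validity-cookie pattern used by
ext4_iomap_valid() in the quoted patch, the following is a toy,
single-threaded model. It assumes the per-inode sequence counter is
bumped whenever the extent state changes. The toy_* types and functions
are hypothetical and only mirror the shape of the kernel code; the
READ_ONCE() and locking aspects are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the inode's i_es_seq and iomap->validity_cookie. */
struct toy_inode {
	uint64_t es_seq;
};

struct toy_iomap {
	uint64_t validity_cookie;
};

/* Record the sequence number observed at mapping time in the iomap. */
static void toy_set_iomap(const struct toy_inode *inode,
			  struct toy_iomap *iomap)
{
	assert(inode && iomap);
	iomap->validity_cookie = inode->es_seq;
}

/* Any change to the extent state invalidates older mappings. */
static void toy_extent_changed(struct toy_inode *inode)
{
	inode->es_seq++;
}

/*
 * A cached mapping is still usable only if no extent change happened
 * since it was created, i.e. the cookie matches the current counter.
 */
static bool toy_iomap_valid(const struct toy_inode *inode,
			    const struct toy_iomap *iomap)
{
	return iomap->validity_cookie == inode->es_seq;
}
```

In this model, a mapping taken at sequence 7 stays valid until
toy_extent_changed() runs. After that, the caller is expected to drop
the stale mapping and map the range again.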

end of thread, other threads:[~2026-06-16 14:43 UTC | newest]

Thread overview: 85+ messages
2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
2026-05-19  5:24   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
2026-05-19  6:05   ` Ojaswin Mujoo
2026-06-16  9:31   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
2026-05-19  6:08   ` Ojaswin Mujoo
2026-06-16  9:36   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
2026-05-19  9:28   ` Ojaswin Mujoo
2026-05-19 12:35     ` Zhang Yi
2026-05-19 16:53       ` Ojaswin Mujoo
2026-05-20  2:49         ` Zhang Yi
2026-05-26 17:11           ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
2026-05-19 10:00   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
2026-05-19 10:02   ` Ojaswin Mujoo
2026-06-16 10:04   ` Jan Kara
2026-06-16 12:37     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
2026-05-19 10:41   ` Ojaswin Mujoo
2026-05-19 13:31     ` Ojaswin Mujoo
2026-05-20  8:18       ` Zhang Yi
2026-05-20 13:17         ` Ojaswin Mujoo
2026-06-16 10:01   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
2026-05-26 17:10   ` Ojaswin Mujoo
2026-05-28 15:40     ` Darrick J. Wong
2026-06-02  7:02       ` Ojaswin Mujoo
2026-05-29  9:13     ` Zhang Yi
2026-06-02 10:05       ` Ojaswin Mujoo
2026-06-03  1:44         ` Zhang Yi
2026-06-02 10:26   ` Ojaswin Mujoo
2026-06-03  2:56     ` Zhang Yi
2026-06-03 11:08       ` Ojaswin Mujoo
2026-06-16 10:45   ` Jan Kara
2026-06-16 14:42     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
2026-05-27 12:49   ` Ojaswin Mujoo
2026-05-29 14:12     ` Zhang Yi
2026-05-29 15:32       ` Ojaswin Mujoo
2026-05-30  1:21         ` Zhang Yi
2026-06-01  6:26           ` Ojaswin Mujoo
2026-06-16 11:47   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
2026-05-27 12:56   ` Ojaswin Mujoo
2026-06-16 11:56   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
2026-05-11  7:46   ` Christoph Hellwig
2026-05-11  8:57     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 12/23] iomap: support invalidating partial folios Zhang Yi
2026-05-11  7:23 ` [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter() Zhang Yi
2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
2026-05-27 13:13   ` Ojaswin Mujoo
2026-06-16 12:28   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:29   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:30   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
2026-05-27 13:41   ` Ojaswin Mujoo
2026-05-30  2:53     ` Zhang Yi
2026-06-01  9:08       ` Ojaswin Mujoo
2026-06-01 12:22         ` Zhang Yi
2026-06-01 18:15           ` Ojaswin Mujoo
2026-06-02  3:36             ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
2026-05-27 15:58   ` Ojaswin Mujoo
2026-05-28 13:34     ` Ojaswin Mujoo
2026-05-30  9:32       ` Zhang Yi
2026-06-02  5:56         ` Ojaswin Mujoo
2026-05-30  7:22     ` Zhang Yi
2026-05-30  8:24       ` Zhang Yi
2026-06-01 18:33         ` Ojaswin Mujoo
2026-06-02  3:22           ` Zhang Yi
2026-06-02  5:35             ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion Zhang Yi
2026-05-11  7:23 ` [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range Zhang Yi
2026-05-11  7:23 ` [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2026-05-11  7:23 ` [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
