[PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
@ 2025-11-21  6:07 Zhang Yi
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
                   ` (14 more replies)
  0 siblings, 15 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

Changes since v1:
 - Rebase the codes based on the latest linux-next 20251120.
 - Add patches 01-05, fix two stale data problems caused by
   EXT4_EXT_MAY_ZEROOUT when splitting extent.
 - Add patches 06-07, fix two stale extent status entries problems also
   caused by splitting extent.
 - Modify patches 08-10, extend __es_remove_extent() and
   ext4_es_cache_extent() to allow them to overwrite existing extents of
   the same status when caching on-disk extents, while also checking
   extents of different stauts and raising alarms to prevent misuse.
 - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
   remove the TODO comment in it.

v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/

Original Description

This series addresses the optimization that Jan pointed out [1]
regarding the introduction of a sequence number to
ext4_es_insert_extent(). The proposal is to replace all instances where
the cache of on-disk extents is updated by using ext4_es_cache_extent()
instead of ext4_es_insert_extent(). This change can prevent excessive
cache invalidations caused by unnecessarily increasing the extent
sequence number when reading from the on-disk extent tree.

[1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/

Cheers,
Yi.

Zhang Yi (13):
  ext4: cleanup zeroout in ext4_split_extent_at()
  ext4: subdivide EXT4_EXT_DATA_VALID1
  ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
    submitting I/O
  ext4: correct the mapping status if the extent has been zeroed
  ext4: don't cache extent during splitting extent
  ext4: drop extent cache before splitting extent
  ext4: cleanup useless out tag in __es_remove_extent()
  ext4: make __es_remove_extent() check extent status
  ext4: make ext4_es_cache_extent() support overwrite existing extents
  ext4: adjust the debug info in ext4_es_cache_extent()
  ext4: replace ext4_es_insert_extent() when caching on-disk extents
  ext4: drop the TODO comment in ext4_es_insert_extent()

 fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
 fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
 fs/ext4/inode.c          |  18 +++---
 3 files changed, 176 insertions(+), 90 deletions(-)

-- 
2.46.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at()
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
@ 2025-11-21  6:07 ` Zhang Yi
  2025-11-26 11:26   ` Ojaswin Mujoo
                     ` (2 more replies)
  2025-11-21  6:08 ` [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1 Zhang Yi
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

zero_ex is a temporary variable used only for writing zeros and
inserting extent status entry, it will not be directly inserted into the
tree. Therefore, it can be assigned values from the target extent in
various scenarios, eliminating the need to explicitly assign values to
each variable individually.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 63 ++++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c7d219e6c6d8..91682966597d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3278,46 +3278,31 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 	ex = path[depth].p_ext;
 
 	if (EXT4_EXT_MAY_ZEROOUT & split_flag) {
-		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
-			if (split_flag & EXT4_EXT_DATA_VALID1) {
-				err = ext4_ext_zeroout(inode, ex2);
-				zero_ex.ee_block = ex2->ee_block;
-				zero_ex.ee_len = cpu_to_le16(
-						ext4_ext_get_actual_len(ex2));
-				ext4_ext_store_pblock(&zero_ex,
-						      ext4_ext_pblock(ex2));
-			} else {
-				err = ext4_ext_zeroout(inode, ex);
-				zero_ex.ee_block = ex->ee_block;
-				zero_ex.ee_len = cpu_to_le16(
-						ext4_ext_get_actual_len(ex));
-				ext4_ext_store_pblock(&zero_ex,
-						      ext4_ext_pblock(ex));
-			}
-		} else {
-			err = ext4_ext_zeroout(inode, &orig_ex);
-			zero_ex.ee_block = orig_ex.ee_block;
-			zero_ex.ee_len = cpu_to_le16(
-						ext4_ext_get_actual_len(&orig_ex));
-			ext4_ext_store_pblock(&zero_ex,
-					      ext4_ext_pblock(&orig_ex));
-		}
+		if (split_flag & EXT4_EXT_DATA_VALID1)
+			memcpy(&zero_ex, ex2, sizeof(zero_ex));
+		else if (split_flag & EXT4_EXT_DATA_VALID2)
+			memcpy(&zero_ex, ex, sizeof(zero_ex));
+		else
+			memcpy(&zero_ex, &orig_ex, sizeof(zero_ex));
 
-		if (!err) {
-			/* update the extent length and mark as initialized */
-			ex->ee_len = cpu_to_le16(ee_len);
-			ext4_ext_try_to_merge(handle, inode, path, ex);
-			err = ext4_ext_dirty(handle, inode, path + path->p_depth);
-			if (!err)
-				/* update extent status tree */
-				ext4_zeroout_es(inode, &zero_ex);
-			/* If we failed at this point, we don't know in which
-			 * state the extent tree exactly is so don't try to fix
-			 * length of the original extent as it may do even more
-			 * damage.
-			 */
-			goto out;
-		}
+		err = ext4_ext_zeroout(inode, &zero_ex);
+		if (err)
+			goto fix_extent_len;
+
+		/* update the extent length and mark as initialized */
+		ex->ee_len = cpu_to_le16(ee_len);
+		ext4_ext_try_to_merge(handle, inode, path, ex);
+		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
+		if (!err)
+			/* update extent status tree */
+			ext4_zeroout_es(inode, &zero_ex);
+		/*
+		 * If we failed at this point, we don't know in which
+		 * state the extent tree exactly is so don't try to fix
+		 * length of the original extent as it may do even more
+		 * damage.
+		 */
+		goto out;
 	}
 
 fix_extent_len:
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 11:27   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and
it is necessary to split the target extent in the middle,
ext4_split_extent() first handles splitting the latter half of the
extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that
all blocks before the split point contain valid data; however, this
assumption is incorrect.

Therefore, subdivid EXT4_EXT_DATA_VALID1 into
EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which
indicate that the first half of the extent is either entirely valid or
only partially valid, respectively. These two flags cannot be set
simultaneously.

This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces
EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location
where it is set, no logical changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 91682966597d..f7aa497e5d6c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -43,8 +43,13 @@
 #define EXT4_EXT_MARK_UNWRIT1	0x2  /* mark first half unwritten */
 #define EXT4_EXT_MARK_UNWRIT2	0x4  /* mark second half unwritten */
 
-#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
-#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
+/* first half contains valid data */
+#define EXT4_EXT_DATA_ENTIRE_VALID1	0x8   /* has partially valid data */
+#define EXT4_EXT_DATA_PARTIAL_VALID1	0x10  /* has entirely valid data */
+#define EXT4_EXT_DATA_VALID1		(EXT4_EXT_DATA_ENTIRE_VALID1 | \
+					 EXT4_EXT_DATA_PARTIAL_VALID1)
+
+#define EXT4_EXT_DATA_VALID2	0x20 /* second half contains valid data */
 
 static __le32 ext4_extent_block_csum(struct inode *inode,
 				     struct ext4_extent_header *eh)
@@ -3190,8 +3195,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 	unsigned int ee_len, depth;
 	int err = 0;
 
-	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
-	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
+	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) == EXT4_EXT_DATA_VALID1);
+	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
+	       (split_flag & EXT4_EXT_DATA_VALID2));
 
 	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
 
@@ -3358,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
 				       EXT4_EXT_MARK_UNWRIT2;
 		if (split_flag & EXT4_EXT_DATA_VALID2)
-			split_flag1 |= EXT4_EXT_DATA_VALID1;
+			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
 		path = ext4_split_extent_at(handle, inode, path,
 				map->m_lblk + map->m_len, split_flag1, flags1);
 		if (IS_ERR(path))
@@ -3717,7 +3723,7 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 
 	/* Convert to unwritten */
 	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
-		split_flag |= EXT4_EXT_DATA_VALID1;
+		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
 	/* Convert to initialized */
 	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
 		split_flag |= ee_block + ee_len <= eof_block ?
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
  2025-11-21  6:08 ` [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1 Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 11:29   ` Ojaswin Mujoo
  2025-11-27 13:41   ` Jan Kara
  2025-11-21  6:08 ` [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O Zhang Yi
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When allocating initialized blocks from a large unwritten extent, or
when splitting an unwritten extent during end I/O and converting it to
initialized, there is currently a potential issue of stale data if the
extent needs to be split in the middle.

       0  A      B  N
       [UUUUUUUUUUUU]    U: unwritten extent
       [--DDDDDDDD--]    D: valid data
          |<-  ->| ----> this range needs to be initialized

ext4_split_extent() first try to split this extent at B with
EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
ext4_split_extent_at() failed to split this extent due to temporary lack
of space. It zeroout B to N and mark the entire extent from 0 to N
as written.

       0  A      B  N
       [WWWWWWWWWWWW]    W: written extent
       [SSDDDDDDDDZZ]    Z: zeroed, S: stale data

ext4_split_extent() then try to split this extent at A with
EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
a stale written extent from 0 to A.

       0  A      B   N
       [WW|WWWWWWWWWW]
       [SS|DDDDDDDDZZ]

Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
when splitting at B, don't convert the entire extent to written and left
it as unwritten after zeroing out B to N. The remaining work is just
like the standard two-part split. ext4_split_extent() will pass the
EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
second time, allowing it to properly handle the split. If the split is
successful, it will keep extent from 0 to A as unwritten.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7aa497e5d6c..cafe66cb562f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 		err = ext4_ext_zeroout(inode, &zero_ex);
 		if (err)
 			goto fix_extent_len;
+		/*
+		 * The first half contains partially valid data, the splitting
+		 * of this extent has not been completed, fix extent length
+		 * and ext4_split_extent() split will the first half again.
+		 */
+		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
+			goto fix_extent_len;
 
 		/* update the extent length and mark as initialized */
 		ex->ee_len = cpu_to_le16(ee_len);
@@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
 				       EXT4_EXT_MARK_UNWRIT2;
 		if (split_flag & EXT4_EXT_DATA_VALID2)
-			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
+			split_flag1 |= map->m_lblk > ee_block ?
+				       EXT4_EXT_DATA_PARTIAL_VALID1 :
+				       EXT4_EXT_DATA_ENTIRE_VALID1;
 		path = ext4_split_extent_at(handle, inode, path,
 				map->m_lblk + map->m_len, split_flag1, flags1);
 		if (IS_ERR(path))
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (2 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 11:50   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed Zhang Yi
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When allocating blocks during within-EOF DIO and writeback with
dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an
existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was
set when calling ext4_split_convert_extents(), which may potentially
result in stale data issues.

Assume we have an unwritten extent, and then DIO writes the second half.

   [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
   [UUUUUUUUUUUUUUUU] extent status tree
            |<-   ->| ----> dio write this range

First, ext4_iomap_alloc() call ext4_map_blocks() with
EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and
EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() find this extent and
call ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the
above flags set.

Then, ext4_split_convert_extents() calls ext4_split_extent() with
EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2
flags set, and it calls ext4_split_extent_at() to split the second half
with EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT
and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at()
failed to insert extent since a temporary lack -ENOSPC. It zeroes out
the first half but convert the entire on-disk extent to written since
the EXT4_EXT_DATA_VALID2 flag set, but left the second half as unwritten
in the extent status tree.

   [0000000000SSSSSS]  data                S: stale data, 0: zeroed
   [WWWWWWWWWWWWWWWW]  on-disk extent      W: written extent
   [WWWWWWWWWWUUUUUU]  extent status tree

Finally, if the DIO failed to write data to the disk, the stale data in
the second half will be exposed once the cached extent entry is gone.

Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting
an unwritten extent before submitting I/O, and make
ext4_split_convert_extents() to zero out the entire extent range
to zero for this case, and also mark the extent in the extent status
tree for consistency.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index cafe66cb562f..2db84f6b0056 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3733,11 +3733,15 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 	/* Convert to unwritten */
 	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
 		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
-	/* Convert to initialized */
-	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
+	/* Split the existing unwritten extent */
+	} else if (flags & (EXT4_GET_BLOCKS_UNWRIT_EXT |
+			    EXT4_GET_BLOCKS_CONVERT)) {
 		split_flag |= ee_block + ee_len <= eof_block ?
 			      EXT4_EXT_MAY_ZEROOUT : 0;
-		split_flag |= (EXT4_EXT_MARK_UNWRIT2 | EXT4_EXT_DATA_VALID2);
+		split_flag |= EXT4_EXT_MARK_UNWRIT2;
+		/* Convert to initialized */
+		if (flags & EXT4_GET_BLOCKS_CONVERT)
+			split_flag |= EXT4_EXT_DATA_VALID2;
 	}
 	flags |= EXT4_GET_BLOCKS_PRE_IO;
 	return ext4_split_extent(handle, inode, path, map, split_flag, flags,
@@ -3913,7 +3917,7 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
 		path = ext4_split_convert_extents(handle, inode, map, path,
-				flags | EXT4_GET_BLOCKS_CONVERT, allocated);
+						  flags, allocated);
 		if (IS_ERR(path))
 			return path;
 		/*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (3 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 11:56   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 06/13] ext4: don't cache extent during splitting extent Zhang Yi
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Before submitting I/O and allocating blocks with the
EXT4_GET_BLOCKS_PRE_IO flag set, ext4_split_convert_extents() may
convert the target extent range to initialized due to ENOSPC, ENOMEM, or
EQUOTA errors. However, it still marks the mapping as incorrectly
unwritten. Although this may not seem to cause any practical problems,
it will result in an unnecessary extent conversion operation after I/O
completion. Therefore, it's better to correct the returned mapping
status.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2db84f6b0056..19338f488550 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3916,6 +3916,8 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
+		int depth;
+
 		path = ext4_split_convert_extents(handle, inode, map, path,
 						  flags, allocated);
 		if (IS_ERR(path))
@@ -3931,7 +3933,13 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 			err = -EFSCORRUPTED;
 			goto errout;
 		}
-		map->m_flags |= EXT4_MAP_UNWRITTEN;
+		/* Don't mark unwritten if the extent has been zeroed out. */
+		path = ext4_find_extent(inode, map->m_lblk, path, flags);
+		if (IS_ERR(path))
+			return path;
+		depth = ext_depth(inode);
+		if (ext4_ext_is_unwritten(path[depth].p_ext))
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
 		goto out;
 	}
 	/* IO end_io complete, convert the filled extent to written */
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 06/13] ext4: don't cache extent during splitting extent
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (4 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 12:04   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 07/13] ext4: drop extent cache before " Zhang Yi
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Caching extents during the splitting process is risky, as it may result
in stale extents remaining in the status tree. Moreover, in most cases,
the corresponding extent block entries are likely already cached before
the split happens, making caching here not particularly useful.

Assume we have an unwritten extent, and then DIO writes the first half.

  [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
  [UUUUUUUUUUUUUUUU] extent status tree
  |<-   ->| ----> dio write this range

First, when ext4_split_extent_at() splits this extent, it truncates the
existing extent and then inserts a new one. During this process, this
extent status entry may be shrunk, and calls to ext4_find_extent() and
ext4_cache_extents() may occur, which could potentially insert the
truncated range as a hole into the extent status tree. After the split
is completed, this hole is not replaced with the correct status.

  [UUUUUUU|UUUUUUUU] on-disk extent        U: unwritten extent
  [UUUUUUU|HHHHHHHH] extent status tree    H: hole

Then, the outer calling functions will not correct this remaining hole
extent either. Finally, if we perform a delayed buffer write on this
latter part, it will re-insert the delayed extent and cause an error in
space accounting.

In adition, if the unwritten extent cache is not shrunk during the
splitting, ext4_cache_extents() also conflicts with existing extents
when caching extents. In the future, we will add checks when caching
extents, which will trigger a warning. Therefore, Do not cache extents
that are being split.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 19338f488550..2b5aec3f8882 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3199,6 +3199,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
 	       (split_flag & EXT4_EXT_DATA_VALID2));
 
+	/* Do not cache extents that are in the process of being modified. */
+	flags |= EXT4_EX_NOCACHE;
+
 	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
 
 	ext4_ext_show_leaf(inode, path);
@@ -3364,6 +3367,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 	ee_len = ext4_ext_get_actual_len(ex);
 	unwritten = ext4_ext_is_unwritten(ex);
 
+	/* Do not cache extents that are in the process of being modified. */
+	flags |= EXT4_EX_NOCACHE;
+
 	if (map->m_lblk + map->m_len < ee_block + ee_len) {
 		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
 		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 07/13] ext4: drop extent cache before splitting extent
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (5 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 06/13] ext4: don't cache extent during splitting extent Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-26 12:24   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When splitting an unwritten extent in the middle and converting it to
initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.

Assume we have an unwritten file and buffered write in the middle of it
without dioread_nolock enabled, it will allocate blocks as written
extent.

       0  A      B  N
       [UUUUUUUUUUUU] on-disk extent      U: unwritten extent
       [UUUUUUUUUUUU] extent status tree
       [--DDDDDDDD--]                     D: valid data
          |<-  ->| ----> this range needs to be initialized

ext4_split_extent() first try to split this extent at B with
EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
ext4_split_extent_at() failed to split this extent due to temporary lack
of space. It zeroout B to N and leave the entire extent as unwritten.

       0  A      B  N
       [UUUUUUUUUUUU] on-disk extent
       [UUUUUUUUUUUU] extent status tree
       [--DDDDDDDDZZ]                     Z: zeroed data

ext4_split_extent() then try to split this extent at A with
EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
leave
an written extent from A to N.

       0  A      B   N
       [UU|WWWWWWWWWW] on-disk extent     W: written extent
       [UU|UUUUUUUUUU] extent status tree
       [--|DDDDDDDDZZ]

Finally ext4_map_create_blocks() only insert extent A to B to the extent
status tree, and leave an stale unwritten extent in the status tree.

       0  A      B   N
       [UU|WWWWWWWWWW] on-disk extent     W: written extent
       [UU|WWWWWWWWUU] extent status tree
       [--|DDDDDDDDZZ]

Fix this issue by always remove cached extent status entry before
splitting extent.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2b5aec3f8882..9bb80af4b5cf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 	ee_len = ext4_ext_get_actual_len(ex);
 	unwritten = ext4_ext_is_unwritten(ex);
 
+	/*
+	 * Drop extent cache to prevent stale unwritten extents remaining
+	 * after zeroing out.
+	 */
+	ext4_es_remove_extent(inode, ee_block, ee_len);
+
 	/* Do not cache extents that are in the process of being modified. */
 	flags |= EXT4_EX_NOCACHE;
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent()
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (6 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 07/13] ext4: drop extent cache before " Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-27 12:43   ` Jan Kara
  2025-11-28  8:20   ` Ojaswin Mujoo
  2025-11-21  6:08 ` [PATCH v2 09/13] ext4: make __es_remove_extent() check extent status Zhang Yi
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The out tag in __es_remove_extent() is just return err value, we can
return it directly if something bad happens. Therefore, remove the
useless out tag and rename out_get_reserved to out.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e04fbf10fe4f..04d56f8f6c0c 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1434,7 +1434,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status orig_es;
 	ext4_lblk_t len1, len2;
 	ext4_fsblk_t block;
-	int err = 0;
+	int err;
 	bool count_reserved = true;
 	struct rsvd_count rc;
 
@@ -1443,9 +1443,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
-		goto out;
+		return 0;
 	if (es->es_lblk > end)
-		goto out;
+		return 0;
 
 	/* Simply invalidate cache_es. */
 	tree->cache_es = NULL;
@@ -1480,7 +1480,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
-				goto out;
+				return err;
 			}
 		} else {
 			es->es_lblk = end + 1;
@@ -1494,7 +1494,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 		if (count_reserved)
 			count_rsvd(inode, orig_es.es_lblk + len1,
 				   orig_es.es_len - len1 - len2, &orig_es, &rc);
-		goto out_get_reserved;
+		goto out;
 	}
 
 	if (len1 > 0) {
@@ -1536,11 +1536,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 		}
 	}
 
-out_get_reserved:
+out:
 	if (count_reserved)
 		*reserved = get_rsvd(inode, end, es, &rc);
-out:
-	return err;
+	return 0;
 }
 
 /*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 09/13] ext4: make __es_remove_extent() check extent status
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (7 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-21  6:08 ` [PATCH v2 10/13] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, __es_remove_extent() unconditionally removes extent status
entries within the specified range. In order to prepare for extending
the ext4_es_cache_extent() function to cache on-disk extents, which may
overwrite some existing short-length extents with the same status, allow
__es_remove_extent() to check the specified extent type before removing
it, and return error and pass out the conflicting extent if the status
does not match.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 49 +++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 04d56f8f6c0c..818007bb613f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -178,7 +178,8 @@ static struct kmem_cache *ext4_pending_cachep;
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
 			      struct extent_status *prealloc);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved,
+			      ext4_lblk_t end, unsigned int status,
+			      int *reserved, struct extent_status *res,
 			      struct extent_status *prealloc);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
@@ -242,6 +243,21 @@ static inline void ext4_es_inc_seq(struct inode *inode)
 	WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);
 }
 
+static inline int __es_check_extent_status(struct extent_status *es,
+					   unsigned int status,
+					   struct extent_status *res)
+{
+	if (ext4_es_type(es) & status)
+		return 0;
+
+	if (res) {
+		res->es_lblk = es->es_lblk;
+		res->es_len = es->es_len;
+		res->es_pblk = es->es_pblk;
+	}
+	return -EINVAL;
+}
+
 /*
  * search through the tree for an delayed extent with a given offset.  If
  * it can't be found, try to find next extent.
@@ -929,7 +945,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err1 = __es_remove_extent(inode, lblk, end, &resv_used, es1);
+	err1 = __es_remove_extent(inode, lblk, end, 0, &resv_used, NULL, es1);
 	if (err1 != 0)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
@@ -1409,23 +1425,27 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
 	return rc->ndelayed;
 }
 
-
 /*
  * __es_remove_extent - removes block range from extent status tree
  *
  * @inode - file containing range
  * @lblk - first block in range
  * @end - last block in range
+ * @status - the extent status to be checked
  * @reserved - number of cluster reservations released
+ * @res - return the extent if the status is not match
  * @prealloc - pre-allocated es to avoid memory allocation failures
  *
  * If @reserved is not NULL and delayed allocation is enabled, counts
  * block/cluster reservations freed by removing range and if bigalloc
- * enabled cancels pending reservations as needed. Returns 0 on success,
- * error code on failure.
+ * enabled cancels pending reservations as needed. If @status is not
+ * zero, check extent status type while removing extent, return -EINVAL
+ * and pass out the extent through @res if not match.  Returns 0 on
+ * success, error code on failure.
  */
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved,
+			      ext4_lblk_t end, unsigned int status,
+			      int *reserved, struct extent_status *res,
 			      struct extent_status *prealloc)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
@@ -1440,6 +1460,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	if (reserved == NULL || !test_opt(inode->i_sb, DELALLOC))
 		count_reserved = false;
+	if (status == 0)
+		status = ES_TYPE_MASK;
 
 	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
@@ -1447,6 +1469,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	if (es->es_lblk > end)
 		return 0;
 
+	err = __es_check_extent_status(es, status, res);
+	if (err)
+		return err;
+
 	/* Simply invalidate cache_es. */
 	tree->cache_es = NULL;
 	if (count_reserved)
@@ -1509,6 +1535,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	}
 
 	while (es && ext4_es_end(es) <= end) {
+		err = __es_check_extent_status(es, status, res);
+		if (err)
+			return err;
 		if (count_reserved)
 			count_rsvd(inode, es->es_lblk, es->es_len, es, &rc);
 		node = rb_next(&es->rb_node);
@@ -1524,6 +1553,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	if (es && es->es_lblk < end + 1) {
 		ext4_lblk_t orig_len = es->es_len;
 
+		err = __es_check_extent_status(es, status, res);
+		if (err)
+			return err;
+
 		len1 = ext4_es_end(es) - end;
 		if (count_reserved)
 			count_rsvd(inode, es->es_lblk, orig_len - len1,
@@ -1581,7 +1614,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * is reclaimed.
 	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, &reserved, es);
+	err = __es_remove_extent(inode, lblk, end, 0, &reserved, NULL, es);
 	if (err)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
@@ -2173,7 +2206,7 @@ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
 	}
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
+	err1 = __es_remove_extent(inode, lblk, end, 0, NULL, NULL, es1);
 	if (err1 != 0)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 10/13] ext4: make ext4_es_cache_extent() support overwrite existing extents
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (8 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 09/13] ext4: make __es_remove_extent() check extent status Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-21  6:08 ` [PATCH v2 11/13] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, ext4_es_cache_extent() is used to load extents into the
extent status tree when reading on-disk extent blocks. But it inserts
information into the extent status tree if and only if there isn't
information about the specified range already. So it only used for the
initial loading and does not support overwrit extents.

However, there are many other places in ext4 where on-disk extents are
inserted into the extent status tree, such as in ext4_map_query_blocks().
Currently, they call ext4_es_insert_extent() to perform the insertion,
but they don't modify the extents, so ext4_es_cache_extent() would be a
more appropriate choice. However, when ext4_map_query_blocks() inserts
an extent, it may overwrite a short existing extent of the same type.
Therefore, to prepare for the replacements, we need to extend
ext4_es_cache_extent() to allow it to overwrite existing extents with
the same status. So it checks the found extents before removing and
inserting. (There is one exception, a hole in the on-disk extent but a
delayed extent in the extent status tree is allowed.)

In addition, since cached extents can be more lenient than the extents
they modify and do not involve modifying reserved blocks, it is not
necessary to ensure that the insertion operation succeeds as strictly as
in the ext4_es_insert_extent() function.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 47 ++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 818007bb613f..2643d7a31e7b 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1014,17 +1014,23 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 }
 
 /*
- * ext4_es_cache_extent() inserts information into the extent status
- * tree if and only if there isn't information about the range in
- * question already.
+ * ext4_es_cache_extent() inserts information into the extent status tree
+ * only if there is no existing information about the specified range or
+ * if the existing extents have the same status.
+ *
+ * Note that this interface is only used for caching on-disk extent
+ * information and cannot be used to convert existing extents in the extent
+ * status tree. To convert existing extents, use ext4_es_insert_extent()
+ * instead.
  */
 void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len, ext4_fsblk_t pblk,
 			  unsigned int status)
 {
 	struct extent_status *es;
-	struct extent_status newes;
+	struct extent_status chkes, newes;
 	ext4_lblk_t end = lblk + len - 1;
+	bool conflict = false;
 
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
 		return;
@@ -1040,11 +1046,38 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 	BUG_ON(end < lblk);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-
 	es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
-	if (!es || es->es_lblk > end)
-		__es_insert_extent(inode, &newes, NULL);
+	if (es && es->es_lblk <= end) {
+		/* Found an extent that covers the entire range. */
+		if (es->es_lblk <= lblk && es->es_lblk + es->es_len > end) {
+			if (__es_check_extent_status(es, status, &chkes))
+				conflict = true;
+			goto unlock;
+		}
+		/* Check and remove all extents in range. */
+		if (__es_remove_extent(inode, lblk, end, status, NULL,
+				       &chkes, NULL)) {
+			conflict = true;
+			goto unlock;
+		}
+	}
+	__es_insert_extent(inode, &newes, NULL);
+unlock:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
+	if (!conflict)
+		return;
+	/*
+	 * A hole in the on-disk extent but a delayed extent in the extent
+	 * status tree, is allowed.
+	 */
+	if (status == EXTENT_STATUS_HOLE &&
+	    ext4_es_type(&chkes) == EXTENT_STATUS_DELAYED)
+		return;
+
+	ext4_warning_inode(inode,
+			   "ES cache extent failed: add [%d,%d,%llu,0x%x] conflict with existing [%d,%d,%llu,0x%x]\n",
+			   lblk, len, pblk, status, chkes.es_lblk, chkes.es_len,
+			   ext4_es_pblock(&chkes), ext4_es_status(&chkes));
 }
 
 /*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 11/13] ext4: adjust the debug info in ext4_es_cache_extent()
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (9 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 10/13] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-21  6:08 ` [PATCH v2 12/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Print a trace point after successfully inserting an extent in the
ext4_es_cache_extent() function. Additionally, similar to other extent
cache operation functions, call ext4_print_pending_tree() to display the
extent debug information of the inode when in ES_DEBUG mode.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 2643d7a31e7b..e370240555ec 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1038,7 +1038,6 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, pblk, status);
-	trace_ext4_es_cache_extent(inode, &newes);
 
 	if (!len)
 		return;
@@ -1062,6 +1061,8 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 		}
 	}
 	__es_insert_extent(inode, &newes, NULL);
+	trace_ext4_es_cache_extent(inode, &newes);
+	ext4_es_print_tree(inode);
 unlock:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	if (!conflict)
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 12/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (10 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 11/13] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-21  6:08 ` [PATCH v2 13/13] ext4: drop the TODO comment in ext4_es_insert_extent() Zhang Yi
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

In ext4, the remaining places for inserting extents into the extent
status tree within ext4_ext_determine_insert_hole() and
ext4_map_query_blocks() directly cache on-disk extents. We can use
ext4_es_cache_extent() instead of ext4_es_insert_extent() in these
cases. This will help reduce unnecessary increases in extent sequence
numbers and cache invalidations after supporting IOMAP in the future.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c |  3 +--
 fs/ext4/inode.c   | 18 +++++++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9bb80af4b5cf..f61a1b57974e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4184,8 +4184,7 @@ static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
 insert_hole:
 	/* Put just found gap into cache to speed up subsequent requests */
 	ext_debug(inode, " -> %u:%u\n", hole_start, len);
-	ext4_es_insert_extent(inode, hole_start, len, ~0,
-			      EXTENT_STATUS_HOLE, false);
+	ext4_es_cache_extent(inode, hole_start, len, ~0, EXTENT_STATUS_HOLE);
 
 	/* Update hole_len to reflect hole size after lblk */
 	if (hole_start != lblk)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5e588accc35a..6ac7602afc39 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -503,8 +503,8 @@ static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
 	retval = ext4_ext_map_blocks(handle, inode, &map2, 0);
 
 	if (retval <= 0) {
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status, false);
+		ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+				     map->m_pblk, status);
 		return map->m_len;
 	}
 
@@ -525,13 +525,13 @@ static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
 	 */
 	if (map->m_pblk + map->m_len == map2.m_pblk &&
 			status == status2) {
-		ext4_es_insert_extent(inode, map->m_lblk,
-				      map->m_len + map2.m_len, map->m_pblk,
-				      status, false);
+		ext4_es_cache_extent(inode, map->m_lblk,
+				     map->m_len + map2.m_len, map->m_pblk,
+				     status);
 		map->m_len += map2.m_len;
 	} else {
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status, false);
+		ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+				     map->m_pblk, status);
 	}
 
 	return map->m_len;
@@ -573,8 +573,8 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
 			map->m_len == orig_mlen) {
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status, false);
+		ext4_es_cache_extent(inode, map->m_lblk, map->m_len,
+				     map->m_pblk, status);
 	} else {
 		retval = ext4_map_query_blocks_next_in_leaf(handle, inode, map,
 							    orig_mlen);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 13/13] ext4: drop the TODO comment in ext4_es_insert_extent()
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (11 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 12/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
@ 2025-11-21  6:08 ` Zhang Yi
  2025-11-23 10:55 ` [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Ojaswin Mujoo
  2025-11-28 16:49 ` Theodore Tso
  14 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-21  6:08 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, yi.zhang, yizhang089, libaokun1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Now we have ext4_es_cache_extent() to cache on-disk extents instead of
ext4_es_insert_extent(), so drop the TODO comment.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents_status.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e370240555ec..b681bd0c3dc0 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -898,7 +898,8 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
 
 /*
  * ext4_es_insert_extent() adds information to an inode's extent
- * status tree.
+ * status tree. This interface is used for modifying extents. To cache
+ * on-disk extents, use ext4_es_cache_extent() instead.
  */
 void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			   ext4_lblk_t len, ext4_fsblk_t pblk,
@@ -977,10 +978,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		}
 		pending = err3;
 	}
-	/*
-	 * TODO: For cache on-disk extents, there is no need to increment
-	 * the sequence counter, this requires future optimization.
-	 */
 	ext4_es_inc_seq(inode);
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (12 preceding siblings ...)
  2025-11-21  6:08 ` [PATCH v2 13/13] ext4: drop the TODO comment in ext4_es_insert_extent() Zhang Yi
@ 2025-11-23 10:55 ` Ojaswin Mujoo
  2025-11-24  5:04   ` Zhang Yi
  2025-11-28 16:49 ` Theodore Tso
  14 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-23 10:55 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> Changes since v1:
>  - Rebase the codes based on the latest linux-next 20251120.
>  - Add patches 01-05, fix two stale data problems caused by

Hi Zhang, thanks for the patches.

I've always felt uncomfortable with the ZEROOUT code here because it
seems to have many such bugs as you pointed out in the series. Its very
fragile and the bugs are easy to miss behind all the data valid and
split flags mess. 

As per my understanding, ZEROOUT logic seems to be a special best-effort
try to make the split/convert operation "work" when dealing with
transient errors like ENOSPC etc. I was just wondering if it makes sense
to just get rid of the whole ZEROOUT logic completely and just reset the
extent to orig state if there is any error. This allows us to get rid of
DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
slightly less messy.

Maybe we can have a retry loop at the top level caller if we want to try
again for say ENOSPC or ENOMEM. 

Would love to hear your thoughts on it.

Thanks,
Ojaswin

>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>  - Add patches 06-07, fix two stale extent status entries problems also
>    caused by splitting extent.
>  - Modify patches 08-10, extend __es_remove_extent() and
>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>    the same status when caching on-disk extents, while also checking
>    extents of different stauts and raising alarms to prevent misuse.
>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>    remove the TODO comment in it.
> 
> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> 
> Original Description
> 
> This series addresses the optimization that Jan pointed out [1]
> regarding the introduction of a sequence number to
> ext4_es_insert_extent(). The proposal is to replace all instances where
> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> instead of ext4_es_insert_extent(). This change can prevent excessive
> cache invalidations caused by unnecessarily increasing the extent
> sequence number when reading from the on-disk extent tree.
> 
> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> 
> Cheers,
> Yi.
> 
> Zhang Yi (13):
>   ext4: cleanup zeroout in ext4_split_extent_at()
>   ext4: subdivide EXT4_EXT_DATA_VALID1
>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>     submitting I/O
>   ext4: correct the mapping status if the extent has been zeroed
>   ext4: don't cache extent during splitting extent
>   ext4: drop extent cache before splitting extent
>   ext4: cleanup useless out tag in __es_remove_extent()
>   ext4: make __es_remove_extent() check extent status
>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>   ext4: adjust the debug info in ext4_es_cache_extent()
>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>   ext4: drop the TODO comment in ext4_es_insert_extent()
> 
>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>  fs/ext4/inode.c          |  18 +++---
>  3 files changed, 176 insertions(+), 90 deletions(-)
> 
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-23 10:55 ` [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Ojaswin Mujoo
@ 2025-11-24  5:04   ` Zhang Yi
  2025-11-24 12:50     ` Ojaswin Mujoo
  2025-11-27 12:24     ` Jan Kara
  0 siblings, 2 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-24  5:04 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

Hi, Ojaswin!

On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>> Changes since v1:
>>  - Rebase the codes based on the latest linux-next 20251120.
>>  - Add patches 01-05, fix two stale data problems caused by
> 
> Hi Zhang, thanks for the patches.
> 

Thank you for take time to look at this series.

> I've always felt uncomfortable with the ZEROOUT code here because it
> seems to have many such bugs as you pointed out in the series. Its very
> fragile and the bugs are easy to miss behind all the data valid and
> split flags mess. 
> 

Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
significantly increased the complexity of split extents and the
potential for bugs.

> As per my understanding, ZEROOUT logic seems to be a special best-effort
> try to make the split/convert operation "work" when dealing with
> transient errors like ENOSPC etc. I was just wondering if it makes sense
> to just get rid of the whole ZEROOUT logic completely and just reset the
> extent to orig state if there is any error. This allows us to get rid of
> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> slightly less messy.
> 
> Maybe we can have a retry loop at the top level caller if we want to try
> again for say ENOSPC or ENOMEM. 
> 
> Would love to hear your thoughts on it.
> 

I think this is a direction worth exploring. However, what I am
currently considering is that we need to address this scenario of
splitting extent during the I/O completion. Although the ZEROOUT logic
is fragile and has many issues recently, it currently serves as a
fallback solution for handling ENOSPC errors that arise when splitting
extents during I/O completion. It ensures that I/O operations do not
fail due to insufficient extent blocks.

Please see ext4_convert_unwritten_extents_endio(). Although we have made
our best effort to tried to split extents using
EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
covered all scenarios. Moreover, after converting the buffered I/O path
to the iomap infrastructure in the future, we may need to split extents
during the I/O completion worker[1].

In most block allocation processes, we already have a retry loop to deal
with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
doesn't seem appropriate to place this logic into the I/O completion
handling process (I haven't thought this solution through deeply yet,
but I'm afraid it could introduce potential deadlock risks due to its
involvement with journal operations), and we can't just simply try again.
If we remove the ZEROOUT logic, we may lose our last line of defense
during the I/O completion.

Currently, I am considering whether it is possible to completely remove
EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
submitting I/Os; instead, all splitting would be performed when
converting extents to written after the I/O completes. Based on my patch,
"ext4: use reserved metadata blocks when splitting extent on endio"[2],
and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
shows no regressions.

So I think the ZEROOUT logic remains somewhat useful until we find better
solution(e.g., making more precise reservations for metadata). Perhaps we
can refactor both the split extent and ZEROOUT logic to make them more
concise.

[1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/

Cheers,
Yi.

> Thanks,
> Ojaswin
> 
>>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>>  - Add patches 06-07, fix two stale extent status entries problems also
>>    caused by splitting extent.
>>  - Modify patches 08-10, extend __es_remove_extent() and
>>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>>    the same status when caching on-disk extents, while also checking
>>    extents of different stauts and raising alarms to prevent misuse.
>>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>>    remove the TODO comment in it.
>>
>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>
>> Original Description
>>
>> This series addresses the optimization that Jan pointed out [1]
>> regarding the introduction of a sequence number to
>> ext4_es_insert_extent(). The proposal is to replace all instances where
>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>> instead of ext4_es_insert_extent(). This change can prevent excessive
>> cache invalidations caused by unnecessarily increasing the extent
>> sequence number when reading from the on-disk extent tree.
>>
>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>
>> Cheers,
>> Yi.
>>
>> Zhang Yi (13):
>>   ext4: cleanup zeroout in ext4_split_extent_at()
>>   ext4: subdivide EXT4_EXT_DATA_VALID1
>>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>>     submitting I/O
>>   ext4: correct the mapping status if the extent has been zeroed
>>   ext4: don't cache extent during splitting extent
>>   ext4: drop extent cache before splitting extent
>>   ext4: cleanup useless out tag in __es_remove_extent()
>>   ext4: make __es_remove_extent() check extent status
>>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>>   ext4: adjust the debug info in ext4_es_cache_extent()
>>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>>   ext4: drop the TODO comment in ext4_es_insert_extent()
>>
>>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>>  fs/ext4/inode.c          |  18 +++---
>>  3 files changed, 176 insertions(+), 90 deletions(-)
>>
>> -- 
>> 2.46.1
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-24  5:04   ` Zhang Yi
@ 2025-11-24 12:50     ` Ojaswin Mujoo
  2025-11-24 14:05       ` Zhang Yi
  2025-11-27 12:24     ` Jan Kara
  1 sibling, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-24 12:50 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >>  - Rebase the codes based on the latest linux-next 20251120.
> >>  - Add patches 01-05, fix two stale data problems caused by
> > 
> > Hi Zhang, thanks for the patches.
> > 
> 
> Thank you for take time to look at this series.
> 
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess. 
> > 
> 
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.
> 
> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> > slightly less messy.
> > 
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM. 
> > 
> > Would love to hear your thoughts on it.
> > 
> 
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.
> 
> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].
> 
> In most block allocation processes, we already have a retry loop to deal
> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
> doesn't seem appropriate to place this logic into the I/O completion
> handling process (I haven't thought this solution through deeply yet,
> but I'm afraid it could introduce potential deadlock risks due to its
> involvement with journal operations), and we can't just simply try again.
> If we remove the ZEROOUT logic, we may lose our last line of defense
> during the I/O completion.
> 
> Currently, I am considering whether it is possible to completely remove
> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
> submitting I/Os; instead, all splitting would be performed when
> converting extents to written after the I/O completes. Based on my patch,
> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
> shows no regressions.
> 
> So I think the ZEROOUT logic remains somewhat useful until we find better
> solution(e.g., making more precise reservations for metadata). Perhaps we
> can refactor both the split extent and ZEROOUT logic to make them more
> concise.

Hi Yi,

Okay it makese sense to keep the zeroout if iomap path is planning to
shift the extent splitting to endio. Plus, I agree based on the comments
in the ext4_convert_unwritte_extents_endio() that we might even today
need to split in endio (although i cant recall when this happens) which
would need a zeroout fallback.

And yes, I think refactoring the whole logic to be less confusing would
be better. I had an older unposted untested patch cleaning up some of
this, I was looking at it again today and there seems to be a lot of
cleanups we can do here but that becomes out of scope of this patchset I
believe.

Sure then, lets keep it as it is for now. I'll review the changes you
made and later I can post a patch refactoring this area.

Regards,
ojaswin

> 
> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
> 
> Cheers,
> Yi.
> 
> > Thanks,
> > Ojaswin
> > 
> >>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
> >>  - Add patches 06-07, fix two stale extent status entries problems also
> >>    caused by splitting extent.
> >>  - Modify patches 08-10, extend __es_remove_extent() and
> >>    ext4_es_cache_extent() to allow them to overwrite existing extents of
> >>    the same status when caching on-disk extents, while also checking
> >>    extents of different stauts and raising alarms to prevent misuse.
> >>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
> >>    remove the TODO comment in it.
> >>
> >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> >>
> >> Original Description
> >>
> >> This series addresses the optimization that Jan pointed out [1]
> >> regarding the introduction of a sequence number to
> >> ext4_es_insert_extent(). The proposal is to replace all instances where
> >> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> >> instead of ext4_es_insert_extent(). This change can prevent excessive
> >> cache invalidations caused by unnecessarily increasing the extent
> >> sequence number when reading from the on-disk extent tree.
> >>
> >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> >>
> >> Cheers,
> >> Yi.
> >>
> >> Zhang Yi (13):
> >>   ext4: cleanup zeroout in ext4_split_extent_at()
> >>   ext4: subdivide EXT4_EXT_DATA_VALID1
> >>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
> >>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
> >>     submitting I/O
> >>   ext4: correct the mapping status if the extent has been zeroed
> >>   ext4: don't cache extent during splitting extent
> >>   ext4: drop extent cache before splitting extent
> >>   ext4: cleanup useless out tag in __es_remove_extent()
> >>   ext4: make __es_remove_extent() check extent status
> >>   ext4: make ext4_es_cache_extent() support overwrite existing extents
> >>   ext4: adjust the debug info in ext4_es_cache_extent()
> >>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
> >>   ext4: drop the TODO comment in ext4_es_insert_extent()
> >>
> >>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
> >>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
> >>  fs/ext4/inode.c          |  18 +++---
> >>  3 files changed, 176 insertions(+), 90 deletions(-)
> >>
> >> -- 
> >> 2.46.1
> >>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-24 12:50     ` Ojaswin Mujoo
@ 2025-11-24 14:05       ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-24 14:05 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/24/2025 8:50 PM, Ojaswin Mujoo wrote:
> On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>>>> Changes since v1:
>>>>  - Rebase the codes based on the latest linux-next 20251120.
>>>>  - Add patches 01-05, fix two stale data problems caused by
>>>
>>> Hi Zhang, thanks for the patches.
>>>
>>
>> Thank you for take time to look at this series.
>>
>>> I've always felt uncomfortable with the ZEROOUT code here because it
>>> seems to have many such bugs as you pointed out in the series. Its very
>>> fragile and the bugs are easy to miss behind all the data valid and
>>> split flags mess. 
>>>
>>
>> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
>> significantly increased the complexity of split extents and the
>> potential for bugs.
>>
>>> As per my understanding, ZEROOUT logic seems to be a special best-effort
>>> try to make the split/convert operation "work" when dealing with
>>> transient errors like ENOSPC etc. I was just wondering if it makes sense
>>> to just get rid of the whole ZEROOUT logic completely and just reset the
>>> extent to orig state if there is any error. This allows us to get rid of
>>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
>>> slightly less messy.
>>>
>>> Maybe we can have a retry loop at the top level caller if we want to try
>>> again for say ENOSPC or ENOMEM. 
>>>
>>> Would love to hear your thoughts on it.
>>>
>>
>> I think this is a direction worth exploring. However, what I am
>> currently considering is that we need to address this scenario of
>> splitting extent during the I/O completion. Although the ZEROOUT logic
>> is fragile and has many issues recently, it currently serves as a
>> fallback solution for handling ENOSPC errors that arise when splitting
>> extents during I/O completion. It ensures that I/O operations do not
>> fail due to insufficient extent blocks.
>>
>> Please see ext4_convert_unwritten_extents_endio(). Although we have made
>> our best effort to tried to split extents using
>> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
>> covered all scenarios. Moreover, after converting the buffered I/O path
>> to the iomap infrastructure in the future, we may need to split extents
>> during the I/O completion worker[1].
>>
>> In most block allocation processes, we already have a retry loop to deal
>> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
>> doesn't seem appropriate to place this logic into the I/O completion
>> handling process (I haven't thought this solution through deeply yet,
>> but I'm afraid it could introduce potential deadlock risks due to its
>> involvement with journal operations), and we can't just simply try again.
>> If we remove the ZEROOUT logic, we may lose our last line of defense
>> during the I/O completion.
>>
>> Currently, I am considering whether it is possible to completely remove
>> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
>> submitting I/Os; instead, all splitting would be performed when
>> converting extents to written after the I/O completes. Based on my patch,
>> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
>> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
>> shows no regressions.
>>
>> So I think the ZEROOUT logic remains somewhat useful until we find better
>> solution(e.g., making more precise reservations for metadata). Perhaps we
>> can refactor both the split extent and ZEROOUT logic to make them more
>> concise.
> 
> Hi Yi,
> 
> Okay it makese sense to keep the zeroout if iomap path is planning to
> shift the extent splitting to endio. Plus, I agree based on the comments
> in the ext4_convert_unwritte_extents_endio() that we might even today
> need to split in endio (although i cant recall when this happens) which
> would need a zeroout fallback.

A relatively common case is the concurrency between folio write-back and
fallocate. After an unwritten extent is allocated during the writeback
process, if fallocate is performed before the I/O completes, the unwritten
extent created by fallocate will merge with the writeback portion.
Consequently, a split operation is required once the I/O completes.

> 
> And yes, I think refactoring the whole logic to be less confusing would
> be better. I had an older unposted untested patch cleaning up some of
> this, I was looking at it again today and there seems to be a lot of
> cleanups we can do here but that becomes out of scope of this patchset I
> believe.
> 
> Sure then, lets keep it as it is for now. I'll review the changes you
> made and later I can post a patch refactoring this area.

OK, thank you a lot for your review and look forward to your patch.

Thanks,
Yi.

> 
> Regards,
> ojaswin
> 
>>
>> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
>> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
>>
>> Cheers,
>> Yi.
>>
>>> Thanks,
>>> Ojaswin
>>>
>>>>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>>>>  - Add patches 06-07, fix two stale extent status entries problems also
>>>>    caused by splitting extent.
>>>>  - Modify patches 08-10, extend __es_remove_extent() and
>>>>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>>>>    the same status when caching on-disk extents, while also checking
>>>>    extents of different stauts and raising alarms to prevent misuse.
>>>>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>>>>    remove the TODO comment in it.
>>>>
>>>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>>>
>>>> Original Description
>>>>
>>>> This series addresses the optimization that Jan pointed out [1]
>>>> regarding the introduction of a sequence number to
>>>> ext4_es_insert_extent(). The proposal is to replace all instances where
>>>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>>>> instead of ext4_es_insert_extent(). This change can prevent excessive
>>>> cache invalidations caused by unnecessarily increasing the extent
>>>> sequence number when reading from the on-disk extent tree.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>> Zhang Yi (13):
>>>>   ext4: cleanup zeroout in ext4_split_extent_at()
>>>>   ext4: subdivide EXT4_EXT_DATA_VALID1
>>>>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>>>>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>>>>     submitting I/O
>>>>   ext4: correct the mapping status if the extent has been zeroed
>>>>   ext4: don't cache extent during splitting extent
>>>>   ext4: drop extent cache before splitting extent
>>>>   ext4: cleanup useless out tag in __es_remove_extent()
>>>>   ext4: make __es_remove_extent() check extent status
>>>>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>>>>   ext4: adjust the debug info in ext4_es_cache_extent()
>>>>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>>>>   ext4: drop the TODO comment in ext4_es_insert_extent()
>>>>
>>>>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>>>>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>>>>  fs/ext4/inode.c          |  18 +++---
>>>>  3 files changed, 176 insertions(+), 90 deletions(-)
>>>>
>>>> -- 
>>>> 2.46.1
>>>>
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at()
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
@ 2025-11-26 11:26   ` Ojaswin Mujoo
  2025-11-27 12:02   ` Jan Kara
  2025-11-28  1:54   ` Baokun Li
  2 siblings, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:07:59PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> zero_ex is a temporary variable used only for writing zeros and
> inserting extent status entry, it will not be directly inserted into the
> tree. Therefore, it can be assigned values from the target extent in
> various scenarios, eliminating the need to explicitly assign values to
> each variable individually.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/extents.c | 63 ++++++++++++++++++-----------------------------
>  1 file changed, 24 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c7d219e6c6d8..91682966597d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3278,46 +3278,31 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  	ex = path[depth].p_ext;
>  
>  	if (EXT4_EXT_MAY_ZEROOUT & split_flag) {
> -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> -			if (split_flag & EXT4_EXT_DATA_VALID1) {
> -				err = ext4_ext_zeroout(inode, ex2);
> -				zero_ex.ee_block = ex2->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex2));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex2));
> -			} else {
> -				err = ext4_ext_zeroout(inode, ex);
> -				zero_ex.ee_block = ex->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex));
> -			}
> -		} else {
> -			err = ext4_ext_zeroout(inode, &orig_ex);
> -			zero_ex.ee_block = orig_ex.ee_block;
> -			zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(&orig_ex));
> -			ext4_ext_store_pblock(&zero_ex,
> -					      ext4_ext_pblock(&orig_ex));
> -		}
> +		if (split_flag & EXT4_EXT_DATA_VALID1)
> +			memcpy(&zero_ex, ex2, sizeof(zero_ex));
> +		else if (split_flag & EXT4_EXT_DATA_VALID2)
> +			memcpy(&zero_ex, ex, sizeof(zero_ex));
> +		else
> +			memcpy(&zero_ex, &orig_ex, sizeof(zero_ex));
>  
> -		if (!err) {
> -			/* update the extent length and mark as initialized */
> -			ex->ee_len = cpu_to_le16(ee_len);
> -			ext4_ext_try_to_merge(handle, inode, path, ex);
> -			err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> -			if (!err)
> -				/* update extent status tree */
> -				ext4_zeroout_es(inode, &zero_ex);
> -			/* If we failed at this point, we don't know in which
> -			 * state the extent tree exactly is so don't try to fix
> -			 * length of the original extent as it may do even more
> -			 * damage.
> -			 */
> -			goto out;
> -		}
> +		err = ext4_ext_zeroout(inode, &zero_ex);
> +		if (err)
> +			goto fix_extent_len;
> +
> +		/* update the extent length and mark as initialized */
> +		ex->ee_len = cpu_to_le16(ee_len);
> +		ext4_ext_try_to_merge(handle, inode, path, ex);
> +		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +		if (!err)
> +			/* update extent status tree */
> +			ext4_zeroout_es(inode, &zero_ex);
> +		/*
> +		 * If we failed at this point, we don't know in which
> +		 * state the extent tree exactly is so don't try to fix
> +		 * length of the original extent as it may do even more
> +		 * damage.
> +		 */
> +		goto out;
>  	}
>  
>  fix_extent_len:
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1
  2025-11-21  6:08 ` [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1 Zhang Yi
@ 2025-11-26 11:27   ` Ojaswin Mujoo
  2025-11-26 11:55     ` Ojaswin Mujoo
  0 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:27 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:00PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and
> it is necessary to split the target extent in the middle,
> ext4_split_extent() first handles splitting the latter half of the
> extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that
> all blocks before the split point contain valid data; however, this
> assumption is incorrect.
> 
> Therefore, subdivid EXT4_EXT_DATA_VALID1 into
> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which
> indicate that the first half of the extent is either entirely valid or
> only partially valid, respectively. These two flags cannot be set
> simultaneously.
> 
> This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces
> EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location
> where it is set, no logical changes.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> ---
>  fs/ext4/extents.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 91682966597d..f7aa497e5d6c 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -43,8 +43,13 @@
>  #define EXT4_EXT_MARK_UNWRIT1	0x2  /* mark first half unwritten */
>  #define EXT4_EXT_MARK_UNWRIT2	0x4  /* mark second half unwritten */
>  
> -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> +/* first half contains valid data */
> +#define EXT4_EXT_DATA_ENTIRE_VALID1	0x8   /* has partially valid data */
> +#define EXT4_EXT_DATA_PARTIAL_VALID1	0x10  /* has entirely valid data */
> +#define EXT4_EXT_DATA_VALID1		(EXT4_EXT_DATA_ENTIRE_VALID1 | \
> +					 EXT4_EXT_DATA_PARTIAL_VALID1)
> +
> +#define EXT4_EXT_DATA_VALID2	0x20 /* second half contains valid data */
>  
>  static __le32 ext4_extent_block_csum(struct inode *inode,
>  				     struct ext4_extent_header *eh)
> @@ -3190,8 +3195,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  	unsigned int ee_len, depth;
>  	int err = 0;
>  
> -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) == EXT4_EXT_DATA_VALID1);
> +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
> +	       (split_flag & EXT4_EXT_DATA_VALID2));
>  
>  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
>  
> @@ -3358,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>  				       EXT4_EXT_MARK_UNWRIT2;
>  		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_VALID1;
> +			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>  		path = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (IS_ERR(path))
> @@ -3717,7 +3723,7 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
>  
>  	/* Convert to unwritten */
>  	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
> -		split_flag |= EXT4_EXT_DATA_VALID1;
> +		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
>  	/* Convert to initialized */
>  	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
>  		split_flag |= ee_block + ee_len <= eof_block ?
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
@ 2025-11-26 11:29   ` Ojaswin Mujoo
  2025-11-27  6:13     ` Zhang Yi
  2025-11-27 13:41   ` Jan Kara
  1 sibling, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:01PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When allocating initialized blocks from a large unwritten extent, or
> when splitting an unwritten extent during end I/O and converting it to
> initialized, there is currently a potential issue of stale data if the
> extent needs to be split in the middle.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU]    U: unwritten extent
>        [--DDDDDDDD--]    D: valid data
>           |<-  ->| ----> this range needs to be initialized
> 
> ext4_split_extent() first try to split this extent at B with
> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> ext4_split_extent_at() failed to split this extent due to temporary lack
> of space. It zeroout B to N and mark the entire extent from 0 to N
> as written.
> 
>        0  A      B  N
>        [WWWWWWWWWWWW]    W: written extent
>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> 
> ext4_split_extent() then try to split this extent at A with
> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> a stale written extent from 0 to A.
> 
>        0  A      B   N
>        [WW|WWWWWWWWWW]
>        [SS|DDDDDDDDZZ]
> 
> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> when splitting at B, don't convert the entire extent to written and left
> it as unwritten after zeroing out B to N. The remaining work is just
> like the standard two-part split. ext4_split_extent() will pass the
> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> second time, allowing it to properly handle the split. If the split is
> successful, it will keep extent from 0 to A as unwritten.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Hi Yi,

This patch looks good to me. I'm just wondering since this is a stale
data exposure that might need a backport, should we add a Fixes: tag 
and also keep these fixes before the refactor in 1/13 so backport is
easier.

Other than that, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/extents.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f7aa497e5d6c..cafe66cb562f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  		err = ext4_ext_zeroout(inode, &zero_ex);
>  		if (err)
>  			goto fix_extent_len;
> +		/*
> +		 * The first half contains partially valid data, the splitting
> +		 * of this extent has not been completed, fix extent length
> +		 * and ext4_split_extent() split will the first half again.
> +		 */
> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> +			goto fix_extent_len;
>  
>  		/* update the extent length and mark as initialized */
>  		ex->ee_len = cpu_to_le16(ee_len);
> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>  				       EXT4_EXT_MARK_UNWRIT2;
>  		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> +			split_flag1 |= map->m_lblk > ee_block ?
> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>  		path = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (IS_ERR(path))
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O
  2025-11-21  6:08 ` [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O Zhang Yi
@ 2025-11-26 11:50   ` Ojaswin Mujoo
  0 siblings, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:50 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:02PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When allocating blocks during within-EOF DIO and writeback with
> dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an
> existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was
> set when calling ext4_split_convert_extents(), which may potentially
> result in stale data issues.
> 
> Assume we have an unwritten extent, and then DIO writes the second half.
> 
>    [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
>    [UUUUUUUUUUUUUUUU] extent status tree
>             |<-   ->| ----> dio write this range
> 
> First, ext4_iomap_alloc() call ext4_map_blocks() with
> EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and
> EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() find this extent and
> call ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the
> above flags set.
> 
> Then, ext4_split_convert_extents() calls ext4_split_extent() with
> EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2
> flags set, and it calls ext4_split_extent_at() to split the second half
> with EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT
> and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at()
> failed to insert extent since a temporary lack -ENOSPC. It zeroes out
> the first half but convert the entire on-disk extent to written since
> the EXT4_EXT_DATA_VALID2 flag set, but left the second half as unwritten
> in the extent status tree.
> 
>    [0000000000SSSSSS]  data                S: stale data, 0: zeroed
>    [WWWWWWWWWWWWWWWW]  on-disk extent      W: written extent
>    [WWWWWWWWWWUUUUUU]  extent status tree
> 
> Finally, if the DIO failed to write data to the disk, the stale data in
> the second half will be exposed once the cached extent entry is gone.
> 
> Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting
> an unwritten extent before submitting I/O, and make
> ext4_split_convert_extents() to zero out the entire extent range
> to zero for this case, and also mark the extent in the extent status
> tree for consistency.

Hi Yi,

Your analysis makes sense to me and I agree that passing CONVERT flag
there might not have been correct since we are not neccessarily
converting the extent to initialized.

Other than that, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/extents.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index cafe66cb562f..2db84f6b0056 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3733,11 +3733,15 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
>  	/* Convert to unwritten */
>  	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
>  		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
> -	/* Convert to initialized */
> -	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
> +	/* Split the existing unwritten extent */
> +	} else if (flags & (EXT4_GET_BLOCKS_UNWRIT_EXT |
> +			    EXT4_GET_BLOCKS_CONVERT)) {
>  		split_flag |= ee_block + ee_len <= eof_block ?
>  			      EXT4_EXT_MAY_ZEROOUT : 0;
> -		split_flag |= (EXT4_EXT_MARK_UNWRIT2 | EXT4_EXT_DATA_VALID2);
> +		split_flag |= EXT4_EXT_MARK_UNWRIT2;
> +		/* Convert to initialized */
> +		if (flags & EXT4_GET_BLOCKS_CONVERT)
> +			split_flag |= EXT4_EXT_DATA_VALID2;
>  	}
>  	flags |= EXT4_GET_BLOCKS_PRE_IO;
>  	return ext4_split_extent(handle, inode, path, map, split_flag, flags,
> @@ -3913,7 +3917,7 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
>  	/* get_block() before submitting IO, split the extent */
>  	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
>  		path = ext4_split_convert_extents(handle, inode, map, path,
> -				flags | EXT4_GET_BLOCKS_CONVERT, allocated);
> +						  flags, allocated);
>  		if (IS_ERR(path))
>  			return path;
>  		/*
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1
  2025-11-26 11:27   ` Ojaswin Mujoo
@ 2025-11-26 11:55     ` Ojaswin Mujoo
  2025-11-27  6:09       ` Zhang Yi
  0 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:55 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Wed, Nov 26, 2025 at 04:57:27PM +0530, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:08:00PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and
> > it is necessary to split the target extent in the middle,
> > ext4_split_extent() first handles splitting the latter half of the
> > extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that
> > all blocks before the split point contain valid data; however, this
> > assumption is incorrect.
> > 
> > Therefore, subdivid EXT4_EXT_DATA_VALID1 into
> > EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which
> > indicate that the first half of the extent is either entirely valid or
> > only partially valid, respectively. These two flags cannot be set
> > simultaneously.
> > 
> > This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces
> > EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location
> > where it is set, no logical changes.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Looks good, feel free to add:
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> > ---
> >  fs/ext4/extents.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 91682966597d..f7aa497e5d6c 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -43,8 +43,13 @@
> >  #define EXT4_EXT_MARK_UNWRIT1	0x2  /* mark first half unwritten */
> >  #define EXT4_EXT_MARK_UNWRIT2	0x4  /* mark second half unwritten */
> >  
> > -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
> > -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
> > +/* first half contains valid data */
> > +#define EXT4_EXT_DATA_ENTIRE_VALID1	0x8   /* has partially valid data */
> > +#define EXT4_EXT_DATA_PARTIAL_VALID1	0x10  /* has entirely valid data */

Hey, sorry I forgot to mention this minor typo in my last email. The
comment for partial and entirely valid flags are mismatched :)

Regards,
ojaswin

> > +#define EXT4_EXT_DATA_VALID1		(EXT4_EXT_DATA_ENTIRE_VALID1 | \
> > +					 EXT4_EXT_DATA_PARTIAL_VALID1)
> > +
> > +#define EXT4_EXT_DATA_VALID2	0x20 /* second half contains valid data */
> >  
> >  static __le32 ext4_extent_block_csum(struct inode *inode,
> >  				     struct ext4_extent_header *eh)
> > @@ -3190,8 +3195,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
> >  	unsigned int ee_len, depth;
> >  	int err = 0;
> >  
> > -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
> > -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
> > +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) == EXT4_EXT_DATA_VALID1);
> > +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
> > +	       (split_flag & EXT4_EXT_DATA_VALID2));
> >  
> >  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
> >  
> > @@ -3358,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
> >  				       EXT4_EXT_MARK_UNWRIT2;
> >  		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_VALID1;
> > +			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> >  		path = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (IS_ERR(path))
> > @@ -3717,7 +3723,7 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
> >  
> >  	/* Convert to unwritten */
> >  	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
> > -		split_flag |= EXT4_EXT_DATA_VALID1;
> > +		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
> >  	/* Convert to initialized */
> >  	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
> >  		split_flag |= ee_block + ee_len <= eof_block ?
> > -- 
> > 2.46.1
> > 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed
  2025-11-21  6:08 ` [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed Zhang Yi
@ 2025-11-26 11:56   ` Ojaswin Mujoo
  0 siblings, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 11:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:03PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Before submitting I/O and allocating blocks with the
> EXT4_GET_BLOCKS_PRE_IO flag set, ext4_split_convert_extents() may
> convert the target extent range to initialized due to ENOSPC, ENOMEM, or
> EQUOTA errors. However, it still marks the mapping as incorrectly
> unwritten. Although this may not seem to cause any practical problems,
> it will result in an unnecessary extent conversion operation after I/O
> completion. Therefore, it's better to correct the returned mapping
> status.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/extents.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2db84f6b0056..19338f488550 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3916,6 +3916,8 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
>  
>  	/* get_block() before submitting IO, split the extent */
>  	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
> +		int depth;
> +
>  		path = ext4_split_convert_extents(handle, inode, map, path,
>  						  flags, allocated);
>  		if (IS_ERR(path))
> @@ -3931,7 +3933,13 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
>  			err = -EFSCORRUPTED;
>  			goto errout;
>  		}
> -		map->m_flags |= EXT4_MAP_UNWRITTEN;
> +		/* Don't mark unwritten if the extent has been zeroed out. */
> +		path = ext4_find_extent(inode, map->m_lblk, path, flags);
> +		if (IS_ERR(path))
> +			return path;
> +		depth = ext_depth(inode);
> +		if (ext4_ext_is_unwritten(path[depth].p_ext))
> +			map->m_flags |= EXT4_MAP_UNWRITTEN;
>  		goto out;
>  	}
>  	/* IO end_io complete, convert the filled extent to written */
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 06/13] ext4: don't cache extent during splitting extent
  2025-11-21  6:08 ` [PATCH v2 06/13] ext4: don't cache extent during splitting extent Zhang Yi
@ 2025-11-26 12:04   ` Ojaswin Mujoo
  2025-11-27  7:01     ` Zhang Yi
  0 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 12:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:04PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Caching extents during the splitting process is risky, as it may result
> in stale extents remaining in the status tree. Moreover, in most cases,
> the corresponding extent block entries are likely already cached before
> the split happens, making caching here not particularly useful.
> 
> Assume we have an unwritten extent, and then DIO writes the first half.
> 
>   [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
>   [UUUUUUUUUUUUUUUU] extent status tree
>   |<-   ->| ----> dio write this range
> 
> First, when ext4_split_extent_at() splits this extent, it truncates the
> existing extent and then inserts a new one. During this process, this
> extent status entry may be shrunk, and calls to ext4_find_extent() and
> ext4_cache_extents() may occur, which could potentially insert the
> truncated range as a hole into the extent status tree. After the split
> is completed, this hole is not replaced with the correct status.
> 
>   [UUUUUUU|UUUUUUUU] on-disk extent        U: unwritten extent
>   [UUUUUUU|HHHHHHHH] extent status tree    H: hole
> 
> Then, the outer calling functions will not correct this remaining hole
> extent either. Finally, if we perform a delayed buffer write on this
> latter part, it will re-insert the delayed extent and cause an error in
> space accounting.

Okay, makes sense. So one basic question, I see that in
ext4_ext_insert_extent() doesnt really care about updating es unless as a
side effect of ext4_find_extent().  For example, if we end up at goto
has_space; we don't add the new extent and niether do we update that the
exsisting extent has shrunk. 

Similarly, the splitting code also doesn't seem to care about the es
cache other than zeroout in the error handling.

Is there a reason for this? Do we expect the upper layers to maintain
the es cache?
> 
> In adition, if the unwritten extent cache is not shrunk during the
> splitting, ext4_cache_extents() also conflicts with existing extents
> when caching extents. In the future, we will add checks when caching
> extents, which will trigger a warning. Therefore, Do not cache extents
> that are being split.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/extents.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 19338f488550..2b5aec3f8882 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3199,6 +3199,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
>  	       (split_flag & EXT4_EXT_DATA_VALID2));
>  
> +	/* Do not cache extents that are in the process of being modified. */
> +	flags |= EXT4_EX_NOCACHE;
> +
>  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
>  
>  	ext4_ext_show_leaf(inode, path);
> @@ -3364,6 +3367,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  	ee_len = ext4_ext_get_actual_len(ex);
>  	unwritten = ext4_ext_is_unwritten(ex);
>  
> +	/* Do not cache extents that are in the process of being modified. */
> +	flags |= EXT4_EX_NOCACHE;
> +
>  	if (map->m_lblk + map->m_len < ee_block + ee_len) {
>  		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
>  		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 07/13] ext4: drop extent cache before splitting extent
  2025-11-21  6:08 ` [PATCH v2 07/13] ext4: drop extent cache before " Zhang Yi
@ 2025-11-26 12:24   ` Ojaswin Mujoo
  2025-11-27  7:27     ` Zhang Yi
  0 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-26 12:24 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:05PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When splitting an unwritten extent in the middle and converting it to
> initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
> EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.
> 
> Assume we have an unwritten file and buffered write in the middle of it
> without dioread_nolock enabled, it will allocate blocks as written
> extent.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU] on-disk extent      U: unwritten extent
>        [UUUUUUUUUUUU] extent status tree
>        [--DDDDDDDD--]                     D: valid data
>           |<-  ->| ----> this range needs to be initialized
> 
> ext4_split_extent() first try to split this extent at B with
> EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> ext4_split_extent_at() failed to split this extent due to temporary lack
> of space. It zeroout B to N and leave the entire extent as unwritten.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU] on-disk extent
>        [UUUUUUUUUUUU] extent status tree
>        [--DDDDDDDDZZ]                     Z: zeroed data
> 
> ext4_split_extent() then try to split this extent at A with
> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
> leave
> an written extent from A to N.

Hi Yi, 

thanks for the detailed description. I'm trying to understand the
codepath a bit and I believe you are talking about:

ext4_ext_handle_unwritten_extents()
  ext4_ext_convert_to_initialized()
	  // Case 5: split 1 unwrit into 3 parts and convert to init
		ext4_split_extent()

in which case, after the second split succeeds
> 
>        0  A      B   N
>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
>        [UU|UUUUUUUUUU] extent status tree

WHen will extent status get split into 2 unwrit extents as you show
above? I seem to be missing that call since IIUC ext4_ext_insert_extent
itself doesn't seem to be accounting for the newly inserted extent in es.

Regards,
ojaswin

>        [--|DDDDDDDDZZ]

> 
> Finally ext4_map_create_blocks() only insert extent A to B to the extent
> status tree, and leave an stale unwritten extent in the status tree.
> 
>        0  A      B   N
>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
>        [UU|WWWWWWWWUU] extent status tree
>        [--|DDDDDDDDZZ]
> 
> Fix this issue by always remove cached extent status entry before
> splitting extent.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/extents.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2b5aec3f8882..9bb80af4b5cf 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  	ee_len = ext4_ext_get_actual_len(ex);
>  	unwritten = ext4_ext_is_unwritten(ex);
>  
> +	/*
> +	 * Drop extent cache to prevent stale unwritten extents remaining
> +	 * after zeroing out.
> +	 */
> +	ext4_es_remove_extent(inode, ee_block, ee_len);
> +
>  	/* Do not cache extents that are in the process of being modified. */
>  	flags |= EXT4_EX_NOCACHE;
>  
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1
  2025-11-26 11:55     ` Ojaswin Mujoo
@ 2025-11-27  6:09       ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-27  6:09 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

Hi, thank you again for reviewing this series!

On 11/26/2025 7:55 PM, Ojaswin Mujoo wrote:
> On Wed, Nov 26, 2025 at 04:57:27PM +0530, Ojaswin Mujoo wrote:
>> On Fri, Nov 21, 2025 at 02:08:00PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and
>>> it is necessary to split the target extent in the middle,
>>> ext4_split_extent() first handles splitting the latter half of the
>>> extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that
>>> all blocks before the split point contain valid data; however, this
>>> assumption is incorrect.
>>>
>>> Therefore, subdivid EXT4_EXT_DATA_VALID1 into
>>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which
>>> indicate that the first half of the extent is either entirely valid or
>>> only partially valid, respectively. These two flags cannot be set
>>> simultaneously.
>>>
>>> This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces
>>> EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location
>>> where it is set, no logical changes.
>>>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>
>> Looks good, feel free to add:
>> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>
>>> ---
>>>  fs/ext4/extents.c | 18 ++++++++++++------
>>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>>> index 91682966597d..f7aa497e5d6c 100644
>>> --- a/fs/ext4/extents.c
>>> +++ b/fs/ext4/extents.c
>>> @@ -43,8 +43,13 @@
>>>  #define EXT4_EXT_MARK_UNWRIT1	0x2  /* mark first half unwritten */
>>>  #define EXT4_EXT_MARK_UNWRIT2	0x4  /* mark second half unwritten */
>>>  
>>> -#define EXT4_EXT_DATA_VALID1	0x8  /* first half contains valid data */
>>> -#define EXT4_EXT_DATA_VALID2	0x10 /* second half contains valid data */
>>> +/* first half contains valid data */
>>> +#define EXT4_EXT_DATA_ENTIRE_VALID1	0x8   /* has partially valid data */
>>> +#define EXT4_EXT_DATA_PARTIAL_VALID1	0x10  /* has entirely valid data */
> 
> Hey, sorry I forgot to mention this minor typo in my last email. The
> comment for partial and entirely valid flags are mismatched :)

Ha, right, I missed that, will fix in next iteration.

Thanks,
Yi.

> 
> Regards,
> ojaswin
> 
>>> +#define EXT4_EXT_DATA_VALID1		(EXT4_EXT_DATA_ENTIRE_VALID1 | \
>>> +					 EXT4_EXT_DATA_PARTIAL_VALID1)
>>> +
>>> +#define EXT4_EXT_DATA_VALID2	0x20 /* second half contains valid data */
>>>  
>>>  static __le32 ext4_extent_block_csum(struct inode *inode,
>>>  				     struct ext4_extent_header *eh)
>>> @@ -3190,8 +3195,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>>  	unsigned int ee_len, depth;
>>>  	int err = 0;
>>>  
>>> -	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
>>> -	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
>>> +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) == EXT4_EXT_DATA_VALID1);
>>> +	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
>>> +	       (split_flag & EXT4_EXT_DATA_VALID2));
>>>  
>>>  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
>>>  
>>> @@ -3358,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>>  				       EXT4_EXT_MARK_UNWRIT2;
>>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>>> -			split_flag1 |= EXT4_EXT_DATA_VALID1;
>>> +			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>>>  		path = ext4_split_extent_at(handle, inode, path,
>>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>>  		if (IS_ERR(path))
>>> @@ -3717,7 +3723,7 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
>>>  
>>>  	/* Convert to unwritten */
>>>  	if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
>>> -		split_flag |= EXT4_EXT_DATA_VALID1;
>>> +		split_flag |= EXT4_EXT_DATA_ENTIRE_VALID1;
>>>  	/* Convert to initialized */
>>>  	} else if (flags & EXT4_GET_BLOCKS_CONVERT) {
>>>  		split_flag |= ee_block + ee_len <= eof_block ?
>>> -- 
>>> 2.46.1
>>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-26 11:29   ` Ojaswin Mujoo
@ 2025-11-27  6:13     ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-27  6:13 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/26/2025 7:29 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:08:01PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When allocating initialized blocks from a large unwritten extent, or
>> when splitting an unwritten extent during end I/O and converting it to
>> initialized, there is currently a potential issue of stale data if the
>> extent needs to be split in the middle.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU]    U: unwritten extent
>>        [--DDDDDDDD--]    D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and mark the entire extent from 0 to N
>> as written.
>>
>>        0  A      B  N
>>        [WWWWWWWWWWWW]    W: written extent
>>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
>> a stale written extent from 0 to A.
>>
>>        0  A      B   N
>>        [WW|WWWWWWWWWW]
>>        [SS|DDDDDDDDZZ]
>>
>> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
>> when splitting at B, don't convert the entire extent to written and left
>> it as unwritten after zeroing out B to N. The remaining work is just
>> like the standard two-part split. ext4_split_extent() will pass the
>> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
>> second time, allowing it to properly handle the split. If the split is
>> successful, it will keep extent from 0 to A as unwritten.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Hi Yi,
> 
> This patch looks good to me. I'm just wondering since this is a stale
> data exposure that might need a backport, should we add a Fixes: tag 
> and also keep these fixes before the refactor in 1/13 so backport is
> easier.

Sure, I can move patch 01 after all the fix patches.

Thanks,
Yi.

> 
> Other than that, feel free to add:
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Regards,
> ojaswin
> 
>> ---
>>  fs/ext4/extents.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index f7aa497e5d6c..cafe66cb562f 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  		err = ext4_ext_zeroout(inode, &zero_ex);
>>  		if (err)
>>  			goto fix_extent_len;
>> +		/*
>> +		 * The first half contains partially valid data, the splitting
>> +		 * of this extent has not been completed, fix extent length
>> +		 * and ext4_split_extent() split will the first half again.
>> +		 */
>> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
>> +			goto fix_extent_len;
>>  
>>  		/* update the extent length and mark as initialized */
>>  		ex->ee_len = cpu_to_le16(ee_len);
>> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>  				       EXT4_EXT_MARK_UNWRIT2;
>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>> +			split_flag1 |= map->m_lblk > ee_block ?
>> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
>> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>>  		path = ext4_split_extent_at(handle, inode, path,
>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>  		if (IS_ERR(path))
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 06/13] ext4: don't cache extent during splitting extent
  2025-11-26 12:04   ` Ojaswin Mujoo
@ 2025-11-27  7:01     ` Zhang Yi
  2025-11-28  8:18       ` Ojaswin Mujoo
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-27  7:01 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/26/2025 8:04 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:08:04PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Caching extents during the splitting process is risky, as it may result
>> in stale extents remaining in the status tree. Moreover, in most cases,
>> the corresponding extent block entries are likely already cached before
>> the split happens, making caching here not particularly useful.
>>
>> Assume we have an unwritten extent, and then DIO writes the first half.
>>
>>   [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
>>   [UUUUUUUUUUUUUUUU] extent status tree
>>   |<-   ->| ----> dio write this range
>>
>> First, when ext4_split_extent_at() splits this extent, it truncates the
>> existing extent and then inserts a new one. During this process, this
>> extent status entry may be shrunk, and calls to ext4_find_extent() and
>> ext4_cache_extents() may occur, which could potentially insert the
>> truncated range as a hole into the extent status tree. After the split
>> is completed, this hole is not replaced with the correct status.
>>
>>   [UUUUUUU|UUUUUUUU] on-disk extent        U: unwritten extent
>>   [UUUUUUU|HHHHHHHH] extent status tree    H: hole
>>
>> Then, the outer calling functions will not correct this remaining hole
>> extent either. Finally, if we perform a delayed buffer write on this
>> latter part, it will re-insert the delayed extent and cause an error in
>> space accounting.
> 
> Okay, makes sense. So one basic question, I see that in
> ext4_ext_insert_extent() doesnt really care about updating es unless as a
> side effect of ext4_find_extent().  For example, if we end up at goto
> has_space; we don't add the new extent and niether do we update that the
> exsisting extent has shrunk. 
> 
> Similarly, the splitting code also doesn't seem to care about the es
> cache other than zeroout in the error handling.
> 
> Is there a reason for this? Do we expect the upper layers to maintain
> the es cache?

Yeah, if we don't consider the zeroout case caused by a failed split,
under typical circumstances, the ext4_es_insert_extent() in
ext4_map_create_blocks() is called to insert or update the cached
extent entries.

However, ext4_map_create_blocks() only insert or update
the range that the caller want to map, it can't know the actual
initialized range if this extent has been zeroed out, so we have to
update the extent cache in ext4_split_extent_at() for this special case.
Please see commit adb2355104b2 ("ext4: update extent status tree after
an extent is zeroed out") for details.

Unfortunately, the legacy scenario described in this patch remains
unhandled.

Cheers,
Yi.

>>
>> In adition, if the unwritten extent cache is not shrunk during the
>> splitting, ext4_cache_extents() also conflicts with existing extents
>> when caching extents. In the future, we will add checks when caching
>> extents, which will trigger a warning. Therefore, Do not cache extents
>> that are being split.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/extents.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 19338f488550..2b5aec3f8882 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3199,6 +3199,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
>>  	       (split_flag & EXT4_EXT_DATA_VALID2));
>>  
>> +	/* Do not cache extents that are in the process of being modified. */
>> +	flags |= EXT4_EX_NOCACHE;
>> +
>>  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
>>  
>>  	ext4_ext_show_leaf(inode, path);
>> @@ -3364,6 +3367,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  	ee_len = ext4_ext_get_actual_len(ex);
>>  	unwritten = ext4_ext_is_unwritten(ex);
>>  
>> +	/* Do not cache extents that are in the process of being modified. */
>> +	flags |= EXT4_EX_NOCACHE;
>> +
>>  	if (map->m_lblk + map->m_len < ee_block + ee_len) {
>>  		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
>>  		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 07/13] ext4: drop extent cache before splitting extent
  2025-11-26 12:24   ` Ojaswin Mujoo
@ 2025-11-27  7:27     ` Zhang Yi
  2025-11-28  8:16       ` Ojaswin Mujoo
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-27  7:27 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/26/2025 8:24 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:08:05PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When splitting an unwritten extent in the middle and converting it to
>> initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
>> EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.
>>
>> Assume we have an unwritten file and buffered write in the middle of it
>> without dioread_nolock enabled, it will allocate blocks as written
>> extent.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU] on-disk extent      U: unwritten extent
>>        [UUUUUUUUUUUU] extent status tree
>>        [--DDDDDDDD--]                     D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and leave the entire extent as unwritten.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU] on-disk extent
>>        [UUUUUUUUUUUU] extent status tree
>>        [--DDDDDDDDZZ]                     Z: zeroed data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
>> leave
>> an written extent from A to N.
> 
> Hi Yi, 
> 
> thanks for the detailed description. I'm trying to understand the
> codepath a bit and I believe you are talking about:
> 
> ext4_ext_handle_unwritten_extents()
>   ext4_ext_convert_to_initialized()
> 	  // Case 5: split 1 unwrit into 3 parts and convert to init
> 		ext4_split_extent()

Yes, but in fact, it should be Case 1: split the extent into three
extents.

> 
> in which case, after the second split succeeds
>>
>>        0  A      B   N
>>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
>>        [UU|UUUUUUUUUU] extent status tree
> 
> WHen will extent status get split into 2 unwrit extents as you show
> above? I seem to be missing that call since IIUC ext4_ext_insert_extent
> itself doesn't seem to be accounting for the newly inserted extent in es.
> 

Sorry for the confusion. This was drawn because I couldn't find a
suitable symbol, so I followed the representation method used for
on-disk extents. In fact, there is no splitting of extent status entries
here. I have updated the last two graphs as follows(different types of
extents are considered as different extents):

           0  A      B  N
           [UUWWWWWWWWWW] on-disk extent     W: written extent
           [UUUUUUUUUUUU] extent status tree
           [--DDDDDDDDZZ]

           0  A      B  N
           [UUWWWWWWWWWW] on-disk extent     W: written extent
           [UUWWWWWWWWUU] extent status tree
           [--DDDDDDDDZZ]

Will this make it easier to understand?

Cheers,
Yi.


> Regards,
> ojaswin
> 
>>        [--|DDDDDDDDZZ]
> 
>>
>> Finally ext4_map_create_blocks() only insert extent A to B to the extent
>> status tree, and leave an stale unwritten extent in the status tree.
>>
>>        0  A      B   N
>>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
>>        [UU|WWWWWWWWUU] extent status tree
>>        [--|DDDDDDDDZZ]
>>
>> Fix this issue by always remove cached extent status entry before
>> splitting extent.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/extents.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 2b5aec3f8882..9bb80af4b5cf 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  	ee_len = ext4_ext_get_actual_len(ex);
>>  	unwritten = ext4_ext_is_unwritten(ex);
>>  
>> +	/*
>> +	 * Drop extent cache to prevent stale unwritten extents remaining
>> +	 * after zeroing out.
>> +	 */
>> +	ext4_es_remove_extent(inode, ee_block, ee_len);
>> +
>>  	/* Do not cache extents that are in the process of being modified. */
>>  	flags |= EXT4_EX_NOCACHE;
>>  
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at()
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
  2025-11-26 11:26   ` Ojaswin Mujoo
@ 2025-11-27 12:02   ` Jan Kara
  2025-11-28  2:22     ` Zhang Yi
  2025-11-28  1:54   ` Baokun Li
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Kara @ 2025-11-27 12:02 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri 21-11-25 14:07:59, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> zero_ex is a temporary variable used only for writing zeros and
> inserting extent status entry, it will not be directly inserted into the
> tree. Therefore, it can be assigned values from the target extent in
> various scenarios, eliminating the need to explicitly assign values to
> each variable individually.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Nice simplification. I'd just note that the new method copies also the
unwritten state of the original extent to zero_ex (the old method didn't do
this). It doesn't matter in this case but it might still be nice to add a
comment about it before the code doing the copying. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents.c | 63 ++++++++++++++++++-----------------------------
>  1 file changed, 24 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c7d219e6c6d8..91682966597d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3278,46 +3278,31 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  	ex = path[depth].p_ext;
>  
>  	if (EXT4_EXT_MAY_ZEROOUT & split_flag) {
> -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> -			if (split_flag & EXT4_EXT_DATA_VALID1) {
> -				err = ext4_ext_zeroout(inode, ex2);
> -				zero_ex.ee_block = ex2->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex2));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex2));
> -			} else {
> -				err = ext4_ext_zeroout(inode, ex);
> -				zero_ex.ee_block = ex->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex));
> -			}
> -		} else {
> -			err = ext4_ext_zeroout(inode, &orig_ex);
> -			zero_ex.ee_block = orig_ex.ee_block;
> -			zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(&orig_ex));
> -			ext4_ext_store_pblock(&zero_ex,
> -					      ext4_ext_pblock(&orig_ex));
> -		}
> +		if (split_flag & EXT4_EXT_DATA_VALID1)
> +			memcpy(&zero_ex, ex2, sizeof(zero_ex));
> +		else if (split_flag & EXT4_EXT_DATA_VALID2)
> +			memcpy(&zero_ex, ex, sizeof(zero_ex));
> +		else
> +			memcpy(&zero_ex, &orig_ex, sizeof(zero_ex));
>  
> -		if (!err) {
> -			/* update the extent length and mark as initialized */
> -			ex->ee_len = cpu_to_le16(ee_len);
> -			ext4_ext_try_to_merge(handle, inode, path, ex);
> -			err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> -			if (!err)
> -				/* update extent status tree */
> -				ext4_zeroout_es(inode, &zero_ex);
> -			/* If we failed at this point, we don't know in which
> -			 * state the extent tree exactly is so don't try to fix
> -			 * length of the original extent as it may do even more
> -			 * damage.
> -			 */
> -			goto out;
> -		}
> +		err = ext4_ext_zeroout(inode, &zero_ex);
> +		if (err)
> +			goto fix_extent_len;
> +
> +		/* update the extent length and mark as initialized */
> +		ex->ee_len = cpu_to_le16(ee_len);
> +		ext4_ext_try_to_merge(handle, inode, path, ex);
> +		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +		if (!err)
> +			/* update extent status tree */
> +			ext4_zeroout_es(inode, &zero_ex);
> +		/*
> +		 * If we failed at this point, we don't know in which
> +		 * state the extent tree exactly is so don't try to fix
> +		 * length of the original extent as it may do even more
> +		 * damage.
> +		 */
> +		goto out;
>  	}
>  
>  fix_extent_len:
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-24  5:04   ` Zhang Yi
  2025-11-24 12:50     ` Ojaswin Mujoo
@ 2025-11-27 12:24     ` Jan Kara
  2025-11-28  4:37       ` Zhang Yi
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Kara @ 2025-11-27 12:24 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Ojaswin Mujoo, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, jack, yi.zhang, yizhang089, libaokun1, yangerkun

Hi Yi!

On Mon 24-11-25 13:04:04, Zhang Yi wrote:
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >>  - Rebase the codes based on the latest linux-next 20251120.
> >>  - Add patches 01-05, fix two stale data problems caused by
> > 
> > Hi Zhang, thanks for the patches.
> > 
> 
> Thank you for take time to look at this series.
> 
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess. 
> > 
> 
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.

Yep, that code is complex and prone to bugs.

> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> > slightly less messy.
> > 
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM. 
> > 
> > Would love to hear your thoughts on it.
> 
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.

Also partial extent zeroout offers a good performance win when the
portion needing zeroout is small (we can save extent splitting). And I
agree it is a good safety net for ENOSPC issues - otherwise there's no
guarantee page writeback can finish without hitting ENOSPC. We do have
reserved blocks for these cases but the pool is limited so you can still
run out of blocks if you try hard enough.

> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].

Yes, this might be worth exploring. The advantage of doing extent splitting
in advance is that on IO submission you have the opportunity of restarting
the transaction on ENOSPC to possibly release some blocks. This is not
easily doable e.g. on writeback completion so the pressure on the pool of
reserved blocks is going to be more common (previously you needed reserved
blocks only when writeback was racing with fallocate or similar, now you
may need them each time you write in the middle of unwritten extent). So I
think the change will need some testing whether it isn't too easy to hit
ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT.
But otherwise it's always nice to remove code :)
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent()
  2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
@ 2025-11-27 12:43   ` Jan Kara
  2025-11-28  8:20   ` Ojaswin Mujoo
  1 sibling, 0 replies; 52+ messages in thread
From: Jan Kara @ 2025-11-27 12:43 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri 21-11-25 14:08:06, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The out tag in __es_remove_extent() is just return err value, we can
          ^^^ this should be 'label'

> return it directly if something bad happens. Therefore, remove the
> useless out tag and rename out_get_reserved to out.
              ^^^ label

> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Otherwise looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents_status.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index e04fbf10fe4f..04d56f8f6c0c 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -1434,7 +1434,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  	struct extent_status orig_es;
>  	ext4_lblk_t len1, len2;
>  	ext4_fsblk_t block;
> -	int err = 0;
> +	int err;
>  	bool count_reserved = true;
>  	struct rsvd_count rc;
>  
> @@ -1443,9 +1443,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  
>  	es = __es_tree_search(&tree->root, lblk);
>  	if (!es)
> -		goto out;
> +		return 0;
>  	if (es->es_lblk > end)
> -		goto out;
> +		return 0;
>  
>  	/* Simply invalidate cache_es. */
>  	tree->cache_es = NULL;
> @@ -1480,7 +1480,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  
>  				es->es_lblk = orig_es.es_lblk;
>  				es->es_len = orig_es.es_len;
> -				goto out;
> +				return err;
>  			}
>  		} else {
>  			es->es_lblk = end + 1;
> @@ -1494,7 +1494,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  		if (count_reserved)
>  			count_rsvd(inode, orig_es.es_lblk + len1,
>  				   orig_es.es_len - len1 - len2, &orig_es, &rc);
> -		goto out_get_reserved;
> +		goto out;
>  	}
>  
>  	if (len1 > 0) {
> @@ -1536,11 +1536,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  		}
>  	}
>  
> -out_get_reserved:
> +out:
>  	if (count_reserved)
>  		*reserved = get_rsvd(inode, end, es, &rc);
> -out:
> -	return err;
> +	return 0;
>  }
>  
>  /*
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
  2025-11-26 11:29   ` Ojaswin Mujoo
@ 2025-11-27 13:41   ` Jan Kara
  2025-11-28  3:45     ` Zhang Yi
  2025-11-28  7:28     ` Ojaswin Mujoo
  1 sibling, 2 replies; 52+ messages in thread
From: Jan Kara @ 2025-11-27 13:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri 21-11-25 14:08:01, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When allocating initialized blocks from a large unwritten extent, or
> when splitting an unwritten extent during end I/O and converting it to
> initialized, there is currently a potential issue of stale data if the
> extent needs to be split in the middle.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU]    U: unwritten extent
>        [--DDDDDDDD--]    D: valid data
>           |<-  ->| ----> this range needs to be initialized
> 
> ext4_split_extent() first try to split this extent at B with
> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> ext4_split_extent_at() failed to split this extent due to temporary lack
> of space. It zeroout B to N and mark the entire extent from 0 to N
> as written.
> 
>        0  A      B  N
>        [WWWWWWWWWWWW]    W: written extent
>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> 
> ext4_split_extent() then try to split this extent at A with
> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> a stale written extent from 0 to A.
> 
>        0  A      B   N
>        [WW|WWWWWWWWWW]
>        [SS|DDDDDDDDZZ]
> 
> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> when splitting at B, don't convert the entire extent to written and left
> it as unwritten after zeroing out B to N. The remaining work is just
> like the standard two-part split. ext4_split_extent() will pass the
> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> second time, allowing it to properly handle the split. If the split is
> successful, it will keep extent from 0 to A as unwritten.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Good catch on the data exposure issue! First I'd like to discuss whether
there isn't a way to fix these problems in a way that doesn't make the
already complex code even more complex. My observation is that
EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
in ext4_split_convert_extents() which both call ext4_split_extent(). The
actual extent zeroing happens in ext4_split_extent_at() and in
ext4_ext_convert_to_initialized(). I think the code would be much clearer
if we just centralized all the zeroing in ext4_split_extent(). At that
place the situation is actually pretty simple:

1) 'ex' is unwritten, 'map' describes part with already written data which
we want to convert to initialized (generally IO completion situation) => we
can zero out boundaries if they are smaller than max_zeroout or if extent
split fails.

2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
submission) => the split is opportunistic here, if we cannot split due to
ENOSPC, just go on and deal with it at IO completion time. No zeroing
needed.

3) 'ex' is written, 'map' describes part that should be converted to
unwritten => we can zero out the 'map' part if smaller than max_zeroout or
if extent split fails.

This should all result in a relatively straightforward code where we can
distinguish the three cases based on 'ex' and passed flags, we should be
able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
the data exposure issues at the same time. What do you think? Am I missing
some case?

								Honza

> ---
>  fs/ext4/extents.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f7aa497e5d6c..cafe66cb562f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  		err = ext4_ext_zeroout(inode, &zero_ex);
>  		if (err)
>  			goto fix_extent_len;
> +		/*
> +		 * The first half contains partially valid data, the splitting
> +		 * of this extent has not been completed, fix extent length
> +		 * and ext4_split_extent() split will the first half again.
> +		 */
> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> +			goto fix_extent_len;
>  
>  		/* update the extent length and mark as initialized */
>  		ex->ee_len = cpu_to_le16(ee_len);
> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>  				       EXT4_EXT_MARK_UNWRIT2;
>  		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> +			split_flag1 |= map->m_lblk > ee_block ?
> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>  		path = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (IS_ERR(path))
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at()
  2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
  2025-11-26 11:26   ` Ojaswin Mujoo
  2025-11-27 12:02   ` Jan Kara
@ 2025-11-28  1:54   ` Baokun Li
  2 siblings, 0 replies; 52+ messages in thread
From: Baokun Li @ 2025-11-28  1:54 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, yangerkun

On 2025-11-21 14:07, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> zero_ex is a temporary variable used only for writing zeros and
> inserting extent status entry, it will not be directly inserted into the
> tree. Therefore, it can be assigned values from the target extent in
> various scenarios, eliminating the need to explicitly assign values to
> each variable individually.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Nice cleanup! Feel free to add:

Reviewed-by: Baokun Li <libaokun1@huawei.com>

> ---
>  fs/ext4/extents.c | 63 ++++++++++++++++++-----------------------------
>  1 file changed, 24 insertions(+), 39 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c7d219e6c6d8..91682966597d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3278,46 +3278,31 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  	ex = path[depth].p_ext;
>  
>  	if (EXT4_EXT_MAY_ZEROOUT & split_flag) {
> -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> -			if (split_flag & EXT4_EXT_DATA_VALID1) {
> -				err = ext4_ext_zeroout(inode, ex2);
> -				zero_ex.ee_block = ex2->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex2));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex2));
> -			} else {
> -				err = ext4_ext_zeroout(inode, ex);
> -				zero_ex.ee_block = ex->ee_block;
> -				zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(ex));
> -				ext4_ext_store_pblock(&zero_ex,
> -						      ext4_ext_pblock(ex));
> -			}
> -		} else {
> -			err = ext4_ext_zeroout(inode, &orig_ex);
> -			zero_ex.ee_block = orig_ex.ee_block;
> -			zero_ex.ee_len = cpu_to_le16(
> -						ext4_ext_get_actual_len(&orig_ex));
> -			ext4_ext_store_pblock(&zero_ex,
> -					      ext4_ext_pblock(&orig_ex));
> -		}
> +		if (split_flag & EXT4_EXT_DATA_VALID1)
> +			memcpy(&zero_ex, ex2, sizeof(zero_ex));
> +		else if (split_flag & EXT4_EXT_DATA_VALID2)
> +			memcpy(&zero_ex, ex, sizeof(zero_ex));
> +		else
> +			memcpy(&zero_ex, &orig_ex, sizeof(zero_ex));
>  
> -		if (!err) {
> -			/* update the extent length and mark as initialized */
> -			ex->ee_len = cpu_to_le16(ee_len);
> -			ext4_ext_try_to_merge(handle, inode, path, ex);
> -			err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> -			if (!err)
> -				/* update extent status tree */
> -				ext4_zeroout_es(inode, &zero_ex);
> -			/* If we failed at this point, we don't know in which
> -			 * state the extent tree exactly is so don't try to fix
> -			 * length of the original extent as it may do even more
> -			 * damage.
> -			 */
> -			goto out;
> -		}
> +		err = ext4_ext_zeroout(inode, &zero_ex);
> +		if (err)
> +			goto fix_extent_len;
> +
> +		/* update the extent length and mark as initialized */
> +		ex->ee_len = cpu_to_le16(ee_len);
> +		ext4_ext_try_to_merge(handle, inode, path, ex);
> +		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +		if (!err)
> +			/* update extent status tree */
> +			ext4_zeroout_es(inode, &zero_ex);
> +		/*
> +		 * If we failed at this point, we don't know in which
> +		 * state the extent tree exactly is so don't try to fix
> +		 * length of the original extent as it may do even more
> +		 * damage.
> +		 */
> +		goto out;
>  	}
>  
>  fix_extent_len:



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at()
  2025-11-27 12:02   ` Jan Kara
@ 2025-11-28  2:22     ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-28  2:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	yi.zhang, yizhang089, libaokun1, yangerkun

On 11/27/2025 8:02 PM, Jan Kara wrote:
> On Fri 21-11-25 14:07:59, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> zero_ex is a temporary variable used only for writing zeros and
>> inserting extent status entry, it will not be directly inserted into the
>> tree. Therefore, it can be assigned values from the target extent in
>> various scenarios, eliminating the need to explicitly assign values to
>> each variable individually.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Nice simplification. I'd just note that the new method copies also the
> unwritten state of the original extent to zero_ex (the old method didn't do
> this). It doesn't matter in this case but it might still be nice to add a
> comment about it before the code doing the copying. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> 								Honza

Thank you a lot for reviewing this series! It seems that calling
ext4_ext_mark_initialized() after copying is also acceptable.

Cheers,
Yi.

> 
>> ---
>>  fs/ext4/extents.c | 63 ++++++++++++++++++-----------------------------
>>  1 file changed, 24 insertions(+), 39 deletions(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index c7d219e6c6d8..91682966597d 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3278,46 +3278,31 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  	ex = path[depth].p_ext;
>>  
>>  	if (EXT4_EXT_MAY_ZEROOUT & split_flag) {
>> -		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
>> -			if (split_flag & EXT4_EXT_DATA_VALID1) {
>> -				err = ext4_ext_zeroout(inode, ex2);
>> -				zero_ex.ee_block = ex2->ee_block;
>> -				zero_ex.ee_len = cpu_to_le16(
>> -						ext4_ext_get_actual_len(ex2));
>> -				ext4_ext_store_pblock(&zero_ex,
>> -						      ext4_ext_pblock(ex2));
>> -			} else {
>> -				err = ext4_ext_zeroout(inode, ex);
>> -				zero_ex.ee_block = ex->ee_block;
>> -				zero_ex.ee_len = cpu_to_le16(
>> -						ext4_ext_get_actual_len(ex));
>> -				ext4_ext_store_pblock(&zero_ex,
>> -						      ext4_ext_pblock(ex));
>> -			}
>> -		} else {
>> -			err = ext4_ext_zeroout(inode, &orig_ex);
>> -			zero_ex.ee_block = orig_ex.ee_block;
>> -			zero_ex.ee_len = cpu_to_le16(
>> -						ext4_ext_get_actual_len(&orig_ex));
>> -			ext4_ext_store_pblock(&zero_ex,
>> -					      ext4_ext_pblock(&orig_ex));
>> -		}
>> +		if (split_flag & EXT4_EXT_DATA_VALID1)
>> +			memcpy(&zero_ex, ex2, sizeof(zero_ex));
>> +		else if (split_flag & EXT4_EXT_DATA_VALID2)
>> +			memcpy(&zero_ex, ex, sizeof(zero_ex));
>> +		else
>> +			memcpy(&zero_ex, &orig_ex, sizeof(zero_ex));
>>  
>> -		if (!err) {
>> -			/* update the extent length and mark as initialized */
>> -			ex->ee_len = cpu_to_le16(ee_len);
>> -			ext4_ext_try_to_merge(handle, inode, path, ex);
>> -			err = ext4_ext_dirty(handle, inode, path + path->p_depth);
>> -			if (!err)
>> -				/* update extent status tree */
>> -				ext4_zeroout_es(inode, &zero_ex);
>> -			/* If we failed at this point, we don't know in which
>> -			 * state the extent tree exactly is so don't try to fix
>> -			 * length of the original extent as it may do even more
>> -			 * damage.
>> -			 */
>> -			goto out;
>> -		}
>> +		err = ext4_ext_zeroout(inode, &zero_ex);
>> +		if (err)
>> +			goto fix_extent_len;
>> +
>> +		/* update the extent length and mark as initialized */
>> +		ex->ee_len = cpu_to_le16(ee_len);
>> +		ext4_ext_try_to_merge(handle, inode, path, ex);
>> +		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
>> +		if (!err)
>> +			/* update extent status tree */
>> +			ext4_zeroout_es(inode, &zero_ex);
>> +		/*
>> +		 * If we failed at this point, we don't know in which
>> +		 * state the extent tree exactly is so don't try to fix
>> +		 * length of the original extent as it may do even more
>> +		 * damage.
>> +		 */
>> +		goto out;
>>  	}
>>  
>>  fix_extent_len:
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-27 13:41   ` Jan Kara
@ 2025-11-28  3:45     ` Zhang Yi
  2025-11-28 10:58       ` Jan Kara
  2025-11-28  7:28     ` Ojaswin Mujoo
  1 sibling, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-28  3:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	yi.zhang, yizhang089, libaokun1, yangerkun

On 11/27/2025 9:41 PM, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When allocating initialized blocks from a large unwritten extent, or
>> when splitting an unwritten extent during end I/O and converting it to
>> initialized, there is currently a potential issue of stale data if the
>> extent needs to be split in the middle.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU]    U: unwritten extent
>>        [--DDDDDDDD--]    D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and mark the entire extent from 0 to N
>> as written.
>>
>>        0  A      B  N
>>        [WWWWWWWWWWWW]    W: written extent
>>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
>> a stale written extent from 0 to A.
>>
>>        0  A      B   N
>>        [WW|WWWWWWWWWW]
>>        [SS|DDDDDDDDZZ]
>>
>> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
>> when splitting at B, don't convert the entire extent to written and left
>> it as unwritten after zeroing out B to N. The remaining work is just
>> like the standard two-part split. ext4_split_extent() will pass the
>> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
>> second time, allowing it to properly handle the split. If the split is
>> successful, it will keep extent from 0 to A as unwritten.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized().

Yes.

> I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

Thank you for your suggestion!

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.
> 

Yes. Agree.

> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.

Yes. At the same time, if we can indeed move the entire split unwritten
operation to be handled after I/O completion in the future, it would also be
more convenient to remove this segment of logic.

> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

This makes sense to me! This case it originates from the fallocate with Zero
Range operation. Currently, the zero-out operation will not be performed if
the split operation fails, instead, it immediately returns a failure.

I agree with you that we can do zero out if the 'map' part smaller than
max_zeroout instead of split extents. However, if the 'map' part is bigger
than max_zeroout and if extent split fails, I don't think zero out is a good
idea, Because it might cause zero-range calls to take a long time to execute.
Although fallocate doesn't explicitly specify how ZERO_RANGE should be
implemented, users expect it to be very fast. Therefore, in this case, if the
split fails, it would be better to simply return an error, leave things as
they are. What do you think?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 

Indeed, I think the overall solution is a nice cleanup idea. :-)
But this would involve a significant amount of refactoring and logical changes.
Could we first merge the current set of patches(it could be more easier to
backport to the early LTS version), and then I can start a new series to
address this optimization?

Cheers,
Yi.

> 								Honza
> 
>> ---
>>  fs/ext4/extents.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index f7aa497e5d6c..cafe66cb562f 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  		err = ext4_ext_zeroout(inode, &zero_ex);
>>  		if (err)
>>  			goto fix_extent_len;
>> +		/*
>> +		 * The first half contains partially valid data, the splitting
>> +		 * of this extent has not been completed, fix extent length
>> +		 * and ext4_split_extent() split will the first half again.
>> +		 */
>> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
>> +			goto fix_extent_len;
>>  
>>  		/* update the extent length and mark as initialized */
>>  		ex->ee_len = cpu_to_le16(ee_len);
>> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>  				       EXT4_EXT_MARK_UNWRIT2;
>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>> +			split_flag1 |= map->m_lblk > ee_block ?
>> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
>> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>>  		path = ext4_split_extent_at(handle, inode, path,
>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>  		if (IS_ERR(path))
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-27 12:24     ` Jan Kara
@ 2025-11-28  4:37       ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-28  4:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ojaswin Mujoo, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/27/2025 8:24 PM, Jan Kara wrote:
> Hi Yi!
> 
> On Mon 24-11-25 13:04:04, Zhang Yi wrote:
>> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>>>> Changes since v1:
>>>>  - Rebase the codes based on the latest linux-next 20251120.
>>>>  - Add patches 01-05, fix two stale data problems caused by
>>>
>>> Hi Zhang, thanks for the patches.
>>>
>>
>> Thank you for take time to look at this series.
>>
>>> I've always felt uncomfortable with the ZEROOUT code here because it
>>> seems to have many such bugs as you pointed out in the series. Its very
>>> fragile and the bugs are easy to miss behind all the data valid and
>>> split flags mess. 
>>>
>>
>> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
>> significantly increased the complexity of split extents and the
>> potential for bugs.
> 
> Yep, that code is complex and prone to bugs.
> 
>>> As per my understanding, ZEROOUT logic seems to be a special best-effort
>>> try to make the split/convert operation "work" when dealing with
>>> transient errors like ENOSPC etc. I was just wondering if it makes sense
>>> to just get rid of the whole ZEROOUT logic completely and just reset the
>>> extent to orig state if there is any error. This allows us to get rid of
>>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
>>> slightly less messy.
>>>
>>> Maybe we can have a retry loop at the top level caller if we want to try
>>> again for say ENOSPC or ENOMEM. 
>>>
>>> Would love to hear your thoughts on it.
>>
>> I think this is a direction worth exploring. However, what I am
>> currently considering is that we need to address this scenario of
>> splitting extent during the I/O completion. Although the ZEROOUT logic
>> is fragile and has many issues recently, it currently serves as a
>> fallback solution for handling ENOSPC errors that arise when splitting
>> extents during I/O completion. It ensures that I/O operations do not
>> fail due to insufficient extent blocks.
> 
> Also partial extent zeroout offers a good performance win when the
> portion needing zeroout is small (we can save extent splitting). And I
> agree it is a good safety net for ENOSPC issues - otherwise there's no
> guarantee page writeback can finish without hitting ENOSPC. We do have
> reserved blocks for these cases but the pool is limited so you can still
> run out of blocks if you try hard enough.

Yes, Indeed!

> 
>> Please see ext4_convert_unwritten_extents_endio(). Although we have made
>> our best effort to tried to split extents using
>> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
>> covered all scenarios. Moreover, after converting the buffered I/O path
>> to the iomap infrastructure in the future, we may need to split extents
>> during the I/O completion worker[1].
> 
> Yes, this might be worth exploring. The advantage of doing extent splitting
> in advance is that on IO submission you have the opportunity of restarting
> the transaction on ENOSPC to possibly release some blocks. This is not
> easily doable e.g. on writeback completion so the pressure on the pool of
> reserved blocks is going to be more common (previously you needed reserved
> blocks only when writeback was racing with fallocate or similar, now you
> may need them each time you write in the middle of unwritten extent). So I
> think the change will need some testing whether it isn't too easy to hit
> ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT.
> But otherwise it's always nice to remove code :)
>  
> 								Honza

Yes, we need to be very careful about this, I have written a POC patch and
am currently conducting related tests. I hope it works.  :-)

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-27 13:41   ` Jan Kara
  2025-11-28  3:45     ` Zhang Yi
@ 2025-11-28  7:28     ` Ojaswin Mujoo
  2025-11-28 11:14       ` Jan Kara
  1 sibling, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-28  7:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, yi.zhang, yizhang089, libaokun1, yangerkun

On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > When allocating initialized blocks from a large unwritten extent, or
> > when splitting an unwritten extent during end I/O and converting it to
> > initialized, there is currently a potential issue of stale data if the
> > extent needs to be split in the middle.
> > 
> >        0  A      B  N
> >        [UUUUUUUUUUUU]    U: unwritten extent
> >        [--DDDDDDDD--]    D: valid data
> >           |<-  ->| ----> this range needs to be initialized
> > 
> > ext4_split_extent() first try to split this extent at B with
> > EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> > ext4_split_extent_at() failed to split this extent due to temporary lack
> > of space. It zeroout B to N and mark the entire extent from 0 to N
> > as written.
> > 
> >        0  A      B  N
> >        [WWWWWWWWWWWW]    W: written extent
> >        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> > 
> > ext4_split_extent() then try to split this extent at A with
> > EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> > a stale written extent from 0 to A.
> > 
> >        0  A      B   N
> >        [WW|WWWWWWWWWW]
> >        [SS|DDDDDDDDZZ]
> > 
> > Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> > when splitting at B, don't convert the entire extent to written and left
> > it as unwritten after zeroing out B to N. The remaining work is just
> > like the standard two-part split. ext4_split_extent() will pass the
> > EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> > second time, allowing it to properly handle the split. If the split is
> > successful, it will keep extent from 0 to A as unwritten.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

This is exactly what I was playing with in my local tree to refactor this
particular part of code :). I agree that ext4_split_extent() is a much
better place to do the zeroout and it looks much cleaner but I agree
with Yi that it might be better to do it after fixing the stale
exposures so backports are straight forward. 

Am I correct in understanding that you are suggesting to zeroout
proactively if we are below max_zeroout before even trying to extent
split (which seems be done in ext4_ext_convert_to_initialized() as well)?

In this case, I have 2 concerns:

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.

Firstly, I know you mentioned in another email that zeroout of small ranges
gives us a performance win but is it really faster on average than
extent manipulation?

For example, for case 1 where both zeroout and splitting need
journalling, I understand that splitting has high journal overhead in worst case,
where tree might grow, but more often than not we would be manipulating
within the same leaf so journalling only 1 bh (same as zeroout). In which case
seems like zeroout might be slower no matter how fast the IO can be
done. So proactive zeroout might be for beneficial for case 3 than case
1.

> 
> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.
> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

Proactive zeroout before trying split does seem benficial to help us
avoid journal overhead for split. However, judging from
ext4_ext_convert_to_initialized(), max zeroout comes from
sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
the IO device, so that means theres a chance a zeroout might be pretty
slow if say we are doing it on a device than doesn't support accelerated
zeroout operations. Maybe we need to be more intelligent in setting
s_extent_max_zeroout_kb?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 
> 								Honza
> 
> > ---
> >  fs/ext4/extents.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index f7aa497e5d6c..cafe66cb562f 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
> >  		err = ext4_ext_zeroout(inode, &zero_ex);
> >  		if (err)
> >  			goto fix_extent_len;
> > +		/*
> > +		 * The first half contains partially valid data, the splitting
> > +		 * of this extent has not been completed, fix extent length
> > +		 * and ext4_split_extent() split will the first half again.
> > +		 */
> > +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> > +			goto fix_extent_len;
> >  
> >  		/* update the extent length and mark as initialized */
> >  		ex->ee_len = cpu_to_le16(ee_len);
> > @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
> >  				       EXT4_EXT_MARK_UNWRIT2;
> >  		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> > +			split_flag1 |= map->m_lblk > ee_block ?
> > +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> > +				       EXT4_EXT_DATA_ENTIRE_VALID1;
> >  		path = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (IS_ERR(path))
> > -- 
> > 2.46.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 07/13] ext4: drop extent cache before splitting extent
  2025-11-27  7:27     ` Zhang Yi
@ 2025-11-28  8:16       ` Ojaswin Mujoo
  2025-11-29  1:36         ` Zhang Yi
  0 siblings, 1 reply; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-28  8:16 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Thu, Nov 27, 2025 at 03:27:26PM +0800, Zhang Yi wrote:
> On 11/26/2025 8:24 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:08:05PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> When splitting an unwritten extent in the middle and converting it to
> >> initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
> >> EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.
> >>
> >> Assume we have an unwritten file and buffered write in the middle of it
> >> without dioread_nolock enabled, it will allocate blocks as written
> >> extent.
> >>
> >>        0  A      B  N
> >>        [UUUUUUUUUUUU] on-disk extent      U: unwritten extent
> >>        [UUUUUUUUUUUU] extent status tree
> >>        [--DDDDDDDD--]                     D: valid data
> >>           |<-  ->| ----> this range needs to be initialized
> >>
> >> ext4_split_extent() first try to split this extent at B with
> >> EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> >> ext4_split_extent_at() failed to split this extent due to temporary lack
> >> of space. It zeroout B to N and leave the entire extent as unwritten.
> >>
> >>        0  A      B  N
> >>        [UUUUUUUUUUUU] on-disk extent
> >>        [UUUUUUUUUUUU] extent status tree
> >>        [--DDDDDDDDZZ]                     Z: zeroed data
> >>
> >> ext4_split_extent() then try to split this extent at A with
> >> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
> >> leave
> >> an written extent from A to N.
> > 
> > Hi Yi, 
> > 
> > thanks for the detailed description. I'm trying to understand the
> > codepath a bit and I believe you are talking about:
> > 
> > ext4_ext_handle_unwritten_extents()
> >   ext4_ext_convert_to_initialized()
> > 	  // Case 5: split 1 unwrit into 3 parts and convert to init
> > 		ext4_split_extent()
> 
> Yes, but in fact, it should be Case 1: split the extent into three
> extents.

Yes right my bad :) 

> 
> > 
> > in which case, after the second split succeeds
> >>
> >>        0  A      B   N
> >>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
> >>        [UU|UUUUUUUUUU] extent status tree
> > 
> > WHen will extent status get split into 2 unwrit extents as you show
> > above? I seem to be missing that call since IIUC ext4_ext_insert_extent
> > itself doesn't seem to be accounting for the newly inserted extent in es.
> > 
> 
> Sorry for the confusion. This was drawn because I couldn't find a
> suitable symbol, so I followed the representation method used for
> on-disk extents. In fact, there is no splitting of extent status entries
> here. I have updated the last two graphs as follows(different types of
> extents are considered as different extents):
> 
>            0  A      B  N
>            [UUWWWWWWWWWW] on-disk extent     W: written extent
>            [UUUUUUUUUUUU] extent status tree
>            [--DDDDDDDDZZ]
> 
>            0  A      B  N
>            [UUWWWWWWWWWW] on-disk extent     W: written extent
>            [UUWWWWWWWWUU] extent status tree
>            [--DDDDDDDDZZ]

Thanks for confirming, I think according to our representation, the
following is what happens:

            0  A      B  N
            [UU|WWWWWWWWWW] on-disk extent   W: written extent  <--split
            [UUUUUUUUUUUUU] extent status tree   <---- no split
            [---DDDDDDDDZZ]
 
            0  A      B  N
            [UU|WWWWWWWWWW] on-disk extent   W: written extent  <--split
            [UU|WWWWWWW|UU] extent status tree   <--- split 
            [---DDDDDDDZZZ]

Anyways, I just had this doubt while trying to understand this codepath
so thanks for clarifying :)

> 
> Will this make it easier to understand?
> 
> Cheers,
> Yi.
> 
> 
> > Regards,
> > ojaswin
> > 
> >>        [--|DDDDDDDDZZ]
> > 
> >>
> >> Finally ext4_map_create_blocks() only insert extent A to B to the extent
> >> status tree, and leave an stale unwritten extent in the status tree.
> >>
> >>        0  A      B   N
> >>        [UU|WWWWWWWWWW] on-disk extent     W: written extent
> >>        [UU|WWWWWWWWUU] extent status tree
> >>        [--|DDDDDDDDZZ]
> >>
> >> Fix this issue by always remove cached extent status entry before
> >> splitting extent.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/extents.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index 2b5aec3f8882..9bb80af4b5cf 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >>  	ee_len = ext4_ext_get_actual_len(ex);
> >>  	unwritten = ext4_ext_is_unwritten(ex);
> >>  
> >> +	/*
> >> +	 * Drop extent cache to prevent stale unwritten extents remaining
> >> +	 * after zeroing out.
> >> +	 */
> >> +	ext4_es_remove_extent(inode, ee_block, ee_len);
> >> +

Okay this makes sense, there are many different combinations of how the
on disk extents might turn out and if will become complicated to keep
the es in sync to those, so this seems simpler.

There might be a small performance penalty of dropping the es here tho
but idk if it's anything measurable. As a middle ground do you think it
makes more sense to drop the es cache in ext4_split_extent_at() instead,
when we know we are about to go for zeroout. Since the default non
zeroout path seems to be okay?

Regards,
ojaswin



> >>  	/* Do not cache extents that are in the process of being modified. */
> >>  	flags |= EXT4_EX_NOCACHE;
> >>  
> >> -- 
> >> 2.46.1
> >>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 06/13] ext4: don't cache extent during splitting extent
  2025-11-27  7:01     ` Zhang Yi
@ 2025-11-28  8:18       ` Ojaswin Mujoo
  0 siblings, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-28  8:18 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Thu, Nov 27, 2025 at 03:01:27PM +0800, Zhang Yi wrote:
> On 11/26/2025 8:04 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:08:04PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Caching extents during the splitting process is risky, as it may result
> >> in stale extents remaining in the status tree. Moreover, in most cases,
> >> the corresponding extent block entries are likely already cached before
> >> the split happens, making caching here not particularly useful.
> >>
> >> Assume we have an unwritten extent, and then DIO writes the first half.
> >>
> >>   [UUUUUUUUUUUUUUUU] on-disk extent        U: unwritten extent
> >>   [UUUUUUUUUUUUUUUU] extent status tree
> >>   |<-   ->| ----> dio write this range
> >>
> >> First, when ext4_split_extent_at() splits this extent, it truncates the
> >> existing extent and then inserts a new one. During this process, this
> >> extent status entry may be shrunk, and calls to ext4_find_extent() and
> >> ext4_cache_extents() may occur, which could potentially insert the
> >> truncated range as a hole into the extent status tree. After the split
> >> is completed, this hole is not replaced with the correct status.
> >>
> >>   [UUUUUUU|UUUUUUUU] on-disk extent        U: unwritten extent
> >>   [UUUUUUU|HHHHHHHH] extent status tree    H: hole
> >>
> >> Then, the outer calling functions will not correct this remaining hole
> >> extent either. Finally, if we perform a delayed buffer write on this
> >> latter part, it will re-insert the delayed extent and cause an error in
> >> space accounting.
> > 
> > Okay, makes sense. So one basic question, I see that in
> > ext4_ext_insert_extent() doesnt really care about updating es unless as a
> > side effect of ext4_find_extent().  For example, if we end up at goto
> > has_space; we don't add the new extent and niether do we update that the
> > exsisting extent has shrunk. 
> > 
> > Similarly, the splitting code also doesn't seem to care about the es
> > cache other than zeroout in the error handling.
> > 
> > Is there a reason for this? Do we expect the upper layers to maintain
> > the es cache?
> 
> Yeah, if we don't consider the zeroout case caused by a failed split,
> under typical circumstances, the ext4_es_insert_extent() in
> ext4_map_create_blocks() is called to insert or update the cached
> extent entries.
> 
> However, ext4_map_create_blocks() only insert or update
> the range that the caller want to map, it can't know the actual
> initialized range if this extent has been zeroed out, so we have to
> update the extent cache in ext4_split_extent_at() for this special case.
> Please see commit adb2355104b2 ("ext4: update extent status tree after
> an extent is zeroed out") for details.
> 
> Unfortunately, the legacy scenario described in this patch remains
> unhandled.

Got it thanks for the details.

Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> 
> Cheers,
> Yi.
> 
> >>
> >> In adition, if the unwritten extent cache is not shrunk during the
> >> splitting, ext4_cache_extents() also conflicts with existing extents
> >> when caching extents. In the future, we will add checks when caching
> >> extents, which will trigger a warning. Therefore, Do not cache extents
> >> that are being split.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/extents.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index 19338f488550..2b5aec3f8882 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -3199,6 +3199,9 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
> >>  	BUG_ON((split_flag & EXT4_EXT_DATA_VALID1) &&
> >>  	       (split_flag & EXT4_EXT_DATA_VALID2));
> >>  
> >> +	/* Do not cache extents that are in the process of being modified. */
> >> +	flags |= EXT4_EX_NOCACHE;
> >> +
> >>  	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
> >>  
> >>  	ext4_ext_show_leaf(inode, path);
> >> @@ -3364,6 +3367,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >>  	ee_len = ext4_ext_get_actual_len(ex);
> >>  	unwritten = ext4_ext_is_unwritten(ex);
> >>  
> >> +	/* Do not cache extents that are in the process of being modified. */
> >> +	flags |= EXT4_EX_NOCACHE;
> >> +
> >>  	if (map->m_lblk + map->m_len < ee_block + ee_len) {
> >>  		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
> >>  		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
> >> -- 
> >> 2.46.1
> >>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent()
  2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
  2025-11-27 12:43   ` Jan Kara
@ 2025-11-28  8:20   ` Ojaswin Mujoo
  1 sibling, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-28  8:20 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 21, 2025 at 02:08:06PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The out tag in __es_remove_extent() is just return err value, we can
> return it directly if something bad happens. Therefore, remove the
> useless out tag and rename out_get_reserved to out.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/extents_status.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index e04fbf10fe4f..04d56f8f6c0c 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -1434,7 +1434,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  	struct extent_status orig_es;
>  	ext4_lblk_t len1, len2;
>  	ext4_fsblk_t block;
> -	int err = 0;
> +	int err;
>  	bool count_reserved = true;
>  	struct rsvd_count rc;
>  
> @@ -1443,9 +1443,9 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  
>  	es = __es_tree_search(&tree->root, lblk);
>  	if (!es)
> -		goto out;
> +		return 0;
>  	if (es->es_lblk > end)
> -		goto out;
> +		return 0;
>  
>  	/* Simply invalidate cache_es. */
>  	tree->cache_es = NULL;
> @@ -1480,7 +1480,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  
>  				es->es_lblk = orig_es.es_lblk;
>  				es->es_len = orig_es.es_len;
> -				goto out;
> +				return err;
>  			}
>  		} else {
>  			es->es_lblk = end + 1;
> @@ -1494,7 +1494,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  		if (count_reserved)
>  			count_rsvd(inode, orig_es.es_lblk + len1,
>  				   orig_es.es_len - len1 - len2, &orig_es, &rc);
> -		goto out_get_reserved;
> +		goto out;
>  	}
>  
>  	if (len1 > 0) {
> @@ -1536,11 +1536,10 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  		}
>  	}
>  
> -out_get_reserved:
> +out:
>  	if (count_reserved)
>  		*reserved = get_rsvd(inode, end, es, &rc);
> -out:
> -	return err;
> +	return 0;
>  }
>  
>  /*
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-28  3:45     ` Zhang Yi
@ 2025-11-28 10:58       ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2025-11-28 10:58 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri 28-11-25 11:45:51, Zhang Yi wrote:
> On 11/27/2025 9:41 PM, Jan Kara wrote:
> > I think the code would be much clearer
> > if we just centralized all the zeroing in ext4_split_extent(). At that
> > place the situation is actually pretty simple:
> 
> Thank you for your suggestion!
> 
> > 
> > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > we want to convert to initialized (generally IO completion situation) => we
> > can zero out boundaries if they are smaller than max_zeroout or if extent
> > split fails.
> > 
> 
> Yes. Agree.
> 
> > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > submission) => the split is opportunistic here, if we cannot split due to
> > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > needed.
> 
> Yes. At the same time, if we can indeed move the entire split unwritten
> operation to be handled after I/O completion in the future, it would also be
> more convenient to remove this segment of logic.

Yes.

> > 3) 'ex' is written, 'map' describes part that should be converted to
> > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > if extent split fails.
> 
> This makes sense to me! This case it originates from the fallocate with Zero
> Range operation. Currently, the zero-out operation will not be performed if
> the split operation fails, instead, it immediately returns a failure.
> 
> I agree with you that we can do zero out if the 'map' part smaller than
> max_zeroout instead of split extents. However, if the 'map' part is bigger
> than max_zeroout and if extent split fails, I don't think zero out is a good
> idea, Because it might cause zero-range calls to take a long time to execute.
> Although fallocate doesn't explicitly specify how ZERO_RANGE should be
> implemented, users expect it to be very fast. Therefore, in this case, if the
> split fails, it would be better to simply return an error, leave things as
> they are. What do you think?

True. Just returning the error is a good option in this case.

> > This should all result in a relatively straightforward code where we can
> > distinguish the three cases based on 'ex' and passed flags, we should be
> > able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> > drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> > the data exposure issues at the same time. What do you think? Am I missing
> > some case?
> 
> Indeed, I think the overall solution is a nice cleanup idea. :-)
> But this would involve a significant amount of refactoring and logical changes.
> Could we first merge the current set of patches(it could be more easier to
> backport to the early LTS version), and then I can start a new series to
> address this optimization?

I agree the changes are rather intrusive and the code is complex. Since you
have direct fixes already written, let's merge them first and cleanup
afterwards as you suggest.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-28  7:28     ` Ojaswin Mujoo
@ 2025-11-28 11:14       ` Jan Kara
  2025-11-28 14:20         ` Ojaswin Mujoo
  2025-11-28 19:52         ` Andreas Dilger
  0 siblings, 2 replies; 52+ messages in thread
From: Jan Kara @ 2025-11-28 11:14 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > Good catch on the data exposure issue! First I'd like to discuss whether
> > there isn't a way to fix these problems in a way that doesn't make the
> > already complex code even more complex. My observation is that
> > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > actual extent zeroing happens in ext4_split_extent_at() and in
> > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > if we just centralized all the zeroing in ext4_split_extent(). At that
> > place the situation is actually pretty simple:
> 
> This is exactly what I was playing with in my local tree to refactor this
> particular part of code :). I agree that ext4_split_extent() is a much
> better place to do the zeroout and it looks much cleaner but I agree
> with Yi that it might be better to do it after fixing the stale
> exposures so backports are straight forward. 
> 
> Am I correct in understanding that you are suggesting to zeroout
> proactively if we are below max_zeroout before even trying to extent
> split (which seems be done in ext4_ext_convert_to_initialized() as well)?

Yes. I was suggesting to effectively keep the behavior from
ext4_ext_convert_to_initialized().

> In this case, I have 2 concerns:
> 
> > 
> > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > we want to convert to initialized (generally IO completion situation) => we
> > can zero out boundaries if they are smaller than max_zeroout or if extent
> > split fails.
> 
> Firstly, I know you mentioned in another email that zeroout of small ranges
> gives us a performance win but is it really faster on average than
> extent manipulation?

I guess it depends on the storage and the details of the extent tree. But
it definitely does help in cases like when you have large unwritten extent
and then start writing randomly 4k blocks into it because this zeroout
logic effectively limits the fragmentation of the extent tree. Overall
sequentially writing a few blocks more of zeros is very cheap practically
with any storage while fragmenting the extent tree becomes expensive rather
quickly (you generally get deeper extent tree due to smaller extents etc.).

> For example, for case 1 where both zeroout and splitting need
> journalling, I understand that splitting has high journal overhead in worst case,
> where tree might grow, but more often than not we would be manipulating
> within the same leaf so journalling only 1 bh (same as zeroout). In which case
> seems like zeroout might be slower no matter how fast the IO can be
> done. So proactive zeroout might be for beneficial for case 3 than case
> 1.

I agree that initially while the split extents still fit into the same leaf
block, zero out is likely to be somewhat slower but over the longer term
the gains from less extent fragmentation win.

> > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > submission) => the split is opportunistic here, if we cannot split due to
> > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > needed.
> > 
> > 3) 'ex' is written, 'map' describes part that should be converted to
> > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > if extent split fails.
> 
> Proactive zeroout before trying split does seem benficial to help us
> avoid journal overhead for split. However, judging from
> ext4_ext_convert_to_initialized(), max zeroout comes from
> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> the IO device, so that means theres a chance a zeroout might be pretty
> slow if say we are doing it on a device than doesn't support accelerated
> zeroout operations. Maybe we need to be more intelligent in setting
> s_extent_max_zeroout_kb?

You can also tune the value in sysfs. I'm not 100% sure how the kernel
could do a better guess. Also I think 32k works mostly because it is small
enough to be cheap to write but already large enough to noticeably reduce
fragmentation for some pathological workloads (you can easily get 1/4 of
the extents than without this logic). But I'm open to ideas if you have
some.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-28 11:14       ` Jan Kara
@ 2025-11-28 14:20         ` Ojaswin Mujoo
  2025-11-28 19:52         ` Andreas Dilger
  1 sibling, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-28 14:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 28, 2025 at 12:14:56PM +0100, Jan Kara wrote:
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> > On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > > Good catch on the data exposure issue! First I'd like to discuss whether
> > > there isn't a way to fix these problems in a way that doesn't make the
> > > already complex code even more complex. My observation is that
> > > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > > actual extent zeroing happens in ext4_split_extent_at() and in
> > > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > > if we just centralized all the zeroing in ext4_split_extent(). At that
> > > place the situation is actually pretty simple:
> > 
> > This is exactly what I was playing with in my local tree to refactor this
> > particular part of code :). I agree that ext4_split_extent() is a much
> > better place to do the zeroout and it looks much cleaner but I agree
> > with Yi that it might be better to do it after fixing the stale
> > exposures so backports are straight forward. 
> > 
> > Am I correct in understanding that you are suggesting to zeroout
> > proactively if we are below max_zeroout before even trying to extent
> > split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> 
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
> 
> > In this case, I have 2 concerns:
> > 
> > > 
> > > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > > we want to convert to initialized (generally IO completion situation) => we
> > > can zero out boundaries if they are smaller than max_zeroout or if extent
> > > split fails.
> > 
> > Firstly, I know you mentioned in another email that zeroout of small ranges
> > gives us a performance win but is it really faster on average than
> > extent manipulation?
> 
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have large unwritten extent
> and then start writing randomly 4k blocks into it because this zeroout
> logic effectively limits the fragmentation of the extent tree. Overall
> sequentially writing a few blocks more of zeros is very cheap practically
> with any storage while fragmenting the extent tree becomes expensive rather
> quickly (you generally get deeper extent tree due to smaller extents etc.).

Got it, makes sense. This approach is definitely worth a try and then
maybe we can run a few benchmarks and see how proactive zeroout works.

> 
> > For example, for case 1 where both zeroout and splitting need
> > journalling, I understand that splitting has high journal overhead in worst case,
> > where tree might grow, but more often than not we would be manipulating
> > within the same leaf so journalling only 1 bh (same as zeroout). In which case
> > seems like zeroout might be slower no matter how fast the IO can be
> > done. So proactive zeroout might be for beneficial for case 3 than case
> > 1.
> 
> I agree that initially while the split extents still fit into the same leaf
> block, zero out is likely to be somewhat slower but over the longer term
> the gains from less extent fragmentation win.
> 
> > > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > > submission) => the split is opportunistic here, if we cannot split due to
> > > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > > needed.
> > > 
> > > 3) 'ex' is written, 'map' describes part that should be converted to
> > > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > > if extent split fails.
> > 
> > Proactive zeroout before trying split does seem benficial to help us
> > avoid journal overhead for split. However, judging from
> > ext4_ext_convert_to_initialized(), max zeroout comes from
> > sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> > the IO device, so that means theres a chance a zeroout might be pretty
> > slow if say we are doing it on a device than doesn't support accelerated
> > zeroout operations. Maybe we need to be more intelligent in setting
> > s_extent_max_zeroout_kb?
> 
> You can also tune the value in sysfs. I'm not 100% sure how the kernel

Yeah but I feel the average users dont usually tune sysfs values

> could do a better guess. Also I think 32k works mostly because it is small
> enough to be cheap to write but already large enough to noticeably reduce
> fragmentation for some pathological workloads (you can easily get 1/4 of
> the extents than without this logic). But I'm open to ideas if you have
> some.

Yes, Im yet to check in depth about it but the block layer does advertise a
max_write_zeroes_sectors() which might help us tune the value a better,
and perhaps even support higher zeroout without losing performance. I'll
look into it a bit more.

Regards,
ojaswin

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
                   ` (13 preceding siblings ...)
  2025-11-23 10:55 ` [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Ojaswin Mujoo
@ 2025-11-28 16:49 ` Theodore Tso
  2025-11-29  1:32   ` Zhang Yi
  14 siblings, 1 reply; 52+ messages in thread
From: Theodore Tso @ 2025-11-28 16:49 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel, jack,
	yi.zhang, yizhang089, libaokun1, yangerkun

Thanks to Jan and Ojaswin for their comments and reviews on this patch set.

As you may have noticed this version of the patchset is currently in
the ext4.git tree, and has shown up in linux-next.

Given that we have some suggested changes, there are a couple of ways
we can handle this:

1) Zhang Yi could produce a new version of the patch set, and I'll
replace the v2 version of the patch set currently in the ext4.git tree
with a newer version.

2) We could append fix-up patches to the ext4.git tree to reflect
those changes.

3) I could drop the patch set entirely and we apply a later version
of the patch series after the merge window.

What are folks' preferences?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-28 11:14       ` Jan Kara
  2025-11-28 14:20         ` Ojaswin Mujoo
@ 2025-11-28 19:52         ` Andreas Dilger
  2025-11-29 18:41           ` Ojaswin Mujoo
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Dilger @ 2025-11-28 19:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ojaswin Mujoo, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, yi.zhang, yizhang089, libaokun1, yangerkun



> On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
>> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
>>> Good catch on the data exposure issue! First I'd like to discuss whether
>>> there isn't a way to fix these problems in a way that doesn't make the
>>> already complex code even more complex. My observation is that
>>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
>>> in ext4_split_convert_extents() which both call ext4_split_extent(). The
>>> actual extent zeroing happens in ext4_split_extent_at() and in
>>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
>>> if we just centralized all the zeroing in ext4_split_extent(). At that
>>> place the situation is actually pretty simple:
>> 
>> This is exactly what I was playing with in my local tree to refactor this
>> particular part of code :). I agree that ext4_split_extent() is a much
>> better place to do the zeroout and it looks much cleaner but I agree
>> with Yi that it might be better to do it after fixing the stale
>> exposures so backports are straight forward. 
>> 
>> Am I correct in understanding that you are suggesting to zeroout
>> proactively if we are below max_zeroout before even trying to extent
>> split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> 
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
> 
>> In this case, I have 2 concerns:
>> 
>>> 
>>> 1) 'ex' is unwritten, 'map' describes part with already written data which
>>> we want to convert to initialized (generally IO completion situation) => we
>>> can zero out boundaries if they are smaller than max_zeroout or if extent
>>> split fails.
>> 
>> Firstly, I know you mentioned in another email that zeroout of small ranges
>> gives us a performance win but is it really faster on average than
>> extent manipulation?
> 
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have large unwritten extent
> and then start writing randomly 4k blocks into it because this zeroout
> logic effectively limits the fragmentation of the extent tree. Overall
> sequentially writing a few blocks more of zeros is very cheap practically
> with any storage while fragmenting the extent tree becomes expensive rather
> quickly (you generally get deeper extent tree due to smaller extents etc.).

The zeroout logic is not primarily an issue with the extent tree complexity.
I agree with Ojaswin that in the common case the extent split would not
cause a new index block to be written, though it can become unwieldy in the
extreme case.

As Jan wrote, the main performance win is to avoid writing a bunch of
small discontiguous blocks.  For HDD *and* flash, the overhead of writing
several separate small blocks is much higher than writing a single 32KiB
or 64KiB block to the storage.  Multiple separate blocks means more items
in the queue and submitted to storage, separate seeks on an HDD and/or read-
modify-write on a RAID controller, or erase blocks on a flash device.

It also defers the conversion of those unwritten extents to a later time,
when they would need to be processed again anyway if the blocks were written.

I was also considering whether the unwritten blocks would save on reads,
but I suspect that would not be the case either.  Doing sequential reads
would need to submit multiple small reads to the device and then zeroout
the unwritten blocks instead of a single 32KiB or 64KiB read (which is
basically free once the request is processed.

> 
>> For example, for case 1 where both zeroout and splitting need
>> journalling, I understand that splitting has high journal overhead in
>> worst case, where tree might grow, but more often than not we would be
>> manipulating within the same leaf so journalling only 1 bh (same as
>> zeroout). In which case seems like zeroout might be slower no matter
>> how fast the IO can be done. So proactive zeroout might be for beneficial
>> for case 3 than case 1.
> 
> I agree that initially while the split extents still fit into the same leaf
> block, zero out is likely to be somewhat slower but over the longer term
> the gains from less extent fragmentation win.

I doubt that writing a single contiguous 32KiB chunk is ever going to be
slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
if it was NVRAM, but I don't think that would be the common case?

>>> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
>>> submission) => the split is opportunistic here, if we cannot split due to
>>> ENOSPC, just go on and deal with it at IO completion time. No zeroing
>>> needed.
>>> 
>>> 3) 'ex' is written, 'map' describes part that should be converted to
>>> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
>>> if extent split fails.
>> 
>> Proactive zeroout before trying split does seem benficial to help us
>> avoid journal overhead for split. However, judging from
>> ext4_ext_convert_to_initialized(), max zeroout comes from
>> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
>> the IO device, so that means theres a chance a zeroout might be pretty
>> slow if say we are doing it on a device than doesn't support accelerated
>> zeroout operations. Maybe we need to be more intelligent in setting
>> s_extent_max_zeroout_kb?
> 
> You can also tune the value in sysfs. I'm not 100% sure how the kernel
> could do a better guess. Also I think 32k works mostly because it is small
> enough to be cheap to write but already large enough to noticeably reduce
> fragmentation for some pathological workloads (you can easily get 1/4 of
> the extents than without this logic). But I'm open to ideas if you have
> some.

Aligning this size with the flash erase block size might be a win?
It may be that 32KiB is still large enough today (I've heard of 16KiB
sector flash devices arriving soon, and IIRC 64KiB sectors are the
norm for HDDs if anyone still cares).  Having this tuned automatically
by the physical device characteristics (like max(32KiB, sector size) or
similar if the flash erase block size is available somehow in the kernel)
would future proof this as device sizes continue to grow.


Cheers, Andreas






^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-28 16:49 ` Theodore Tso
@ 2025-11-29  1:32   ` Zhang Yi
  2025-11-29  3:52     ` Theodore Tso
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang Yi @ 2025-11-29  1:32 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel, jack,
	yi.zhang, yizhang089, libaokun1, yangerkun

Hi, Ted!

On 11/29/2025 12:49 AM, Theodore Tso wrote:
> Thanks to Jan and Ojaswin for their comments and reviews on this patch set.
> 
> As you may have noticed this version of the patchset is currently in
> the ext4.git tree, and has shown up in linux-next.

The ext4.git tree[1] shows that only the first three patches from this
v2 version have been merged, possibly because the fourth patch conflicted
with ErKun's patch[2]. I've written a new version locally, adding two fix
patches for the three merged patches, and then rebase the subsequent
patches and modify them directly. I can send them out after texting.
Is this okay?

Thanks,
Yi.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
[2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=dac092195b6a35bc7c9f11e2884cfecb1b25e20c

> 
> Given that we have some suggested changes, there are a couple of ways
> we can handle this:
> 
> 1) Zhang Yi could produce a new version of the patch set, and I'll
> replace the v2 version of the patch set currently in the ext4.git tree
> with a newer version.
> 
> 2) We could append fix-up patches to the ext4.git tree to reflect
> those changes.
> 
> 3) I could drop the patch set entirely and we apply a later version
> of the patch series after the merge window.
> 
> What are folks' preferences?
> 
> Thanks,
> 
> 						- Ted


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 07/13] ext4: drop extent cache before splitting extent
  2025-11-28  8:16       ` Ojaswin Mujoo
@ 2025-11-29  1:36         ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-29  1:36 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, yi.zhang, yizhang089, libaokun1, yangerkun

On 11/28/2025 4:16 PM, Ojaswin Mujoo wrote:
> On Thu, Nov 27, 2025 at 03:27:26PM +0800, Zhang Yi wrote:
>> On 11/26/2025 8:24 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:08:05PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
[...]
>>>>
>>>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>>>> index 2b5aec3f8882..9bb80af4b5cf 100644
>>>> --- a/fs/ext4/extents.c
>>>> +++ b/fs/ext4/extents.c
>>>> @@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>>>  	ee_len = ext4_ext_get_actual_len(ex);
>>>>  	unwritten = ext4_ext_is_unwritten(ex);
>>>>  
>>>> +	/*
>>>> +	 * Drop extent cache to prevent stale unwritten extents remaining
>>>> +	 * after zeroing out.
>>>> +	 */
>>>> +	ext4_es_remove_extent(inode, ee_block, ee_len);
>>>> +
> 
> Okay this makes sense, there are many different combinations of how the
> on disk extents might turn out and if will become complicated to keep
> the es in sync to those, so this seems simpler.
> 
> There might be a small performance penalty of dropping the es here tho
> but idk if it's anything measurable. As a middle ground do you think it
> makes more sense to drop the es cache in ext4_split_extent_at() instead,
> when we know we are about to go for zeroout. Since the default non
> zeroout path seems to be okay?
> 
> Regards,
> ojaswin
> 
> 

Yes, this makes sense to me! I will move it to ext4_split_extent_at()
in my next iteration.

Thanks,
Yi.

> 
>>>>  	/* Do not cache extents that are in the process of being modified. */
>>>>  	flags |= EXT4_EX_NOCACHE;
>>>>  
>>>> -- 
>>>> 2.46.1
>>>>
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-29  1:32   ` Zhang Yi
@ 2025-11-29  3:52     ` Theodore Tso
  2025-11-29  4:44       ` Zhang Yi
  0 siblings, 1 reply; 52+ messages in thread
From: Theodore Tso @ 2025-11-29  3:52 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel, jack,
	yi.zhang, yizhang089, libaokun1, yangerkun

On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote:
> 
> The ext4.git tree[1] shows that only the first three patches from this
> v2 version have been merged, possibly because the fourth patch conflicted
> with ErKun's patch[2].

Yeah, oops.  Sorry about that.  I think I had checked some other
branch via a git reset --hard, and forgot that I had accidentally
aborted the git am after the patch conflict.

I've rearranged the dev brach so that those first three patches are
not at the tip of the dev branch.  So if you want to grab the dev
branch, and then rebase your new series on top of commit 91ef18b567da,
you can do that.

* 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago)
* 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago)
* a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago)
* 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago)

Note that the merge window opens on this coming Sunday, but a good
number of these patches are bug fixes (and not in the sense of adding
a new feature :-), so I'd prefer to land them this cycle if possible.

> I've written a new version locally, adding two fix
> patches for the three merged patches, and then rebase the subsequent
> patches and modify them directly. I can send them out after texting.
> Is this okay?

Why don't you modify your series so it applies on top of 91ef18b567da,
so we don't hace to have the fixup patches, and I may just simply push
everything up to 91ef18b567da to Linus, and we can then just apply the
next version of your series on top of that commit, and I'll push it to
Linus right after -rc1.

Does that seem workable to you?

							- Ted

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
  2025-11-29  3:52     ` Theodore Tso
@ 2025-11-29  4:44       ` Zhang Yi
  0 siblings, 0 replies; 52+ messages in thread
From: Zhang Yi @ 2025-11-29  4:44 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel, jack,
	yi.zhang, yizhang089, libaokun1, yangerkun

On 11/29/2025 11:52 AM, Theodore Tso wrote:
> On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote:
>>
>> The ext4.git tree[1] shows that only the first three patches from this
>> v2 version have been merged, possibly because the fourth patch conflicted
>> with ErKun's patch[2].
> 
> Yeah, oops.  Sorry about that.  I think I had checked some other
> branch via a git reset --hard, and forgot that I had accidentally
> aborted the git am after the patch conflict.
> 
> I've rearranged the dev brach so that those first three patches are
> not at the tip of the dev branch.  So if you want to grab the dev
> branch, and then rebase your new series on top of commit 91ef18b567da,
> you can do that.
> 
> * 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago)
> * 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago)
> * a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago)
> * 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago)
> 
> Note that the merge window opens on this coming Sunday, but a good
> number of these patches are bug fixes (and not in the sense of adding
> a new feature :-), so I'd prefer to land them this cycle if possible.
> 
>> I've written a new version locally, adding two fix
>> patches for the three merged patches, and then rebase the subsequent
>> patches and modify them directly. I can send them out after texting.
>> Is this okay?
> 
> Why don't you modify your series so it applies on top of 91ef18b567da,
> so we don't hace to have the fixup patches, and I may just simply push
> everything up to 91ef18b567da to Linus, and we can then just apply the
> next version of your series on top of that commit, and I'll push it to
> Linus right after -rc1.
> 
> Does that seem workable to you?
> 
> 							- Ted

Sure, I will rebase my series on top of 91ef18b567da and send out today.

Cheers,
Yi.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  2025-11-28 19:52         ` Andreas Dilger
@ 2025-11-29 18:41           ` Ojaswin Mujoo
  0 siblings, 0 replies; 52+ messages in thread
From: Ojaswin Mujoo @ 2025-11-29 18:41 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, yi.zhang, yizhang089, libaokun1, yangerkun

On Fri, Nov 28, 2025 at 12:52:30PM -0700, Andreas Dilger wrote:
> 
> 
> > On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> >> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> >>> Good catch on the data exposure issue! First I'd like to discuss whether
> >>> there isn't a way to fix these problems in a way that doesn't make the
> >>> already complex code even more complex. My observation is that
> >>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> >>> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> >>> actual extent zeroing happens in ext4_split_extent_at() and in
> >>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> >>> if we just centralized all the zeroing in ext4_split_extent(). At that
> >>> place the situation is actually pretty simple:
> >> 
> >> This is exactly what I was playing with in my local tree to refactor this
> >> particular part of code :). I agree that ext4_split_extent() is a much
> >> better place to do the zeroout and it looks much cleaner but I agree
> >> with Yi that it might be better to do it after fixing the stale
> >> exposures so backports are straight forward. 
> >> 
> >> Am I correct in understanding that you are suggesting to zeroout
> >> proactively if we are below max_zeroout before even trying to extent
> >> split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> > 
> > Yes. I was suggesting to effectively keep the behavior from
> > ext4_ext_convert_to_initialized().
> > 
> >> In this case, I have 2 concerns:
> >> 
> >>> 
> >>> 1) 'ex' is unwritten, 'map' describes part with already written data which
> >>> we want to convert to initialized (generally IO completion situation) => we
> >>> can zero out boundaries if they are smaller than max_zeroout or if extent
> >>> split fails.
> >> 
> >> Firstly, I know you mentioned in another email that zeroout of small ranges
> >> gives us a performance win but is it really faster on average than
> >> extent manipulation?
> > 
> > I guess it depends on the storage and the details of the extent tree. But
> > it definitely does help in cases like when you have large unwritten extent
> > and then start writing randomly 4k blocks into it because this zeroout
> > logic effectively limits the fragmentation of the extent tree. Overall
> > sequentially writing a few blocks more of zeros is very cheap practically
> > with any storage while fragmenting the extent tree becomes expensive rather
> > quickly (you generally get deeper extent tree due to smaller extents etc.).
> 
> The zeroout logic is not primarily an issue with the extent tree complexity.
> I agree with Ojaswin that in the common case the extent split would not
> cause a new index block to be written, though it can become unwieldy in the
> extreme case.
> 
> As Jan wrote, the main performance win is to avoid writing a bunch of
> small discontiguous blocks.  For HDD *and* flash, the overhead of writing
> several separate small blocks is much higher than writing a single 32KiB
> or 64KiB block to the storage.  Multiple separate blocks means more items
> in the queue and submitted to storage, separate seeks on an HDD and/or read-
> modify-write on a RAID controller, or erase blocks on a flash device.

Okay makes sense, so basically zeroout helps in the long run by reducing
fragmentation.

> 
> It also defers the conversion of those unwritten extents to a later time,
> when they would need to be processed again anyway if the blocks were written.
> 
> I was also considering whether the unwritten blocks would save on reads,
> but I suspect that would not be the case either.  Doing sequential reads
> would need to submit multiple small reads to the device and then zeroout
> the unwritten blocks instead of a single 32KiB or 64KiB read (which is
> basically free once the request is processed.
> 
> > 
> >> For example, for case 1 where both zeroout and splitting need
> >> journalling, I understand that splitting has high journal overhead in
> >> worst case, where tree might grow, but more often than not we would be
> >> manipulating within the same leaf so journalling only 1 bh (same as
> >> zeroout). In which case seems like zeroout might be slower no matter
> >> how fast the IO can be done. So proactive zeroout might be for beneficial
> >> for case 3 than case 1.
> > 
> > I agree that initially while the split extents still fit into the same leaf
> > block, zero out is likely to be somewhat slower but over the longer term
> > the gains from less extent fragmentation win.
> 
> I doubt that writing a single contiguous 32KiB chunk is ever going to be
> slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
> if it was NVRAM, but I don't think that would be the common case?
> 
> >>> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> >>> submission) => the split is opportunistic here, if we cannot split due to
> >>> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> >>> needed.
> >>> 
> >>> 3) 'ex' is written, 'map' describes part that should be converted to
> >>> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> >>> if extent split fails.
> >> 
> >> Proactive zeroout before trying split does seem benficial to help us
> >> avoid journal overhead for split. However, judging from
> >> ext4_ext_convert_to_initialized(), max zeroout comes from
> >> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> >> the IO device, so that means theres a chance a zeroout might be pretty
> >> slow if say we are doing it on a device than doesn't support accelerated
> >> zeroout operations. Maybe we need to be more intelligent in setting
> >> s_extent_max_zeroout_kb?
> > 
> > You can also tune the value in sysfs. I'm not 100% sure how the kernel
> > could do a better guess. Also I think 32k works mostly because it is small
> > enough to be cheap to write but already large enough to noticeably reduce
> > fragmentation for some pathological workloads (you can easily get 1/4 of
> > the extents than without this logic). But I'm open to ideas if you have
> > some.
> 
> Aligning this size with the flash erase block size might be a win?
> It may be that 32KiB is still large enough today (I've heard of 16KiB
> sector flash devices arriving soon, and IIRC 64KiB sectors are the
> norm for HDDs if anyone still cares).  Having this tuned automatically
> by the physical device characteristics (like max(32KiB, sector size) or
> similar if the flash erase block size is available somehow in the kernel)
> would future proof this as device sizes continue to grow.

okay so do you think it makes sense to consider the
max_write_zeroes_sectors() value of the underlying device. Enterprise
NVMEs and SCSI disks can utilize WRITE_ZEROES or WRITE_SAME which can
allow us to potentially zeroout much larger ranges with minimal overhead
of data movement. Maybe we can leverage these to zeroout more
aggressively without much of a performance penalty.

Let me try to see if i can find a disk and try a POC for this.

Regards,
ojaswin

> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2025-11-29 18:42 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
2025-11-26 11:26   ` Ojaswin Mujoo
2025-11-27 12:02   ` Jan Kara
2025-11-28  2:22     ` Zhang Yi
2025-11-28  1:54   ` Baokun Li
2025-11-21  6:08 ` [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1 Zhang Yi
2025-11-26 11:27   ` Ojaswin Mujoo
2025-11-26 11:55     ` Ojaswin Mujoo
2025-11-27  6:09       ` Zhang Yi
2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
2025-11-26 11:29   ` Ojaswin Mujoo
2025-11-27  6:13     ` Zhang Yi
2025-11-27 13:41   ` Jan Kara
2025-11-28  3:45     ` Zhang Yi
2025-11-28 10:58       ` Jan Kara
2025-11-28  7:28     ` Ojaswin Mujoo
2025-11-28 11:14       ` Jan Kara
2025-11-28 14:20         ` Ojaswin Mujoo
2025-11-28 19:52         ` Andreas Dilger
2025-11-29 18:41           ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O Zhang Yi
2025-11-26 11:50   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed Zhang Yi
2025-11-26 11:56   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 06/13] ext4: don't cache extent during splitting extent Zhang Yi
2025-11-26 12:04   ` Ojaswin Mujoo
2025-11-27  7:01     ` Zhang Yi
2025-11-28  8:18       ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 07/13] ext4: drop extent cache before " Zhang Yi
2025-11-26 12:24   ` Ojaswin Mujoo
2025-11-27  7:27     ` Zhang Yi
2025-11-28  8:16       ` Ojaswin Mujoo
2025-11-29  1:36         ` Zhang Yi
2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
2025-11-27 12:43   ` Jan Kara
2025-11-28  8:20   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 09/13] ext4: make __es_remove_extent() check extent status Zhang Yi
2025-11-21  6:08 ` [PATCH v2 10/13] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
2025-11-21  6:08 ` [PATCH v2 11/13] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
2025-11-21  6:08 ` [PATCH v2 12/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-11-21  6:08 ` [PATCH v2 13/13] ext4: drop the TODO comment in ext4_es_insert_extent() Zhang Yi
2025-11-23 10:55 ` [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Ojaswin Mujoo
2025-11-24  5:04   ` Zhang Yi
2025-11-24 12:50     ` Ojaswin Mujoo
2025-11-24 14:05       ` Zhang Yi
2025-11-27 12:24     ` Jan Kara
2025-11-28  4:37       ` Zhang Yi
2025-11-28 16:49 ` Theodore Tso
2025-11-29  1:32   ` Zhang Yi
2025-11-29  3:52     ` Theodore Tso
2025-11-29  4:44       ` Zhang Yi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).