[PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES
@ 2026-07-01 14:20 yizhang089
  2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Hi,

The current FALLOC_FL_WRITE_ZEROES implementation in ext4 does not
correctly handle unaligned edge blocks. FALLOC_FL_WRITE_ZEROES is
needed to convert the requested range to written extents with zeroed
content, but the existing code only partially zero the edges and never
guarantees that:

  1) edges that are clean unwritten extents or holes get promoted to
     written extents, and
  2) edges that are dirty unwritten or delalloc get their underlying
     extents converted to written before the syscall returns.

Both cases leave the on-disk extent type unwritten, violating the
WRITE_ZEROES contract, and in these cases a subsequent sync buffered
overwrite would still observe pending metadata changes from the
conversion work that should have been completed by WRITE_ZEROES itself.

This series fixes the unaligned edge handling and cleans up a related
redundant i_disksize update. The first three patches are preparatory:
moving ext4_zero_partial_blocks() earlier in ext4_zero_range(),
documenting the return semantics of ext4_load_tail_bh(), and
replacing the single bool output of ext4_zero_partial_blocks() with a
per-edge bitmask (EXT4_PARTIAL_ZERO_START/END). The next two patches
fix the two scenarios above by expanding the aligned allocation range
outward for skipped (clean unwritten or hole) edges, and by writing
back partial-zeroed edges so the underlying extents are converted to
written. The final patch drops the now-redundant
ext4_update_disksize_before_punch() call on the WRITE_ZEROES path,
since WRITE_ZEROES is mutually exclusive with KEEP_SIZE and the
i_disksize update is already handled by ext4_alloc_file_blocks().

Thanks,
Yi.

Discussion Link:
  https://lore.kernel.org/linux-xfs/0b7f1a4f-da1c-4297-8099-98d738070ab7@huaweicloud.com/

Zhang Yi (6):
  ext4: move partial block zeroing earlier in ext4_zero_range()
  ext4: clarify return semantics of ext4_load_tail_bh()
  ext4: track partial-zero outcome per edge in
    ext4_zero_partial_blocks()
  ext4: zero out whole block for clean edges in WRITE_ZEROES
  ext4: write back partial-zeroed edges in WRITE_ZEROES
  ext4: skip ext4_update_disksize_before_punch() in WRITE_ZEROES

 fs/ext4/ext4.h    |  5 ++++-
 fs/ext4/extents.c | 47 +++++++++++++++++++++++++++++++++++++----------
 fs/ext4/inode.c   | 42 +++++++++++++++++++++++++++++++++++-------
 3 files changed, 76 insertions(+), 18 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range()
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-02  9:40   ` Jan Kara
  2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

In ext4_zero_range(), move the ext4_zero_partial_blocks() call, which
handles unaligned edges, into the same branch where the unaligned range
is preallocated, immediately after ext4_alloc_file_blocks(). This is
safe because there is no dependency between partial block handling and
the subsequent full block handling.

This change will be used by later patches that handle unaligned
FALLOC_FL_WRITE_ZEROES operations, which will need to check the partial
zeroed result.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 125f628e738a..27ca641e701b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4734,10 +4734,16 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	}
 
 	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
-	/* Preallocate the range including the unaligned edges */
+	/*
+	 * Preallocate the range including the unaligned edges, and zero
+	 * out partial blocks if they already contain data.
+	 */
 	if (!IS_ALIGNED(offset | end, blocksize)) {
 		ret = ext4_alloc_file_blocks(file, offset, len, new_size,
 					     flags);
+		if (!ret)
+			ret = ext4_zero_partial_blocks(inode, offset, len,
+						       &partial_zeroed);
 		if (ret)
 			return ret;
 	}
@@ -4770,10 +4776,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (IS_ALIGNED(offset | end, blocksize))
 		return ret;
 
-	/* Zero out partial block at the edges of the range */
-	ret = ext4_zero_partial_blocks(inode, offset, len, &partial_zeroed);
-	if (ret)
-		return ret;
 	if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) {
 		ret = filemap_write_and_wait_range(inode->i_mapping, offset,
 						   end - 1);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
  2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-02  9:53   ` Jan Kara
  2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

ext4_load_tail_bh() returns NULL for both holes and clean unwritten
buffers, but the conditions that lead to this are not obvious from the
code alone. Document this behavior to clarify the return value, so that
readers do not mistakenly assume that only holes result in a NULL
return.

Also update the inline comment following the ext4_get_block() call to
reflect this: both holes and clean unwritten buffers fall through to the
"nothing to do" path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..0b31fa873743 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
  * because it might have data in pagecache (eg, if called from ext4_zero_range,
  * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
  * racing writeback can come later and flush the stale pagecache to disk.
+ *
+ * Return the loaded bh if it actually needs zeroing - in written, dirty
+ * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
+ * a clean unwritten block).
  */
 static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
 {
@@ -4065,7 +4069,7 @@ static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
 	if (!buffer_mapped(bh)) {
 		BUFFER_TRACE(bh, "unmapped");
 		ext4_get_block(inode, iblock, bh, 0);
-		/* unmapped? It's a hole - nothing to do */
+		/* It's a hole or a clean unwritten block - nothing to do */
 		if (!buffer_mapped(bh)) {
 			BUFFER_TRACE(bh, "still unmapped");
 			goto unlock;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks()
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
  2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
  2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-02  9:57   ` Jan Kara
  2026-07-01 14:20 ` [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES yizhang089
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Replace the single bool did_zero output of ext4_zero_partial_blocks()
with a bitmask that records which edge (start, end, or both in the
single-block case) was actually partial-zeroed. This allows callers to
distinguish which edges have been zeroed, preparing for unaligned
FALLOC_FL_WRITE_ZEROES handling in later patches.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h    |  5 ++++-
 fs/ext4/extents.c |  2 +-
 fs/ext4/inode.c   | 36 ++++++++++++++++++++++++++++++------
 3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..b2e55876d0e3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3099,8 +3099,11 @@ extern int ext4_chunk_trans_extent(struct inode *inode, int nrblocks);
 extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
 				  int pextents);
 extern int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end);
+
+#define EXT4_PARTIAL_ZERO_START	0x1
+#define EXT4_PARTIAL_ZERO_END	0x2
 extern int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart,
-				    loff_t length, bool *did_zero);
+				    loff_t length, unsigned int *partial_zeroed);
 extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 27ca641e701b..7a8f9f188a29 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4715,7 +4715,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	loff_t align_start, align_end, new_size = 0;
 	loff_t end = offset + len;
 	unsigned int blocksize = i_blocksize(inode);
-	bool partial_zeroed = false;
+	unsigned int partial_zeroed = 0;
 	int ret, flags;
 
 	trace_ext4_zero_range(inode, offset, len, mode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0b31fa873743..d68f7be4ff02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4256,13 +4256,26 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
 	return 0;
 }
 
+/*
+ * Zero out the unaligned head and tail of the [lstart, lstart+length)
+ * range.
+ *
+ * On return, @partial_zeroed records which edges actually got
+ * partial-zeroed.  Set EXT4_PARTIAL_ZERO_START/EXT4_PARTIAL_ZERO_END if
+ * the head/tail block got actually partially zeroed (in written, dirty
+ * unwritten or delalloc state).  Cleared if the head/tail block is a
+ * hole or a clean unwritten block, in which case there is nothing needs
+ * to zero.  When the head and tail land in the same block, both bits
+ * are set together on a successful zero.
+ */
 int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length,
-			     bool *did_zero)
+			     unsigned int *partial_zeroed)
 {
 	struct super_block *sb = inode->i_sb;
 	unsigned partial_start, partial_end;
 	ext4_fsblk_t start, end;
 	loff_t byte_end = (lstart + length - 1);
+	bool did_zero = false;
 	int err = 0;
 
 	partial_start = lstart & (sb->s_blocksize - 1);
@@ -4274,21 +4287,32 @@ int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length,
 	/* Handle partial zero within the single block */
 	if (start == end &&
 	    (partial_start || (partial_end != sb->s_blocksize - 1))) {
-		err = ext4_block_zero_range(inode, lstart, length, did_zero,
+		err = ext4_block_zero_range(inode, lstart, length, &did_zero,
 					    NULL);
+		if (did_zero)
+			*partial_zeroed |= (EXT4_PARTIAL_ZERO_START |
+					    EXT4_PARTIAL_ZERO_END);
 		return err;
 	}
 	/* Handle partial zero out on the start of the range */
 	if (partial_start) {
 		err = ext4_block_zero_range(inode, lstart, sb->s_blocksize,
-					    did_zero, NULL);
+					    &did_zero, NULL);
 		if (err)
 			return err;
+		if (did_zero)
+			*partial_zeroed |= EXT4_PARTIAL_ZERO_START;
 	}
 	/* Handle partial zero out on the end of the range */
-	if (partial_end != sb->s_blocksize - 1)
+	if (partial_end != sb->s_blocksize - 1) {
+		did_zero = false;
 		err = ext4_block_zero_range(inode, byte_end - partial_end,
-					    partial_end + 1, did_zero, NULL);
+					    partial_end + 1, &did_zero, NULL);
+		if (err)
+			return err;
+		if (did_zero)
+			*partial_zeroed |= EXT4_PARTIAL_ZERO_END;
+	}
 	return err;
 }
 
@@ -4437,7 +4461,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 	loff_t end = offset + length;
 	handle_t *handle;
 	unsigned int credits;
-	bool partial_zeroed = false;
+	unsigned int partial_zeroed = 0;
 	int ret;
 
 	trace_ext4_punch_hole(inode, offset, length, 0);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
                   ` (2 preceding siblings ...)
  2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
  2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
  5 siblings, 0 replies; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

FALLOC_FL_WRITE_ZEROES requires that all blocks in the requested range
end up as written extents with zeroed content. For unaligned edges that
were already allocated, ext4_zero_partial_blocks() zeros them directly.
However, for unaligned edges whose underlying extent is a clean
unwritten extent or a hole, the extent type remains unwritten after
partial zeroing, which does not align with the semantics of
WRITE_ZEROES.

Therefore, when ext4_zero_partial_blocks() skips partial zeroing, it
indicates that the corresponding edges are clean unwritten extents or
holes. In this case, we need to expand the aligned allocation range
outward to cover such edges, so that ext4_alloc_file_blocks() can
correctly allocate blocks for the unaligned range. Edges that were
partial-zeroed (i.e., written or dirty) are left untouched.

Fixes: f4265b8d32c4 ("ext4: add FALLOC_FL_WRITE_ZEROES support")
Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7a8f9f188a29..c34505765521 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4760,6 +4760,21 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	/* Zero range excluding the unaligned edges */
 	align_start = round_up(offset, blocksize);
 	align_end = round_down(end, blocksize);
+
+	/*
+	 * In WRITE_ZEROES mode, edges that were not partial-zeroed (clean
+	 * unwritten or hole) must be allocated and zeroed as whole blocks.
+	 * Expand the aligned range outward to cover them.
+	 */
+	if (mode & FALLOC_FL_WRITE_ZEROES) {
+		if (!IS_ALIGNED(offset, blocksize) &&
+		    !(partial_zeroed & EXT4_PARTIAL_ZERO_START))
+			align_start = round_down(offset, blocksize);
+		if (!IS_ALIGNED(end, blocksize) &&
+		    !(partial_zeroed & EXT4_PARTIAL_ZERO_END))
+			align_end = round_up(end, blocksize);
+	}
+
 	if (align_end > align_start) {
 		if (mode & FALLOC_FL_WRITE_ZEROES)
 			flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/6] ext4: write back partial-zeroed edges in WRITE_ZEROES
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
                   ` (3 preceding siblings ...)
  2026-07-01 14:20 ` [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-02 10:05   ` Jan Kara
  2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
  5 siblings, 1 reply; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

FALLOC_FL_WRITE_ZEROES requires that all blocks in the requested range
end up as written extents with zeroed content. For unaligned edges that
were partial-zeroed in dirty unwritten or delalloc state, the buffer
is left dirty while the underlying extent may not yet be converted to
written. As a result, a subsequent SYNC write to this range would still
trigger metadata changes, which violates the semantics of WRITE_ZEROES.

Fix this by calling filemap_write_and_wait_range() for partial-zeroed
edges to flush out the zeroed data and ensure the extent conversion
is complete.

Fixes: f4265b8d32c4 ("ext4: add FALLOC_FL_WRITE_ZEROES support")
Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c34505765521..c083703bf704 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4791,7 +4791,15 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (IS_ALIGNED(offset | end, blocksize))
 		return ret;
 
-	if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) {
+	/*
+	 * In FALLOC_FL_WRITE_ZEROES mode, edges that have been partially
+	 * zeroed must be written back to ensure the entire zeroed range
+	 * is converted to the written state. In SYNC mode, writeback is
+	 * also required to persist the zeroed data to disk.
+	 */
+	if (partial_zeroed &&
+	    ((mode & FALLOC_FL_WRITE_ZEROES) ||
+	     (file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ret = filemap_write_and_wait_range(inode->i_mapping, offset,
 						   end - 1);
 		if (ret)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() in WRITE_ZEROES
  2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
                   ` (4 preceding siblings ...)
  2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
@ 2026-07-01 14:20 ` yizhang089
  2026-07-02 10:19   ` Jan Kara
  5 siblings, 1 reply; 17+ messages in thread
From: yizhang089 @ 2026-07-01 14:20 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, libaokun,
	jack, ojaswin, ritesh.list, yi.zhang, yi.zhang, yizhang089,
	chengzhihao1, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

FALLOC_FL_WRITE_ZEROES is mutually exclusive with FALLOC_FL_KEEP_SIZE.
In ext4_zero_range(), when end > i_disksize, new_size is always set to
end for WRITE_ZEROES, so the i_disksize update should not be lost.

The ext4_update_disksize_before_punch() call is only needed for the
ZERO_RANGE path in ext4_zero_range(), as it handles the case where
KEEP_SIZE may preserve the original disk size. Skip it for WRITE_ZEROES
to avoid unnecessary work and potential confusion.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c083703bf704..38753c2ef910 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4748,9 +4748,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 			return ret;
 	}
 
-	ret = ext4_update_disksize_before_punch(inode, offset, len);
-	if (ret)
-		return ret;
+	if (mode & FALLOC_FL_ZERO_RANGE) {
+		ret = ext4_update_disksize_before_punch(inode, offset, len);
+		if (ret)
+			return ret;
+	}
 
 	/* Now release the pages and zero block aligned part of pages */
 	ret = ext4_truncate_page_cache_block_range(inode, offset, end);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range()
  2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
@ 2026-07-02  9:40   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2026-07-02  9:40 UTC (permalink / raw)
  To: yizhang089
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yi.zhang,
	chengzhihao1, yangerkun

On Wed 01-07-26 22:20:04, yizhang089@gmail.com wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> In ext4_zero_range(), move the ext4_zero_partial_blocks() call, which
> handles unaligned edges, into the same branch where the unaligned range
> is preallocated, immediately after ext4_alloc_file_blocks(). This is
> safe because there is no dependency between partial block handling and
> the subsequent full block handling.
> 
> This change will be used by later patches that handle unaligned
> FALLOC_FL_WRITE_ZEROES operations, which will need to check the partial
> zeroed result.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 125f628e738a..27ca641e701b 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4734,10 +4734,16 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	}
>  
>  	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
> -	/* Preallocate the range including the unaligned edges */
> +	/*
> +	 * Preallocate the range including the unaligned edges, and zero
> +	 * out partial blocks if they already contain data.
> +	 */
>  	if (!IS_ALIGNED(offset | end, blocksize)) {
>  		ret = ext4_alloc_file_blocks(file, offset, len, new_size,
>  					     flags);
> +		if (!ret)
> +			ret = ext4_zero_partial_blocks(inode, offset, len,
> +						       &partial_zeroed);
>  		if (ret)
>  			return ret;
>  	}
> @@ -4770,10 +4776,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	if (IS_ALIGNED(offset | end, blocksize))
>  		return ret;
>  
> -	/* Zero out partial block at the edges of the range */
> -	ret = ext4_zero_partial_blocks(inode, offset, len, &partial_zeroed);
> -	if (ret)
> -		return ret;
>  	if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) {
>  		ret = filemap_write_and_wait_range(inode->i_mapping, offset,
>  						   end - 1);
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
@ 2026-07-02  9:53   ` Jan Kara
  2026-07-02 14:40     ` Zhang Yi
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2026-07-02  9:53 UTC (permalink / raw)
  To: yizhang089
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yi.zhang,
	chengzhihao1, yangerkun

On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
> buffers, but the conditions that lead to this are not obvious from the
> code alone. Document this behavior to clarify the return value, so that
> readers do not mistakenly assume that only holes result in a NULL
> return.
> 
> Also update the inline comment following the ext4_get_block() call to
> reflect this: both holes and clean unwritten buffers fall through to the
> "nothing to do" path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..0b31fa873743 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
>   * because it might have data in pagecache (eg, if called from ext4_zero_range,
>   * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
>   * racing writeback can come later and flush the stale pagecache to disk.
> + *
> + * Return the loaded bh if it actually needs zeroing - in written, dirty
> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
> + * a clean unwritten block).
>   */

Great that you're adding this comment because when I was reading previous
patch, I've spent like 10 minutes trying to figure that out :). But as far
as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
even if it is clean - map_bh() in _ext4_get_block() will set
buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
unwritten bh if things align wrongly...

								Honza

>  static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
>  {
> @@ -4065,7 +4069,7 @@ static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
>  	if (!buffer_mapped(bh)) {
>  		BUFFER_TRACE(bh, "unmapped");
>  		ext4_get_block(inode, iblock, bh, 0);
> -		/* unmapped? It's a hole - nothing to do */
> +		/* It's a hole or a clean unwritten block - nothing to do */
>  		if (!buffer_mapped(bh)) {
>  			BUFFER_TRACE(bh, "still unmapped");
>  			goto unlock;
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks()
  2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
@ 2026-07-02  9:57   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2026-07-02  9:57 UTC (permalink / raw)
  To: yizhang089
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yi.zhang,
	chengzhihao1, yangerkun

On Wed 01-07-26 22:20:06, yizhang089@gmail.com wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Replace the single bool did_zero output of ext4_zero_partial_blocks()
> with a bitmask that records which edge (start, end, or both in the
> single-block case) was actually partial-zeroed. This allows callers to
> distinguish which edges have been zeroed, preparing for unaligned
> FALLOC_FL_WRITE_ZEROES handling in later patches.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Just some text corrections below. Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0b31fa873743..d68f7be4ff02 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4256,13 +4256,26 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	return 0;
>  }
>  
> +/*
> + * Zero out the unaligned head and tail of the [lstart, lstart+length)
> + * range.
> + *
> + * On return, @partial_zeroed records which edges actually got
> + * partial-zeroed.  Set EXT4_PARTIAL_ZERO_START/EXT4_PARTIAL_ZERO_END if
> + * the head/tail block got actually partially zeroed (in written, dirty
> + * unwritten or delalloc state).  Cleared if the head/tail block is a
> + * hole or a clean unwritten block, in which case there is nothing needs
								      ^^^
that needs zeroing.

> + * to zero.  When the head and tail land in the same block, both bits
> + * are set together on a successful zero.
					^^^ zeroing

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/6] ext4: write back partial-zeroed edges in WRITE_ZEROES
  2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
@ 2026-07-02 10:05   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2026-07-02 10:05 UTC (permalink / raw)
  To: yizhang089
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yi.zhang,
	chengzhihao1, yangerkun

On Wed 01-07-26 22:20:08, yizhang089@gmail.com wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> FALLOC_FL_WRITE_ZEROES requires that all blocks in the requested range
> end up as written extents with zeroed content. For unaligned edges that
> were partial-zeroed in dirty unwritten or delalloc state, the buffer
> is left dirty while the underlying extent may not yet be converted to
> written. As a result, a subsequent SYNC write to this range would still
> trigger metadata changes, which violates the semantics of WRITE_ZEROES.
> 
> Fix this by calling filemap_write_and_wait_range() for partial-zeroed
> edges to flush out the zeroed data and ensure the extent conversion
> is complete.
> 
> Fixes: f4265b8d32c4 ("ext4: add FALLOC_FL_WRITE_ZEROES support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c34505765521..c083703bf704 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4791,7 +4791,15 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	if (IS_ALIGNED(offset | end, blocksize))
>  		return ret;
>  
> -	if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) {
> +	/*
> +	 * In FALLOC_FL_WRITE_ZEROES mode, edges that have been partially
> +	 * zeroed must be written back to ensure the entire zeroed range
> +	 * is converted to the written state. In SYNC mode, writeback is
> +	 * also required to persist the zeroed data to disk.
> +	 */
> +	if (partial_zeroed &&
> +	    ((mode & FALLOC_FL_WRITE_ZEROES) ||
> +	     (file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>  		ret = filemap_write_and_wait_range(inode->i_mapping, offset,
>  						   end - 1);
>  		if (ret)
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() in WRITE_ZEROES
  2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
@ 2026-07-02 10:19   ` Jan Kara
  2026-07-02 14:51     ` Zhang Yi
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2026-07-02 10:19 UTC (permalink / raw)
  To: yizhang089
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yi.zhang,
	chengzhihao1, yangerkun

On Wed 01-07-26 22:20:09, yizhang089@gmail.com wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> FALLOC_FL_WRITE_ZEROES is mutually exclusive with FALLOC_FL_KEEP_SIZE.
> In ext4_zero_range(), when end > i_disksize, new_size is always set to
> end for WRITE_ZEROES, so the i_disksize update should not be lost.
> 
> The ext4_update_disksize_before_punch() call is only needed for the
> ZERO_RANGE path in ext4_zero_range(), as it handles the case where
> KEEP_SIZE may preserve the original disk size. Skip it for WRITE_ZEROES
> to avoid unnecessary work and potential confusion.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Hum, but suppose i_disksize is before the start of our zeroed range, i_size
is beyond the end of the zeroed range. Now
ext4_alloc_file_blocks(EXT4_GET_BLOCKS_CREATE_ZERO) will create some zeroed
written extents beyond i_disksize and if we crash before
ext4_update_inode_size() runs, we have inconsistent filesystem with
written extents beyond i_disksize... Hum, actually that could have been the
case even before, just now there are more cases where this can happen. So I
think FALLOC_FL_WRITE_ZEROES mode needs to add the inode to the orphan list
to truncate the inode properly in case of crash.

								Honza

> ---
>  fs/ext4/extents.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c083703bf704..38753c2ef910 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4748,9 +4748,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  			return ret;
>  	}
>  
> -	ret = ext4_update_disksize_before_punch(inode, offset, len);
> -	if (ret)
> -		return ret;
> +	if (mode & FALLOC_FL_ZERO_RANGE) {
> +		ret = ext4_update_disksize_before_punch(inode, offset, len);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	/* Now release the pages and zero block aligned part of pages */
>  	ret = ext4_truncate_page_cache_block_range(inode, offset, end);
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-02  9:53   ` Jan Kara
@ 2026-07-02 14:40     ` Zhang Yi
  2026-07-02 16:43       ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2026-07-02 14:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, yi.zhang, yi.zhang, chengzhihao1,
	yangerkun

On 7/2/2026 5:53 PM, Jan Kara wrote:
> On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
>> buffers, but the conditions that lead to this are not obvious from the
>> code alone. Document this behavior to clarify the return value, so that
>> readers do not mistakenly assume that only holes result in a NULL
>> return.
>>
>> Also update the inline comment following the ext4_get_block() call to
>> reflect this: both holes and clean unwritten buffers fall through to the
>> "nothing to do" path.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>   fs/ext4/inode.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c2c2d6ac7f3d..0b31fa873743 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
>>    * because it might have data in pagecache (eg, if called from ext4_zero_range,
>>    * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
>>    * racing writeback can come later and flush the stale pagecache to disk.
>> + *
>> + * Return the loaded bh if it actually needs zeroing - in written, dirty
>> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
>> + * a clean unwritten block).
>>    */
> 
> Great that you're adding this comment because when I was reading previous
> patch, I've spent like 10 minutes trying to figure that out :). But as far
> as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
> even if it is clean - map_bh() in _ext4_get_block() will set
> buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
> seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
> unwritten bh if things align wrongly...
> 
> 								Honza

Thank you for the review!

Please note the ext4_update_bh_state(bh, map.m_flags) call in
_ext4_get_block() — it restores the mapped flag back to unwritten. As a
result, the !buffer_mapped(bh) check will evaluate to true for a clean
unwritten block, the function will return NULL.

Thanks,
Yi.

> 
>>   static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
>>   {
>> @@ -4065,7 +4069,7 @@ static struct buffer_head *ext4_load_tail_bh(struct inode *inode, loff_t from)
>>   	if (!buffer_mapped(bh)) {
>>   		BUFFER_TRACE(bh, "unmapped");
>>   		ext4_get_block(inode, iblock, bh, 0);
>> -		/* unmapped? It's a hole - nothing to do */
>> +		/* It's a hole or a clean unwritten block - nothing to do */
>>   		if (!buffer_mapped(bh)) {
>>   			BUFFER_TRACE(bh, "still unmapped");
>>   			goto unlock;
>> -- 
>> 2.53.0
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() in WRITE_ZEROES
  2026-07-02 10:19   ` Jan Kara
@ 2026-07-02 14:51     ` Zhang Yi
  0 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2026-07-02 14:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, yi.zhang, yi.zhang, chengzhihao1,
	yangerkun

On 7/2/2026 6:19 PM, Jan Kara wrote:
> On Wed 01-07-26 22:20:09, yizhang089@gmail.com wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> FALLOC_FL_WRITE_ZEROES is mutually exclusive with FALLOC_FL_KEEP_SIZE.
>> In ext4_zero_range(), when end > i_disksize, new_size is always set to
>> end for WRITE_ZEROES, so the i_disksize update should not be lost.
>>
>> The ext4_update_disksize_before_punch() call is only needed for the
>> ZERO_RANGE path in ext4_zero_range(), as it handles the case where
>> KEEP_SIZE may preserve the original disk size. Skip it for WRITE_ZEROES
>> to avoid unnecessary work and potential confusion.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Hum, but suppose i_disksize is before the start of our zeroed range, i_size
> is beyond the end of the zeroed range. Now
> ext4_alloc_file_blocks(EXT4_GET_BLOCKS_CREATE_ZERO) will create some zeroed
> written extents beyond i_disksize and if we crash before
> ext4_update_inode_size() runs, we have inconsistent filesystem with
> written extents beyond i_disksize... Hum, actually that could have been the
> case even before, just now there are more cases where this can happen. So I
> think FALLOC_FL_WRITE_ZEROES mode needs to add the inode to the orphan list
> to truncate the inode properly in case of crash.
> 
> 								Honza

Yes, indeed. I overlooked this case previously. I will address it in the
next version.

Thanks,
Yi.

> 
>> ---
>>   fs/ext4/extents.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index c083703bf704..38753c2ef910 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4748,9 +4748,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>>   			return ret;
>>   	}
>>   
>> -	ret = ext4_update_disksize_before_punch(inode, offset, len);
>> -	if (ret)
>> -		return ret;
>> +	if (mode & FALLOC_FL_ZERO_RANGE) {
>> +		ret = ext4_update_disksize_before_punch(inode, offset, len);
>> +		if (ret)
>> +			return ret;
>> +	}
>>   
>>   	/* Now release the pages and zero block aligned part of pages */
>>   	ret = ext4_truncate_page_cache_block_range(inode, offset, end);
>> -- 
>> 2.53.0
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-02 14:40     ` Zhang Yi
@ 2026-07-02 16:43       ` Jan Kara
  2026-07-03  3:39         ` Zhang Yi
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2026-07-02 16:43 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	yi.zhang, chengzhihao1, yangerkun

On Thu 02-07-26 22:40:17, Zhang Yi wrote:
> On 7/2/2026 5:53 PM, Jan Kara wrote:
> > On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > ext4_load_tail_bh() returns NULL for both holes and clean unwritten
> > > buffers, but the conditions that lead to this are not obvious from the
> > > code alone. Document this behavior to clarify the return value, so that
> > > readers do not mistakenly assume that only holes result in a NULL
> > > return.
> > > 
> > > Also update the inline comment following the ext4_get_block() call to
> > > reflect this: both holes and clean unwritten buffers fall through to the
> > > "nothing to do" path.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > >   fs/ext4/inode.c | 6 +++++-
> > >   1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index c2c2d6ac7f3d..0b31fa873743 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
> > >    * because it might have data in pagecache (eg, if called from ext4_zero_range,
> > >    * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
> > >    * racing writeback can come later and flush the stale pagecache to disk.
> > > + *
> > > + * Return the loaded bh if it actually needs zeroing - in written, dirty
> > > + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
> > > + * a clean unwritten block).
> > >    */
> > 
> > Great that you're adding this comment because when I was reading previous
> > patch, I've spent like 10 minutes trying to figure that out :). But as far
> > as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
> > even if it is clean - map_bh() in _ext4_get_block() will set
> > buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
> > seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
> > unwritten bh if things align wrongly...
> > 
> > 								Honza
> 
> Thank you for the review!
> 
> Please note the ext4_update_bh_state(bh, map.m_flags) call in
> _ext4_get_block() — it restores the mapped flag back to unwritten. As a
> result, the !buffer_mapped(bh) check will evaluate to true for a clean
> unwritten block, the function will return NULL.

Argh, right. But ext4_ext_map_blocks() sets both BH_Unwritten and BH_Mapped
in the returned map->m_flags for unwritten extents. So I think my comment
still applies. Plus there's a very strange inconsistency between this and
the lookup in extent status tree in ext4_map_blocks() where (as you
indicate) we only set BH_Unwritten but *not* BH_Mapped. AFAICT there's
something buggy in here :) Note that when we don't set BH_Mapped flag e.g.
from ext4_get_block_unwritten() (which gets used when delalloc is disabled),
then writes to unwritten extent will get lost because
ext4_bio_write_folio() will just treat them as holes...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-02 16:43       ` Jan Kara
@ 2026-07-03  3:39         ` Zhang Yi
  2026-07-03 16:01           ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2026-07-03  3:39 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, yi.zhang, chengzhihao1, yangerkun

On 7/3/2026 12:43 AM, Jan Kara wrote:
> On Thu 02-07-26 22:40:17, Zhang Yi wrote:
>> On 7/2/2026 5:53 PM, Jan Kara wrote:
>>> On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
>>>> buffers, but the conditions that lead to this are not obvious from the
>>>> code alone. Document this behavior to clarify the return value, so that
>>>> readers do not mistakenly assume that only holes result in a NULL
>>>> return.
>>>>
>>>> Also update the inline comment following the ext4_get_block() call to
>>>> reflect this: both holes and clean unwritten buffers fall through to the
>>>> "nothing to do" path.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/inode.c | 6 +++++-
>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index c2c2d6ac7f3d..0b31fa873743 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
>>>>    * because it might have data in pagecache (eg, if called from ext4_zero_range,
>>>>    * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
>>>>    * racing writeback can come later and flush the stale pagecache to disk.
>>>> + *
>>>> + * Return the loaded bh if it actually needs zeroing - in written, dirty
>>>> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
>>>> + * a clean unwritten block).
>>>>    */
>>>
>>> Great that you're adding this comment because when I was reading previous
>>> patch, I've spent like 10 minutes trying to figure that out :). But as far
>>> as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
>>> even if it is clean - map_bh() in _ext4_get_block() will set
>>> buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
>>> seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
>>> unwritten bh if things align wrongly...
>>>
>>> 								Honza
>>
>> Thank you for the review!
>>
>> Please note the ext4_update_bh_state(bh, map.m_flags) call in
>> _ext4_get_block() — it restores the mapped flag back to unwritten. As a
>> result, the !buffer_mapped(bh) check will evaluate to true for a clean
>> unwritten block, the function will return NULL.
> 
> Argh, right. But ext4_ext_map_blocks() sets both BH_Unwritten and BH_Mapped
> in the returned map->m_flags for unwritten extents. So I think my comment
> still applies. 

Because ext4_get_block() is called with create = 0, this is purely a
lookup operation. Therefore, in ext4_ext_map_blocks() ->
ext4_ext_handle_unwritten_extents(), only EXT4_MAP_UNWRITTEN is set,
and EXT4_MAP_MAPPED is not (please see the branch where
if (flags & EXT4_GET_BLOCKS_CREATE) == 0)). So at this point, whether
the lookup is served from the extent status cache or from the on-disk
query, it works correctly now.

> Plus there's a very strange inconsistency between this and
> the lookup in extent status tree in ext4_map_blocks() where (as you
> indicate) we only set BH_Unwritten but *not* BH_Mapped. AFAICT there's
> something buggy in here :) Note that when we don't set BH_Mapped flag e.g.
> from ext4_get_block_unwritten() (which gets used when delalloc is disabled),
> then writes to unwritten extent will get lost because
> ext4_bio_write_folio() will just treat them as holes...
> 
> 								Honza

In fact, there is no real problem here. For the write path that gets an
unwritten block, EXT4_GET_BLOCKS_UNWRIT_EXT and
EXT4_GET_BLOCKS_CREATE are always passed together. This ensures that
even if we find the unwritten extent in the extent status cache, we
still call ext4_map_create_blocks() (due to EXT4_GET_BLOCKS_CREATE)
and go through ext4_ext_handle_unwritten_extents(). Inside that
function, because the flags contain EXT4_GET_BLOCKS_UNWRIT_EXT,
EXT4_MAP_MAPPED is set. So for the write path, ext4_bio_write_folio()
still writes back correctly.

That said, the semantic inconsistency between the lookup-only and
allocation paths has been bothering me for a while. I had to be very
careful about this when developing the buffered I/O iomap path, because
this semantics also differs from the iomap infrastructure. In iomap, for
an unwritten extent, IOMAP_UNWRITTEN and IOMAP_MAPPED are not set
simultaneously; IOMAP_MAPPED always represents a written extent. The
good news is that this conversion is not complicated.

I asked an agent to look into the historical reasons, and here is what
he told me:

  When ext4 introduced fallocate + delalloc in 2008-2009, it followed
  the XFS convention that "unwritten is not mapped." However, ext4's
  writeback path (mpage_da_submit_io) depended on BH_MAPPED to issue
  I/O, which meant that unwritten buffers would never be written out.
  Commit bf068ee266f9 was the first fix - it converted unwritten to
  mapped during the writepages stage. Commit 29fa89d08894 was the
  second fix - it made unwritten carry mapped at the write-begin stage.
  Commit 2a8964d63d50 was a revert patch - it cleared the unwritten
  flag in the create = 1 path to avoid duplicate get_block() calls.

  Then the extent status cache was introduced in 2013. Commits
  a25a4e1a5d5d (Zheng Liu, 2013-02) and d100eef2440f (2013) added the
  ES-cache lookup path. In that path, ext4_es_is_unwritten returns
  EXT4_MAP_UNWRITTEN without EXT4_MAP_MAPPED - this was not a new
  design decision, but rather a continuation of the old 2009 convention
  that "unwritten is not mapped." Zheng Liu's commit message did not
  discuss whether MAPPED should be set at the same time; it only said
  "ext4_map_blocks needs to use this flag to determine the extent
  status." He was simply implementing the ES-cache lookup, and the
  status semantics followed the old implementation.

  The comment in ext4_map_blocks() - which says "if blocks have been
  preallocated ext4_ext_map_blocks() returns with buffer head unmapped"
  - was inherited from Aneesh's 2009 code. It was not a new decision
  made in 2013.

So my understanding is that this inconsistency in ext4_map_blocks() is
a historical legacy problem. There are many places where handling these
two flags is also semantically ambiguous. Some may treat those with the
BH_mapped flag set as written, such as ext4_load_tail_bh(). Therefore,
setting the mapped flag may cause many issues, and it can only be done
by setting the unwritten flag when performing a query. If I am wrong
about any of this, please correct me.

Going forward, I think we should revisit and clearly redefine the
semantics of these two flags, and unify the behavior of ext4_get_block().
Otherwise, carrying this historical baggage will only hinder future
development. :-)

Thanks,
Yi.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
  2026-07-03  3:39         ` Zhang Yi
@ 2026-07-03 16:01           ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2026-07-03 16:01 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	chengzhihao1, yangerkun

On Fri 03-07-26 11:39:03, Zhang Yi wrote:
> On 7/3/2026 12:43 AM, Jan Kara wrote:
> > On Thu 02-07-26 22:40:17, Zhang Yi wrote:
> >> On 7/2/2026 5:53 PM, Jan Kara wrote:
> >>> On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
> >>>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>>
> >>>> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
> >>>> buffers, but the conditions that lead to this are not obvious from the
> >>>> code alone. Document this behavior to clarify the return value, so that
> >>>> readers do not mistakenly assume that only holes result in a NULL
> >>>> return.
> >>>>
> >>>> Also update the inline comment following the ext4_get_block() call to
> >>>> reflect this: both holes and clean unwritten buffers fall through to the
> >>>> "nothing to do" path.
> >>>>
> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>>> ---
> >>>>   fs/ext4/inode.c | 6 +++++-
> >>>>   1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>>> index c2c2d6ac7f3d..0b31fa873743 100644
> >>>> --- a/fs/ext4/inode.c
> >>>> +++ b/fs/ext4/inode.c
> >>>> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
> >>>>    * because it might have data in pagecache (eg, if called from ext4_zero_range,
> >>>>    * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
> >>>>    * racing writeback can come later and flush the stale pagecache to disk.
> >>>> + *
> >>>> + * Return the loaded bh if it actually needs zeroing - in written, dirty
> >>>> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
> >>>> + * a clean unwritten block).
> >>>>    */
> >>>
> >>> Great that you're adding this comment because when I was reading previous
> >>> patch, I've spent like 10 minutes trying to figure that out :). But as far
> >>> as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
> >>> even if it is clean - map_bh() in _ext4_get_block() will set
> >>> buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
> >>> seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
> >>> unwritten bh if things align wrongly...
> >>>
> >>> 								Honza
> >>
> >> Thank you for the review!
> >>
> >> Please note the ext4_update_bh_state(bh, map.m_flags) call in
> >> _ext4_get_block() — it restores the mapped flag back to unwritten. As a
> >> result, the !buffer_mapped(bh) check will evaluate to true for a clean
> >> unwritten block, the function will return NULL.
> > 
> > Argh, right. But ext4_ext_map_blocks() sets both BH_Unwritten and BH_Mapped
> > in the returned map->m_flags for unwritten extents. So I think my comment
> > still applies. 
> 
> Because ext4_get_block() is called with create = 0, this is purely a
> lookup operation. Therefore, in ext4_ext_map_blocks() ->
> ext4_ext_handle_unwritten_extents(), only EXT4_MAP_UNWRITTEN is set,
> and EXT4_MAP_MAPPED is not (please see the branch where
> if (flags & EXT4_GET_BLOCKS_CREATE) == 0)). So at this point, whether
> the lookup is served from the extent status cache or from the on-disk
> query, it works correctly now.

Aah, now I remember. Yes, we have a catch in the mapping code like that.
Thanks for reminding me again. Now back to my original question: So 
ext4_load_tail_bh() returns unwritten bh when it is dirty because someone
must have called ext4_get_block() on it with EXT4_GET_BLOCKS_CREATE and at
that point we set the BH_Mapped flag on it? Subtle... It would be worth to
write this explanation to that comment as well.

> > Plus there's a very strange inconsistency between this and
> > the lookup in extent status tree in ext4_map_blocks() where (as you
> > indicate) we only set BH_Unwritten but *not* BH_Mapped. AFAICT there's
> > something buggy in here :) Note that when we don't set BH_Mapped flag e.g.
> > from ext4_get_block_unwritten() (which gets used when delalloc is disabled),
> > then writes to unwritten extent will get lost because
> > ext4_bio_write_folio() will just treat them as holes...
> 
> In fact, there is no real problem here. For the write path that gets an
> unwritten block, EXT4_GET_BLOCKS_UNWRIT_EXT and
> EXT4_GET_BLOCKS_CREATE are always passed together. This ensures that
> even if we find the unwritten extent in the extent status cache, we
> still call ext4_map_create_blocks() (due to EXT4_GET_BLOCKS_CREATE)
> and go through ext4_ext_handle_unwritten_extents(). Inside that
> function, because the flags contain EXT4_GET_BLOCKS_UNWRIT_EXT,
> EXT4_MAP_MAPPED is set. So for the write path, ext4_bio_write_folio()
> still writes back correctly.
> 
> That said, the semantic inconsistency between the lookup-only and
> allocation paths has been bothering me for a while. I had to be very
> careful about this when developing the buffered I/O iomap path, because
> this semantics also differs from the iomap infrastructure. In iomap, for
> an unwritten extent, IOMAP_UNWRITTEN and IOMAP_MAPPED are not set
> simultaneously; IOMAP_MAPPED always represents a written extent. The
> good news is that this conversion is not complicated.
> 
> I asked an agent to look into the historical reasons, and here is what
> he told me:
> 
>   When ext4 introduced fallocate + delalloc in 2008-2009, it followed
>   the XFS convention that "unwritten is not mapped." However, ext4's
>   writeback path (mpage_da_submit_io) depended on BH_MAPPED to issue
>   I/O, which meant that unwritten buffers would never be written out.
>   Commit bf068ee266f9 was the first fix - it converted unwritten to
>   mapped during the writepages stage. Commit 29fa89d08894 was the
>   second fix - it made unwritten carry mapped at the write-begin stage.
>   Commit 2a8964d63d50 was a revert patch - it cleared the unwritten
>   flag in the create = 1 path to avoid duplicate get_block() calls.
> 
>   Then the extent status cache was introduced in 2013. Commits
>   a25a4e1a5d5d (Zheng Liu, 2013-02) and d100eef2440f (2013) added the
>   ES-cache lookup path. In that path, ext4_es_is_unwritten returns
>   EXT4_MAP_UNWRITTEN without EXT4_MAP_MAPPED - this was not a new
>   design decision, but rather a continuation of the old 2009 convention
>   that "unwritten is not mapped." Zheng Liu's commit message did not
>   discuss whether MAPPED should be set at the same time; it only said
>   "ext4_map_blocks needs to use this flag to determine the extent
>   status." He was simply implementing the ES-cache lookup, and the
>   status semantics followed the old implementation.
> 
>   The comment in ext4_map_blocks() - which says "if blocks have been
>   preallocated ext4_ext_map_blocks() returns with buffer head unmapped"
>   - was inherited from Aneesh's 2009 code. It was not a new decision
>   made in 2013.
> 
> So my understanding is that this inconsistency in ext4_map_blocks() is
> a historical legacy problem. There are many places where handling these
> two flags is also semantically ambiguous. Some may treat those with the
> BH_mapped flag set as written, such as ext4_load_tail_bh(). Therefore,
> setting the mapped flag may cause many issues, and it can only be done
> by setting the unwritten flag when performing a query. If I am wrong
> about any of this, please correct me.
> 
> Going forward, I think we should revisit and clearly redefine the
> semantics of these two flags, and unify the behavior of ext4_get_block().
> Otherwise, carrying this historical baggage will only hinder future
> development. :-)

Yeah, frankly I think it would be good to cleanup. But as you say it is
currently working correctly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-07-03 16:01 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
2026-07-02  9:40   ` Jan Kara
2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
2026-07-02  9:53   ` Jan Kara
2026-07-02 14:40     ` Zhang Yi
2026-07-02 16:43       ` Jan Kara
2026-07-03  3:39         ` Zhang Yi
2026-07-03 16:01           ` Jan Kara
2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
2026-07-02  9:57   ` Jan Kara
2026-07-01 14:20 ` [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
2026-07-02 10:05   ` Jan Kara
2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
2026-07-02 10:19   ` Jan Kara
2026-07-02 14:51     ` Zhang Yi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox