* [PATCH v5 0/4] allow partial folio write with iomap_folio_state
@ 2025-09-23  4:21 alexjlzheng
  2025-09-23  4:21 ` [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: alexjlzheng @ 2025-09-23  4:21 UTC (permalink / raw)
  To: brauner, djwong, hch, kernel
  Cc: linux-xfs, linux-fsdevel, linux-kernel, yi.zhang, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

Currently, if a partial write occurs during a buffered write, the
entire write is discarded. While this is an uncommon case, it's still
wasteful and we can do better.

With iomap_folio_state, we can track the uptodate state at the block
level, and read_folio can correctly handle partially uptodate folios.

Therefore, when a partial write occurs, accept the block-aligned
partial write instead of rejecting the entire write.

For example, suppose a folio is 2MB, blocksize is 4kB, and the copied
bytes are 2MB-3kB.

Without this patchset, we'd need to recopy from the beginning of the
folio in the next iteration, which means 2MB-3kB of data is copied
twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
 |<-------- 1MB -------->|                          next time we need copy (chunk /= 2)
                         |<-------- 1MB -------->|  next next time we need copy.

 |<------ 2MB-3kB bytes duplicate copy ---->|

With this patchset, we can accept 2MB-4kB of the copied bytes, which is
block-aligned. Then we only need to process the remaining 4kB in the
next iteration, so only 1kB ends up being copied twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
                                         |<-4kB->|  next time we need copy

                                         |<>|
                              only 1kB bytes duplicate copy
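
The trim itself is just power-of-two masking. As a sanity check, here is
a minimal userspace sketch of the arithmetic in the example above
(illustrative only, not the kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long block_size = 4096;              /* 4kB blocks */
	unsigned long folio_size = 2UL << 20;         /* 2MB folio */
	unsigned long copied = folio_size - 3 * 1024; /* short write: 2MB-3kB */

	/* drop the partially written tail block */
	unsigned long accepted = copied & ~(block_size - 1);

	printf("accepted %lu, remaining %lu, duplicated %lu\n",
	       accepted,              /* 2MB-4kB is kept */
	       folio_size - accepted, /* 4kB left for the next iteration */
	       copied - accepted);    /* only 1kB is copied twice */
	return 0;
}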

Although partial writes are inherently unusual and account for only a
small fraction of performance-test workloads, the optimization is
still worthwhile in large-scale data centers.

This patchset has been tested with xfstests' generic and xfs groups,
with no new failures compared to the latest upstream kernel.

Changelog:

V5: patch[1]: use WARN_ON_ONCE() instead of WARN_ON(), as suggested by Pankaj Raghav (Samsung)

V4: https://lore.kernel.org/linux-fsdevel/eyyshgzsxupyen6ms3izkh45ydh3ekxycpk5p4dbets6mpyhch@q4db2ayr4g3r/
    patch[4]: better documentation in code, and add motivation to the cover letter

V3: https://lore.kernel.org/linux-xfs/aMPIDGq7pVuURg1t@infradead.org/
    patch[1]: use WARN_ON() instead of BUG_ON()
    patch[2]: make commit message clear
    patch[3]: -
    patch[4]: make commit message clear

V2: https://lore.kernel.org/linux-fsdevel/20250810101554.257060-1-alexjlzheng@tencent.com/ 
    use & instead of % on a 64-bit variable; the % form fails to link on m68k/xtensa (see the sketch after the changelog):
       m68k-linux-ld: fs/iomap/buffered-io.o: in function `iomap_adjust_read_range':
    >> buffered-io.c:(.text+0xa8a): undefined reference to `__moddi3'
    >> m68k-linux-ld: buffered-io.c:(.text+0xaa8): undefined reference to `__moddi3'

V1: https://lore.kernel.org/linux-fsdevel/20250810044806.3433783-1-alexjlzheng@tencent.com/
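
(Re the V2 note above: a standalone illustration of why the mask form is
preferred; on 32-bit targets without hardware 64-bit division, '%' on a
64-bit value lowers to a libgcc __moddi3 call, while the mask form is a
single AND as long as block_size is a power of two.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t pos = (2ULL << 20) - 3072;
	uint64_t block_size = 4096;	/* must be a power of two */

	/* 'pos % block_size' would pull in __moddi3 on m68k/xtensa */
	uint64_t poff = pos & (block_size - 1);

	printf("offset in block: %llu\n", (unsigned long long)poff);
	return 0;
}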

Jinliang Zheng (4):
  iomap: make sure iomap_adjust_read_range() are aligned with block_size
  iomap: move iter revert case out of the unwritten branch
  iomap: make iomap_write_end() return the number of written length again
  iomap: don't abandon the whole copy when we have iomap_folio_state

 fs/iomap/buffered-io.c | 80 +++++++++++++++++++++++++++++-------------
 1 file changed, 55 insertions(+), 25 deletions(-)

-- 
2.49.0


* [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
  2025-09-23  4:21 [PATCH v5 0/4] allow partial folio write with iomap_folio_state alexjlzheng
@ 2025-09-23  4:21 ` alexjlzheng
  2025-09-25 18:59   ` Brian Foster
  2025-09-23  4:21 ` [PATCH v5 2/4] iomap: move iter revert case out of the unwritten branch alexjlzheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: alexjlzheng @ 2025-09-23  4:21 UTC (permalink / raw)
  To: brauner, djwong, hch, kernel
  Cc: linux-xfs, linux-fsdevel, linux-kernel, yi.zhang, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

iomap_folio_state marks the uptodate state in units of block_size, so
it is better to check that pos and length are aligned with block_size.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/iomap/buffered-io.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fd827398afd2..ee1b2cd8a4b4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
+	WARN_ON_ONCE(*pos & (block_size - 1));
+	WARN_ON_ONCE(length & (block_size - 1));
+
 	/*
 	 * If the block size is smaller than the page size, we need to check the
 	 * per-block uptodate status and adjust the offset and length if needed
-- 
2.49.0


* [PATCH v5 2/4] iomap: move iter revert case out of the unwritten branch
  2025-09-23  4:21 [PATCH v5 0/4] allow partial folio write with iomap_folio_state alexjlzheng
  2025-09-23  4:21 ` [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
@ 2025-09-23  4:21 ` alexjlzheng
  2025-09-25 18:59   ` Brian Foster
  2025-09-23  4:21 ` [PATCH v5 3/4] iomap: make iomap_write_end() return the number of written length again alexjlzheng
  2025-09-23  4:21 ` [PATCH v5 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
  3 siblings, 1 reply; 9+ messages in thread
From: alexjlzheng @ 2025-09-23  4:21 UTC (permalink / raw)
  To: brauner, djwong, hch, kernel
  Cc: linux-xfs, linux-fsdevel, linux-kernel, yi.zhang, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

Commit e1f453d4336d ("iomap: do some small logical cleanup in
buffered write") merged the iomap_write_failed() and iov_iter_revert()
calls into the written == 0 branch, because at the time
iomap_write_end() could never return a partial write length.

A subsequent patch will modify iomap_write_end() so that it can return
a block-aligned partial write length (partial here relative to the
folio-sized write), which breaks that patch's assumption.

Move the calls back out of the branch to prepare for that change.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/iomap/buffered-io.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ee1b2cd8a4b4..e130db3b761e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1019,6 +1019,11 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
 
 		if (old_size < pos)
 			pagecache_isize_extended(iter->inode, old_size, pos);
+		if (written < bytes)
+			iomap_write_failed(iter->inode, pos + written,
+					   bytes - written);
+		if (unlikely(copied != written))
+			iov_iter_revert(i, copied - written);
 
 		cond_resched();
 		if (unlikely(written == 0)) {
@@ -1028,9 +1033,6 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
 			 * halfway through, might be a race with munmap,
 			 * might be severe memory pressure.
 			 */
-			iomap_write_failed(iter->inode, pos, bytes);
-			iov_iter_revert(i, copied);
-
 			if (chunk > PAGE_SIZE)
 				chunk /= 2;
 			if (copied) {
-- 
2.49.0


* [PATCH v5 3/4] iomap: make iomap_write_end() return the number of written length again
  2025-09-23  4:21 [PATCH v5 0/4] allow partial folio write with iomap_folio_state alexjlzheng
  2025-09-23  4:21 ` [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
  2025-09-23  4:21 ` [PATCH v5 2/4] iomap: move iter revert case out of the unwritten branch alexjlzheng
@ 2025-09-23  4:21 ` alexjlzheng
  2025-09-25 19:00   ` Brian Foster
  2025-09-23  4:21 ` [PATCH v5 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
  3 siblings, 1 reply; 9+ messages in thread
From: alexjlzheng @ 2025-09-23  4:21 UTC (permalink / raw)
  To: brauner, djwong, hch, kernel
  Cc: linux-xfs, linux-fsdevel, linux-kernel, yi.zhang, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

The next patch allows iomap_write_end() to conditionally accept
partial writes, so make iomap_write_end() return the number of
accepted bytes in preparation for that change.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/iomap/buffered-io.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e130db3b761e..6e516c7d9f04 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,7 +873,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
 	return status;
 }
 
-static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	flush_dcache_folio(folio);
@@ -890,11 +890,11 @@ static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 * redo the whole thing.
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
-		return false;
+		return 0;
 	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
 	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
-	return true;
+	return copied;
 }
 
 static void iomap_write_end_inline(const struct iomap_iter *iter,
@@ -915,10 +915,10 @@ static void iomap_write_end_inline(const struct iomap_iter *iter,
 }
 
 /*
- * Returns true if all copied bytes have been written to the pagecache,
- * otherwise return false.
+ * Returns the number of copied bytes that have been written to the
+ * pagecache, or zero if a block was only partially updated.
  */
-static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
+static int iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
 		struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -926,7 +926,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
 
 	if (srcmap->type == IOMAP_INLINE) {
 		iomap_write_end_inline(iter, folio, pos, copied);
-		return true;
+		return copied;
 	}
 
 	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
@@ -934,7 +934,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
 
 		bh_written = block_write_end(pos, len, copied, folio);
 		WARN_ON_ONCE(bh_written != copied && bh_written != 0);
-		return bh_written == copied;
+		return bh_written;
 	}
 
 	return __iomap_write_end(iter->inode, pos, len, copied, folio);
@@ -1000,8 +1000,7 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
 			flush_dcache_folio(folio);
 
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
-		written = iomap_write_end(iter, bytes, copied, folio) ?
-			  copied : 0;
+		written = iomap_write_end(iter, bytes, copied, folio);
 
 		/*
 		 * Update the in-memory inode size after copying the data into
@@ -1315,7 +1314,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
 	do {
 		struct folio *folio;
 		size_t offset;
-		bool ret;
+		int ret;
 
 		bytes = min_t(u64, SIZE_MAX, bytes);
 		status = iomap_write_begin(iter, write_ops, &folio, &offset,
@@ -1327,7 +1326,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
 
 		ret = iomap_write_end(iter, bytes, bytes, folio);
 		__iomap_put_folio(iter, write_ops, bytes, folio);
-		if (WARN_ON_ONCE(!ret))
+		if (WARN_ON_ONCE(ret != bytes))
 			return -EIO;
 
 		cond_resched();
@@ -1388,7 +1387,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	do {
 		struct folio *folio;
 		size_t offset;
-		bool ret;
+		int ret;
 
 		bytes = min_t(u64, SIZE_MAX, bytes);
 		status = iomap_write_begin(iter, write_ops, &folio, &offset,
@@ -1406,7 +1405,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 
 		ret = iomap_write_end(iter, bytes, bytes, folio);
 		__iomap_put_folio(iter, write_ops, bytes, folio);
-		if (WARN_ON_ONCE(!ret))
+		if (WARN_ON_ONCE(ret != bytes))
 			return -EIO;
 
 		status = iomap_iter_advance(iter, &bytes);
-- 
2.49.0


* [PATCH v5 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
  2025-09-23  4:21 [PATCH v5 0/4] allow partial folio write with iomap_folio_state alexjlzheng
                   ` (2 preceding siblings ...)
  2025-09-23  4:21 ` [PATCH v5 3/4] iomap: make iomap_write_end() return the number of written length again alexjlzheng
@ 2025-09-23  4:21 ` alexjlzheng
  2025-09-25 19:01   ` Brian Foster
  3 siblings, 1 reply; 9+ messages in thread
From: alexjlzheng @ 2025-09-23  4:21 UTC (permalink / raw)
  To: brauner, djwong, hch, kernel
  Cc: linux-xfs, linux-fsdevel, linux-kernel, yi.zhang, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

Currently, if a partial write occurs during a buffered write, the
entire write is discarded. While this is an uncommon case, it's still
wasteful and we can do better.

With iomap_folio_state, we can track the uptodate state at the block
level, and read_folio can correctly handle partially uptodate folios.

Therefore, when a partial write occurs, accept the block-aligned
partial write instead of rejecting the entire write.

For example, suppose a folio is 2MB, blocksize is 4kB, and the copied
bytes are 2MB-3kB.

Without this patchset, we'd need to recopy from the beginning of the
folio in the next iteration, which means 2MB-3kB of data is copied
twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
 |<-------- 1MB -------->|                          next time we need copy (chunk /= 2)
                         |<-------- 1MB -------->|  next next time we need copy.

 |<------ 2MB-3kB bytes duplicate copy ---->|

With this patchset, we can accept 2MB-4kB of the copied bytes, which is
block-aligned. Then we only need to process the remaining 4kB in the
next iteration, so only 1kB ends up being copied twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
                                         |<-4kB->|  next time we need copy

                                         |<>|
                              only 1kB bytes duplicate copy

Although partial writes are inherently unusual and account for only a
small fraction of performance-test workloads, the optimization is
still worthwhile in large-scale data centers.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6e516c7d9f04..3304028ce64f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
 	return status;
 }
 
+static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
+		size_t copied, struct folio *folio)
+{
+	struct iomap_folio_state *ifs = folio->private;
+	unsigned block_size, last_blk, last_blk_bytes;
+
+	if (!ifs || !copied)
+		return 0;
+
+	block_size = 1 << inode->i_blkbits;
+	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
+	last_blk_bytes = (pos + copied) & (block_size - 1);
+
+	if (!ifs_block_is_uptodate(ifs, last_blk))
+		copied -= min(copied, last_blk_bytes);
+
+	return copied;
+}
+
 static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
@@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	/*
 	 * The blocks that were entirely written will now be uptodate, so we
 	 * don't have to worry about a read_folio reading them and overwriting a
-	 * partial write.  However, if we've encountered a short write and only
-	 * partially written into a block, it will not be marked uptodate, so a
-	 * read_folio might come in and destroy our partial write.
+	 * partial write.
 	 *
-	 * Do the simplest thing and just treat any short write to a
-	 * non-uptodate page as a zero-length write, and force the caller to
-	 * redo the whole thing.
+	 * However, if we've encountered a short write and only partially
+	 * written into a block, we must discard the short-written _tail_ block
+	 * and not mark it uptodate in the ifs, so that a read_folio can
+	 * handle it correctly via iomap_adjust_read_range(). It's safe to
+	 * keep the non-tail block writes because we know that a non-tail
+	 * block:
+	 * - is either fully written, since copy_from_user() is sequential
+	 * - or is a partially written head block that has already been read in
+	 *   and marked uptodate in the ifs by iomap_write_begin().
 	 */
-	if (unlikely(copied < len && !folio_test_uptodate(folio)))
-		return 0;
-	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
+	if (unlikely(copied < len && !folio_test_uptodate(folio))) {
+		copied = iomap_trim_tail_partial(inode, pos, copied, folio);
+		if (!copied)
+			return 0;
+	}
+	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
 	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
-- 
2.49.0


* Re: [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
  2025-09-23  4:21 ` [PATCH v5 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
@ 2025-09-25 18:59   ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2025-09-25 18:59 UTC (permalink / raw)
  To: alexjlzheng
  Cc: brauner, djwong, hch, kernel, linux-xfs, linux-fsdevel,
	linux-kernel, yi.zhang, Jinliang Zheng

On Tue, Sep 23, 2025 at 12:21:55PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> iomap_folio_state marks the uptodate state in units of block_size, so
> it is better to check that pos and length are aligned with block_size.
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---
>  fs/iomap/buffered-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index fd827398afd2..ee1b2cd8a4b4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  	unsigned first = poff >> block_bits;
>  	unsigned last = (poff + plen - 1) >> block_bits;
>  
> +	WARN_ON_ONCE(*pos & (block_size - 1));
> +	WARN_ON_ONCE(length & (block_size - 1));
> +

I thought Joanne's patch [1] enhanced this function to deal with this
sort of corner case. Is this necessary if we go that route or am I
missing something?

Brian

[1] https://lore.kernel.org/linux-fsdevel/20250922180042.1775241-1-joannelkoong@gmail.com/

>  	/*
>  	 * If the block size is smaller than the page size, we need to check the
>  	 * per-block uptodate status and adjust the offset and length if needed
> -- 
> 2.49.0
> 
> 


* Re: [PATCH v5 2/4] iomap: move iter revert case out of the unwritten branch
  2025-09-23  4:21 ` [PATCH v5 2/4] iomap: move iter revert case out of the unwritten branch alexjlzheng
@ 2025-09-25 18:59   ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2025-09-25 18:59 UTC (permalink / raw)
  To: alexjlzheng
  Cc: brauner, djwong, hch, kernel, linux-xfs, linux-fsdevel,
	linux-kernel, yi.zhang, Jinliang Zheng

On Tue, Sep 23, 2025 at 12:21:56PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> Commit e1f453d4336d ("iomap: do some small logical cleanup in
> buffered write") merged the iomap_write_failed() and iov_iter_revert()
> calls into the written == 0 branch, because at the time
> iomap_write_end() could never return a partial write length.
> 
> A subsequent patch will modify iomap_write_end() so that it can return
> a block-aligned partial write length (partial here relative to the
> folio-sized write), which breaks that patch's assumption.
> 
> Move the calls back out of the branch to prepare for that change.
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/iomap/buffered-io.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ee1b2cd8a4b4..e130db3b761e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1019,6 +1019,11 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
>  
>  		if (old_size < pos)
>  			pagecache_isize_extended(iter->inode, old_size, pos);
> +		if (written < bytes)
> +			iomap_write_failed(iter->inode, pos + written,
> +					   bytes - written);
> +		if (unlikely(copied != written))
> +			iov_iter_revert(i, copied - written);
>  
>  		cond_resched();
>  		if (unlikely(written == 0)) {
> @@ -1028,9 +1033,6 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
>  			 * halfway through, might be a race with munmap,
>  			 * might be severe memory pressure.
>  			 */
> -			iomap_write_failed(iter->inode, pos, bytes);
> -			iov_iter_revert(i, copied);
> -
>  			if (chunk > PAGE_SIZE)
>  				chunk /= 2;
>  			if (copied) {
> -- 
> 2.49.0
> 
> 


* Re: [PATCH v5 3/4] iomap: make iomap_write_end() return the number of written length again
  2025-09-23  4:21 ` [PATCH v5 3/4] iomap: make iomap_write_end() return the number of written length again alexjlzheng
@ 2025-09-25 19:00   ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2025-09-25 19:00 UTC (permalink / raw)
  To: alexjlzheng
  Cc: brauner, djwong, hch, kernel, linux-xfs, linux-fsdevel,
	linux-kernel, yi.zhang, Jinliang Zheng

On Tue, Sep 23, 2025 at 12:21:57PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> The next patch allows iomap_write_end() to conditionally accept
> partial writes, so make iomap_write_end() return the number of
> accepted bytes in preparation for that change.
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---
>  fs/iomap/buffered-io.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e130db3b761e..6e516c7d9f04 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -915,10 +915,10 @@ static void iomap_write_end_inline(const struct iomap_iter *iter,
>  }
>  
>  /*
> - * Returns true if all copied bytes have been written to the pagecache,
> - * otherwise return false.
> + * Returns the number of copied bytes that have been written to the
> + * pagecache, or zero if a block was only partially updated.
>   */
> -static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> +static int iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
>  		struct folio *folio)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> @@ -926,7 +926,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
>  
>  	if (srcmap->type == IOMAP_INLINE) {
>  		iomap_write_end_inline(iter, folio, pos, copied);
> -		return true;
> +		return copied;
>  	}
>  
>  	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> @@ -934,7 +934,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
>  
>  		bh_written = block_write_end(pos, len, copied, folio);
>  		WARN_ON_ONCE(bh_written != copied && bh_written != 0);
> -		return bh_written == copied;
> +		return bh_written;

I notice block_write_end() actually returns an int. Not sure it's an
issue really, but perhaps we should just change the type of bh_written
here as well. Otherwise seems reasonable.

Brian

>  	}
>  
>  	return __iomap_write_end(iter->inode, pos, len, copied, folio);
> @@ -1000,8 +1000,7 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
>  			flush_dcache_folio(folio);
>  
>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> -		written = iomap_write_end(iter, bytes, copied, folio) ?
> -			  copied : 0;
> +		written = iomap_write_end(iter, bytes, copied, folio);
>  
>  		/*
>  		 * Update the in-memory inode size after copying the data into
> @@ -1315,7 +1314,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
>  	do {
>  		struct folio *folio;
>  		size_t offset;
> -		bool ret;
> +		int ret;
>  
>  		bytes = min_t(u64, SIZE_MAX, bytes);
>  		status = iomap_write_begin(iter, write_ops, &folio, &offset,
> @@ -1327,7 +1326,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
>  
>  		ret = iomap_write_end(iter, bytes, bytes, folio);
>  		__iomap_put_folio(iter, write_ops, bytes, folio);
> -		if (WARN_ON_ONCE(!ret))
> +		if (WARN_ON_ONCE(ret != bytes))
>  			return -EIO;
>  
>  		cond_resched();
> @@ -1388,7 +1387,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  	do {
>  		struct folio *folio;
>  		size_t offset;
> -		bool ret;
> +		int ret;
>  
>  		bytes = min_t(u64, SIZE_MAX, bytes);
>  		status = iomap_write_begin(iter, write_ops, &folio, &offset,
> @@ -1406,7 +1405,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  
>  		ret = iomap_write_end(iter, bytes, bytes, folio);
>  		__iomap_put_folio(iter, write_ops, bytes, folio);
> -		if (WARN_ON_ONCE(!ret))
> +		if (WARN_ON_ONCE(ret != bytes))
>  			return -EIO;
>  
>  		status = iomap_iter_advance(iter, &bytes);
> -- 
> 2.49.0
> 
> 


* Re: [PATCH v5 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
  2025-09-23  4:21 ` [PATCH v5 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
@ 2025-09-25 19:01   ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2025-09-25 19:01 UTC (permalink / raw)
  To: alexjlzheng
  Cc: brauner, djwong, hch, kernel, linux-xfs, linux-fsdevel,
	linux-kernel, yi.zhang, Jinliang Zheng

On Tue, Sep 23, 2025 at 12:21:58PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> Currently, if a partial write occurs during a buffered write, the
> entire write is discarded. While this is an uncommon case, it's still
> wasteful and we can do better.
> 
> With iomap_folio_state, we can track the uptodate state at the block
> level, and read_folio can correctly handle partially uptodate folios.
> 
> Therefore, when a partial write occurs, accept the block-aligned
> partial write instead of rejecting the entire write.
> 
> For example, suppose a folio is 2MB, blocksize is 4kB, and the copied
> bytes are 2MB-3kB.
> 
> Without this patchset, we'd need to recopy from the beginning of the
> folio in the next iteration, which means 2MB-3kB of data is copied
> twice.
> 
>  |<-------------------- 2MB -------------------->|
>  +-------+-------+-------+-------+-------+-------+
>  | block |  ...  | block | block |  ...  | block | folio
>  +-------+-------+-------+-------+-------+-------+
>  |<-4kB->|
> 
>  |<--------------- copied 2MB-3kB --------->|       first time copied
>  |<-------- 1MB -------->|                          next time we need copy (chunk /= 2)
>                          |<-------- 1MB -------->|  next next time we need copy.
> 
>  |<------ 2MB-3kB bytes duplicate copy ---->|
> 
> With this patchset, we can accept 2MB-4kB of the copied bytes, which is
> block-aligned. Then we only need to process the remaining 4kB in the
> next iteration, so only 1kB ends up being copied twice.
> 
>  |<-------------------- 2MB -------------------->|
>  +-------+-------+-------+-------+-------+-------+
>  | block |  ...  | block | block |  ...  | block | folio
>  +-------+-------+-------+-------+-------+-------+
>  |<-4kB->|
> 
>  |<--------------- copied 2MB-3kB --------->|       first time copied
>                                          |<-4kB->|  next time we need copy
> 
>                                          |<>|
>                               only 1kB bytes duplicate copy
> 
> Although partial writes are inherently unusual and account for only a
> small fraction of performance-test workloads, the optimization is
> still worthwhile in large-scale data centers.
> 

Thanks for the nice writeup and diagrams.

> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---
>  fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6e516c7d9f04..3304028ce64f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
>  	return status;
>  }
>  
> +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> +		size_t copied, struct folio *folio)
> +{
> +	struct iomap_folio_state *ifs = folio->private;
> +	unsigned block_size, last_blk, last_blk_bytes;
> +
> +	if (!ifs || !copied)
> +		return 0;
> +
> +	block_size = 1 << inode->i_blkbits;

I'd move this assignment to declaration time.

> +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> +	last_blk_bytes = (pos + copied) & (block_size - 1);
> +
> +	if (!ifs_block_is_uptodate(ifs, last_blk))
> +		copied -= min(copied, last_blk_bytes);

So I think I follow the idea here and it seems reasonable at first
glance. IIUC, the high level issue is that for certain writes we don't
read blocks up front if the write is expected to fully overwrite
blocks/folios, as we can just mark things uptodate on write completion.
If the write is short however, we now have a partial write to a
!uptodate block, so have to toss the write.

A few initial thoughts..

1. I don't really love the function name here. Maybe something like
iomap_write_end_short() or something would be more clear, but maybe
there are other opinions.

2. It might be helpful to move some of the comment below up to around
here where we actually trim the copied value.

3. I see that in __iomap_write_begin() we don't necessarily always
attach ifs if a write is expected to fully overwrite the entire folio.
It looks like that is handled with the !ifs check above, but it also
makes me wonder how effective this change is.

For example, the example in the commit log description appears to be a
short write of an attempted overwrite of a 2MB folio, right? Would we
expect to have ifs in that situation?

I don't really object to having the logic even if it is a real corner
case, but it would be good to have some test coverage to verify
behavior. Do you have a test case or anything (even if contrived) along
those lines? Perhaps we could play some games with badly formed
syscalls. A quick test to call pwritev() with a bad iov_base pointer
seems to produce a short write, but I haven't confirmed that's
sufficient for testing here..
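
FWIW, a minimal sketch of that pwritev() idea (untested; the path and
sizes are arbitrary, and the second iovec points at an address that
should fault partway through the copy):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	static char buf[8192];

	memset(buf, 'a', sizeof(buf));

	/*
	 * The first iovec is fine; the second should fault in
	 * copy_from_user(), so the copy stops partway and pwritev()
	 * returns a short count rather than failing outright.
	 */
	struct iovec iov[2] = {
		{ .iov_base = buf,         .iov_len = sizeof(buf) },
		{ .iov_base = (void *)0x1, .iov_len = 4096 },
	};

	printf("pwritev returned %zd\n", pwritev(fd, iov, 2, 0));
	close(fd);
	return 0;
}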

Brian

> +
> +	return copied;
> +}
> +
>  static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
> @@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	/*
>  	 * The blocks that were entirely written will now be uptodate, so we
>  	 * don't have to worry about a read_folio reading them and overwriting a
> -	 * partial write.  However, if we've encountered a short write and only
> -	 * partially written into a block, it will not be marked uptodate, so a
> -	 * read_folio might come in and destroy our partial write.
> +	 * partial write.
>  	 *
> -	 * Do the simplest thing and just treat any short write to a
> -	 * non-uptodate page as a zero-length write, and force the caller to
> -	 * redo the whole thing.
> +	 * However, if we've encountered a short write and only partially
> +	 * written into a block, we must discard the short-written _tail_ block
> +	 * and not mark it uptodate in the ifs, so that a read_folio can
> +	 * handle it correctly via iomap_adjust_read_range(). It's safe to
> +	 * keep the non-tail block writes because we know that a non-tail
> +	 * block:
> +	 * - is either fully written, since copy_from_user() is sequential
> +	 * - or is a partially written head block that has already been read in
> +	 *   and marked uptodate in the ifs by iomap_write_begin().
>  	 */
> -	if (unlikely(copied < len && !folio_test_uptodate(folio)))
> -		return 0;
> -	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
> +	if (unlikely(copied < len && !folio_test_uptodate(folio))) {
> +		copied = iomap_trim_tail_partial(inode, pos, copied, folio);
> +		if (!copied)
> +			return 0;
> +	}
> +	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
>  	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
> -- 
> 2.49.0
> 
> 

