* [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
2025-09-13 3:37 [PATCH v4 0/4] allow partial folio write with iomap_folio_state alexjlzheng
@ 2025-09-13 3:37 ` alexjlzheng
2025-09-14 11:45 ` Pankaj Raghav (Samsung)
2025-09-13 3:37 ` [PATCH 2/4] iomap: move iter revert case out of the unwritten branch alexjlzheng
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: alexjlzheng @ 2025-09-13 3:37 UTC (permalink / raw)
To: hch, brauner
Cc: djwong, yi.zhang, linux-xfs, linux-fsdevel, linux-kernel,
Jinliang Zheng
From: Jinliang Zheng <alexjlzheng@tencent.com>
iomap_folio_state marks the uptodate state in units of block_size, so
it is better to check that pos and length are aligned with block_size.
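
These checks rely on block_size being a power of two, which holds here because it is derived from inode->i_blkbits. A minimal standalone sketch of the mask test (hypothetical values, not part of the patch):

```c
#include <stdio.h>

int main(void)
{
	unsigned block_size = 4096;            /* 1 << inode->i_blkbits */
	unsigned long long pos = 6144;         /* 1.5 blocks in: misaligned */

	/* non-zero low bits mean pos is not on a block boundary */
	printf("low bits: %llu\n", pos & (block_size - 1));   /* prints 2048 */
	return 0;
}
```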
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
fs/iomap/buffered-io.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fd827398afd2..0c38333933c6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
+	WARN_ON(*pos & (block_size - 1));
+	WARN_ON(length & (block_size - 1));
+
 	/*
 	 * If the block size is smaller than the page size, we need to check the
 	 * per-block uptodate status and adjust the offset and length if needed
--
2.49.0
* Re: [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
2025-09-13 3:37 ` [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
@ 2025-09-14 11:45 ` Pankaj Raghav (Samsung)
2025-09-14 12:40 ` Jinliang Zheng
0 siblings, 1 reply; 14+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-14 11:45 UTC (permalink / raw)
To: alexjlzheng
Cc: hch, brauner, djwong, yi.zhang, linux-xfs, linux-fsdevel,
linux-kernel, Jinliang Zheng
On Sat, Sep 13, 2025 at 11:37:15AM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
>
> iomap_folio_state marks the uptodate state in units of block_size, so
> it is better to check that pos and length are aligned with block_size.
>
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---
> fs/iomap/buffered-io.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index fd827398afd2..0c38333933c6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> unsigned first = poff >> block_bits;
> unsigned last = (poff + plen - 1) >> block_bits;
>
> + WARN_ON(*pos & (block_size - 1));
> + WARN_ON(length & (block_size - 1));
Any reason you chose WARN_ON instead of WARN_ON_ONCE?
I don't see WARN_ON being used in iomap/buffered-io.c.
--
Pankaj
* Re: [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
2025-09-14 11:45 ` Pankaj Raghav (Samsung)
@ 2025-09-14 12:40 ` Jinliang Zheng
2025-09-15 8:54 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 14+ messages in thread
From: Jinliang Zheng @ 2025-09-14 12:40 UTC (permalink / raw)
To: kernel
Cc: alexjlzheng, alexjlzheng, brauner, djwong, hch, linux-fsdevel,
linux-kernel, linux-xfs, yi.zhang
On Sun, 14 Sep 2025 13:45:16 +0200, kernel@pankajraghav.com wrote:
> On Sat, Sep 13, 2025 at 11:37:15AM +0800, alexjlzheng@gmail.com wrote:
> > From: Jinliang Zheng <alexjlzheng@tencent.com>
> >
> > iomap_folio_state marks the uptodate state in units of block_size, so
> > it is better to check that pos and length are aligned with block_size.
> >
> > Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> > ---
> > fs/iomap/buffered-io.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index fd827398afd2..0c38333933c6 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> > unsigned first = poff >> block_bits;
> > unsigned last = (poff + plen - 1) >> block_bits;
> >
> > + WARN_ON(*pos & (block_size - 1));
> > + WARN_ON(length & (block_size - 1));
> Any reason you chose WARN_ON instead of WARN_ON_ONCE?
I just think it's a fatal error that deserves attention every time
it's triggered.
>
> I don't see WARN_ON being used in iomap/buffered-io.c.
I'm not sure if there are any community guidelines for using these
two macros. If there are, please let me know and I'll be happy to
follow them as a guide.
thanks,
Jinliang Zheng. :)
> --
> Pankaj
* Re: [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
2025-09-14 12:40 ` Jinliang Zheng
@ 2025-09-15 8:54 ` Pankaj Raghav (Samsung)
2025-09-15 9:12 ` Jinliang Zheng
0 siblings, 1 reply; 14+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-15 8:54 UTC (permalink / raw)
To: Jinliang Zheng
Cc: alexjlzheng, brauner, djwong, hch, linux-fsdevel, linux-kernel,
linux-xfs, yi.zhang
On Sun, Sep 14, 2025 at 08:40:06PM +0800, Jinliang Zheng wrote:
> On Sun, 14 Sep 2025 13:45:16 +0200, kernel@pankajraghav.com wrote:
> > On Sat, Sep 13, 2025 at 11:37:15AM +0800, alexjlzheng@gmail.com wrote:
> > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > >
> > > iomap_folio_state marks the uptodate state in units of block_size, so
> > > it is better to check that pos and length are aligned with block_size.
> > >
> > > Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> > > ---
> > > fs/iomap/buffered-io.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index fd827398afd2..0c38333933c6 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> > > unsigned first = poff >> block_bits;
> > > unsigned last = (poff + plen - 1) >> block_bits;
> > >
> > > + WARN_ON(*pos & (block_size - 1));
> > > + WARN_ON(length & (block_size - 1));
> > Any reason you chose WARN_ON instead of WARN_ON_ONCE?
>
> I just think it's a fatal error that deserves attention every time
> it's triggered.
>
Is this a general change, or do your later changes depend on these
warnings to work correctly?
> >
> > I don't see WARN_ON being used in iomap/buffered-io.c.
>
> I'm not sure if there are any community guidelines for using these
> two macros. If there are, please let me know and I'll be happy to
> follow them as a guide.
We typically use WARN_ON_ONCE to prevent spamming.
--
Pankaj
* Re: [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size
2025-09-15 8:54 ` Pankaj Raghav (Samsung)
@ 2025-09-15 9:12 ` Jinliang Zheng
0 siblings, 0 replies; 14+ messages in thread
From: Jinliang Zheng @ 2025-09-15 9:12 UTC (permalink / raw)
To: kernel
Cc: alexjlzheng, alexjlzheng, brauner, djwong, hch, linux-fsdevel,
linux-kernel, linux-xfs, yi.zhang
On Mon, 15 Sep 2025 10:54:00 +0200, kernel@pankajraghav.com wrote:
> On Sun, Sep 14, 2025 at 08:40:06PM +0800, Jinliang Zheng wrote:
> > On Sun, 14 Sep 2025 13:45:16 +0200, kernel@pankajraghav.com wrote:
> > > On Sat, Sep 13, 2025 at 11:37:15AM +0800, alexjlzheng@gmail.com wrote:
> > > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > > >
> > > > iomap_folio_state marks the uptodate state in units of block_size, so
> > > > it is better to check that pos and length are aligned with block_size.
> > > >
> > > > Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> > > > ---
> > > > fs/iomap/buffered-io.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index fd827398afd2..0c38333933c6 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -234,6 +234,9 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> > > > unsigned first = poff >> block_bits;
> > > > unsigned last = (poff + plen - 1) >> block_bits;
> > > >
> > > > + WARN_ON(*pos & (block_size - 1));
> > > > + WARN_ON(length & (block_size - 1));
> > > Any reason you chose WARN_ON instead of WARN_ON_ONCE?
> >
> > I just think it's a fatal error that deserves attention every time
> > it's triggered.
> >
>
> Is this a general change, or do your later changes depend on these
> warnings to work correctly?
No, there is no functional change.
I added it only because the correctness of iomap_adjust_read_range() depends on
this alignment, so it's better to highlight it now.
```
	/* move forward for each leading block marked uptodate */
	for (i = first; i <= last; i++) {
		if (!ifs_block_is_uptodate(ifs, i))
			break;
		*pos += block_size;    <-------------------- if not aligned, ...
		poff += block_size;
		plen -= block_size;
		first++;
	}
```
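
To make that dependency concrete, here is a userspace sketch (hypothetical values, not kernel code) of what the walk above does when *pos starts out unaligned: the advanced pos ends up inside a block rather than on its boundary, so the remaining range no longer lines up with the per-block uptodate bitmap.

```c
#include <stdio.h>

int main(void)
{
	unsigned block_size = 4096;
	unsigned long long pos = 100;      /* deliberately unaligned */
	unsigned poff = 100;               /* offset of pos within the folio */
	unsigned plen = 8192;
	int uptodate[3] = { 1, 0, 0 };     /* only the leading block is uptodate */

	for (int i = 0; i < 3 && uptodate[i]; i++) {
		pos += block_size;         /* 4196: inside block 1, not at its start */
		poff += block_size;
		plen -= block_size;
	}
	printf("pos=%llu (pos %% block_size = %llu), plen=%u\n",
	       pos, pos % block_size, plen);
	return 0;
}
```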
>
> > >
> > > I don't see WARN_ON being used in iomap/buffered-io.c.
> >
> > I'm not sure if there are any community guidelines for using these
> > two macros. If there are, please let me know and I'll be happy to
> > follow them as a guide.
>
> We typically use WARN_ON_ONCE to prevent spamming.
If you think it's better, I will send a new version.
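
For reference, the respin would only switch the two new lines to the _ONCE variant (a sketch, assuming nothing else in the hunk changes):

```c
	WARN_ON_ONCE(*pos & (block_size - 1));
	WARN_ON_ONCE(length & (block_size - 1));
```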
thanks,
Jinliang Zheng. :)
>
> --
> Pankaj
* [PATCH 2/4] iomap: move iter revert case out of the unwritten branch
2025-09-13 3:37 [PATCH v4 0/4] allow partial folio write with iomap_folio_state alexjlzheng
2025-09-13 3:37 ` [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
@ 2025-09-13 3:37 ` alexjlzheng
2025-09-13 3:37 ` [PATCH 3/4] iomap: make iomap_write_end() return the number of written length again alexjlzheng
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: alexjlzheng @ 2025-09-13 3:37 UTC (permalink / raw)
To: hch, brauner
Cc: djwong, yi.zhang, linux-xfs, linux-fsdevel, linux-kernel,
Jinliang Zheng
From: Jinliang Zheng <alexjlzheng@tencent.com>
Commit e1f453d4336d ("iomap: do some small logical cleanup in buffered
write") moved iomap_write_failed() and iov_iter_revert() into the
written == 0 branch, because at the time iomap_write_end() could never
return a partial write length.

A subsequent patch modifies iomap_write_end() so that it can return a
block-aligned partial write length (partial relative to the folio-sized
write), which breaks that assumption.

This patch moves the two calls back out of that branch in preparation
for the subsequent patches.
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
fs/iomap/buffered-io.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0c38333933c6..109c3bad6ccf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1019,6 +1019,11 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
if (old_size < pos)
pagecache_isize_extended(iter->inode, old_size, pos);
+ if (written < bytes)
+ iomap_write_failed(iter->inode, pos + written,
+ bytes - written);
+ if (unlikely(copied != written))
+ iov_iter_revert(i, copied - written);
cond_resched();
if (unlikely(written == 0)) {
@@ -1028,9 +1033,6 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
- iomap_write_failed(iter->inode, pos, bytes);
- iov_iter_revert(i, copied);
-
if (chunk > PAGE_SIZE)
chunk /= 2;
if (copied) {
--
2.49.0
* [PATCH 3/4] iomap: make iomap_write_end() return the number of written length again
2025-09-13 3:37 [PATCH v4 0/4] allow partial folio write with iomap_folio_state alexjlzheng
2025-09-13 3:37 ` [PATCH 1/4] iomap: make sure iomap_adjust_read_range() are aligned with block_size alexjlzheng
2025-09-13 3:37 ` [PATCH 2/4] iomap: move iter revert case out of the unwritten branch alexjlzheng
@ 2025-09-13 3:37 ` alexjlzheng
2025-09-13 3:37 ` [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
2025-09-14 11:40 ` [PATCH v4 0/4] allow partial folio write with iomap_folio_state Pankaj Raghav (Samsung)
4 siblings, 0 replies; 14+ messages in thread
From: alexjlzheng @ 2025-09-13 3:37 UTC (permalink / raw)
To: hch, brauner
Cc: djwong, yi.zhang, linux-xfs, linux-fsdevel, linux-kernel,
Jinliang Zheng
From: Jinliang Zheng <alexjlzheng@tencent.com>
The next patch allows iomap_write_end() to conditionally accept partial
writes, so make iomap_write_end() return the number of accepted bytes in
preparation for that change.
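
A sketch of the caller-side contract after this change, assembled from the hunks in patches 2/4 and 3/4 (simplified, not a complete quote of the code):

```c
	written = iomap_write_end(iter, bytes, copied, folio);

	/* a short return means only part of the copy was accepted */
	if (written < bytes)
		iomap_write_failed(iter->inode, pos + written, bytes - written);
	if (unlikely(copied != written))
		iov_iter_revert(i, copied - written);
```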
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
fs/iomap/buffered-io.c | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 109c3bad6ccf..7b9193f8243a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,7 +873,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
return status;
}
-static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t copied, struct folio *folio)
{
flush_dcache_folio(folio);
@@ -890,11 +890,11 @@ static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
* redo the whole thing.
*/
if (unlikely(copied < len && !folio_test_uptodate(folio)))
- return false;
+ return 0;
iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
filemap_dirty_folio(inode->i_mapping, folio);
- return true;
+ return copied;
}
static void iomap_write_end_inline(const struct iomap_iter *iter,
@@ -915,10 +915,10 @@ static void iomap_write_end_inline(const struct iomap_iter *iter,
}
/*
- * Returns true if all copied bytes have been written to the pagecache,
- * otherwise return false.
+ * Returns the number of copied bytes that have been written to the
+ * pagecache, or zero if the block was only partially updated.
*/
-static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
+static int iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
struct folio *folio)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -926,7 +926,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
if (srcmap->type == IOMAP_INLINE) {
iomap_write_end_inline(iter, folio, pos, copied);
- return true;
+ return copied;
}
if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
@@ -934,7 +934,7 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
bh_written = block_write_end(pos, len, copied, folio);
WARN_ON_ONCE(bh_written != copied && bh_written != 0);
- return bh_written == copied;
+ return bh_written;
}
return __iomap_write_end(iter->inode, pos, len, copied, folio);
@@ -1000,8 +1000,7 @@ static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
flush_dcache_folio(folio);
copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
- written = iomap_write_end(iter, bytes, copied, folio) ?
- copied : 0;
+ written = iomap_write_end(iter, bytes, copied, folio);
/*
* Update the in-memory inode size after copying the data into
@@ -1315,7 +1314,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
do {
struct folio *folio;
size_t offset;
- bool ret;
+ int ret;
bytes = min_t(u64, SIZE_MAX, bytes);
status = iomap_write_begin(iter, write_ops, &folio, &offset,
@@ -1327,7 +1326,7 @@ static int iomap_unshare_iter(struct iomap_iter *iter,
ret = iomap_write_end(iter, bytes, bytes, folio);
__iomap_put_folio(iter, write_ops, bytes, folio);
- if (WARN_ON_ONCE(!ret))
+ if (WARN_ON_ONCE(ret != bytes))
return -EIO;
cond_resched();
@@ -1388,7 +1387,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
do {
struct folio *folio;
size_t offset;
- bool ret;
+ int ret;
bytes = min_t(u64, SIZE_MAX, bytes);
status = iomap_write_begin(iter, write_ops, &folio, &offset,
@@ -1406,7 +1405,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
ret = iomap_write_end(iter, bytes, bytes, folio);
__iomap_put_folio(iter, write_ops, bytes, folio);
- if (WARN_ON_ONCE(!ret))
+ if (WARN_ON_ONCE(ret != bytes))
return -EIO;
status = iomap_iter_advance(iter, &bytes);
--
2.49.0
* [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
2025-09-13 3:37 [PATCH v4 0/4] allow partial folio write with iomap_folio_state alexjlzheng
` (2 preceding siblings ...)
2025-09-13 3:37 ` [PATCH 3/4] iomap: make iomap_write_end() return the number of written length again alexjlzheng
@ 2025-09-13 3:37 ` alexjlzheng
2025-09-15 10:50 ` Pankaj Raghav (Samsung)
2025-09-14 11:40 ` [PATCH v4 0/4] allow partial folio write with iomap_folio_state Pankaj Raghav (Samsung)
4 siblings, 1 reply; 14+ messages in thread
From: alexjlzheng @ 2025-09-13 3:37 UTC (permalink / raw)
To: hch, brauner
Cc: djwong, yi.zhang, linux-xfs, linux-fsdevel, linux-kernel,
Jinliang Zheng
From: Jinliang Zheng <alexjlzheng@tencent.com>
Currently, if a partial write occurs during a buffered write, the entire
write is discarded. While this is an uncommon case, it is still a bit
wasteful and we can do better.

With iomap_folio_state, we can track the uptodate state at the block
level, and a read_folio can correctly handle partially uptodate folios.

Therefore, when a partial write occurs, accept the block-aligned part of
the write instead of rejecting the entire write.

For example, suppose a folio is 2MB, the block size is 4kB, and the
copied bytes are 2MB-3kB.

Without this patchset, we'd need to recopy from the beginning of the
folio in the next iteration, which means 2MB-3kB of data is copied
twice.
|<-------------------- 2MB -------------------->|
+-------+-------+-------+-------+-------+-------+
| block |  ...  | block | block |  ...  | block |  folio
+-------+-------+-------+-------+-------+-------+
|<-4kB->|
|<--------------- copied 2MB-3kB --------->| first time copied
|<-------- 1MB -------->| next time we need copy (chunk /= 2)
|<-------- 1MB -------->| next next time we need copy.
|<------ 2MB-3kB bytes duplicate copy ---->|
With this patchset, we can accept 2MB-4kB of the bytes, which is block-aligned.
This means we only need to process the remaining 4kB in the next iteration,
so only 1kB has to be copied twice.
|<-------------------- 2MB -------------------->|
+-------+-------+-------+-------+-------+-------+
| block |  ...  | block | block |  ...  | block |  folio
+-------+-------+-------+-------+-------+-------+
|<-4kB->|
|<--------------- copied 2MB-3kB --------->| first time copied
                                        |<-4kB->|   next time we need copy
                                        |<>|
                                          only 1kB bytes duplicate copy
Although partial writes are inherently unusual and do not account for a
large share of performance testing, the optimization still makes sense
in large-scale data centers.
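
A standalone sketch of the tail trim on the example above (the names mirror iomap_trim_tail_partial() in the diff below; hypothetical userspace code, not part of the patch):

```c
#include <stdio.h>

int main(void)
{
	unsigned block_size = 4096;
	unsigned long long pos = 0;                         /* folio-aligned write */
	unsigned long copied = 2 * 1024 * 1024 - 3 * 1024;  /* 2MB-3kB copied */
	unsigned last_blk_bytes = (pos + copied) & (block_size - 1);   /* 1024 */

	/* the tail block is not uptodate, so drop its partial 1kB */
	copied -= last_blk_bytes;
	printf("accepted %lu bytes (2MB-4kB)\n", copied);   /* 2093056 */
	return 0;
}
```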
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
1 file changed, 35 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7b9193f8243a..0952a3debe11 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
return status;
}
+static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
+		size_t copied, struct folio *folio)
+{
+	struct iomap_folio_state *ifs = folio->private;
+	unsigned block_size, last_blk, last_blk_bytes;
+
+	if (!ifs || !copied)
+		return 0;
+
+	block_size = 1 << inode->i_blkbits;
+	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
+	last_blk_bytes = (pos + copied) & (block_size - 1);
+
+	if (!ifs_block_is_uptodate(ifs, last_blk))
+		copied -= min(copied, last_blk_bytes);
+
+	return copied;
+}
+
static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t copied, struct folio *folio)
{
@@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
/*
* The blocks that were entirely written will now be uptodate, so we
* don't have to worry about a read_folio reading them and overwriting a
- * partial write. However, if we've encountered a short write and only
- * partially written into a block, it will not be marked uptodate, so a
- * read_folio might come in and destroy our partial write.
+ * partial write.
*
- * Do the simplest thing and just treat any short write to a
- * non-uptodate page as a zero-length write, and force the caller to
- * redo the whole thing.
+ * However, if we've encountered a short write and only partially
+ * written into a block, we must discard the short-written _tail_ block
+ * and not mark it uptodate in the ifs, to ensure a read_folio reading
+ * can handle it correctly via iomap_adjust_read_range(). It's safe to
+ * keep the non-tail block writes because we know that a non-tail
+ * block:
+ * - is either fully written, since copy_from_user() is sequential,
+ * - or is a partially written head block that has already been read in
+ *   and marked uptodate in the ifs by iomap_write_begin().
*/
- if (unlikely(copied < len && !folio_test_uptodate(folio)))
- return 0;
- iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
+ if (unlikely(copied < len && !folio_test_uptodate(folio))) {
+ copied = iomap_trim_tail_partial(inode, pos, copied, folio);
+ if (!copied)
+ return 0;
+ }
+ iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
filemap_dirty_folio(inode->i_mapping, folio);
return copied;
--
2.49.0
* Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
2025-09-13 3:37 ` [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
@ 2025-09-15 10:50 ` Pankaj Raghav (Samsung)
2025-09-15 11:12 ` Jinliang Zheng
0 siblings, 1 reply; 14+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-15 10:50 UTC (permalink / raw)
To: alexjlzheng
Cc: hch, brauner, djwong, yi.zhang, linux-xfs, linux-fsdevel,
linux-kernel, Jinliang Zheng
> +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> + size_t copied, struct folio *folio)
> +{
> + struct iomap_folio_state *ifs = folio->private;
> + unsigned block_size, last_blk, last_blk_bytes;
> +
> + if (!ifs || !copied)
> + return 0;
> +
> + block_size = 1 << inode->i_blkbits;
> + last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> + last_blk_bytes = (pos + copied) & (block_size - 1);
> +
> + if (!ifs_block_is_uptodate(ifs, last_blk))
> + copied -= min(copied, last_blk_bytes);
If pos is aligned to block_size, is there a scenario where
copied < last_blk_bytes?
Trying to understand why you are using a min() here.
--
Pankaj
* Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
2025-09-15 10:50 ` Pankaj Raghav (Samsung)
@ 2025-09-15 11:12 ` Jinliang Zheng
2025-09-15 11:29 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 14+ messages in thread
From: Jinliang Zheng @ 2025-09-15 11:12 UTC (permalink / raw)
To: kernel
Cc: alexjlzheng, alexjlzheng, brauner, djwong, hch, linux-fsdevel,
linux-kernel, linux-xfs, yi.zhang
On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > + size_t copied, struct folio *folio)
> > +{
> > + struct iomap_folio_state *ifs = folio->private;
> > + unsigned block_size, last_blk, last_blk_bytes;
> > +
> > + if (!ifs || !copied)
> > + return 0;
> > +
> > + block_size = 1 << inode->i_blkbits;
> > + last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > + last_blk_bytes = (pos + copied) & (block_size - 1);
> > +
> > + if (!ifs_block_is_uptodate(ifs, last_blk))
> > + copied -= min(copied, last_blk_bytes);
>
> If pos is aligned to block_size, is there a scenario where
> copied < last_blk_bytes?
I believe there is no other scenario. The min() here is specifically to handle cases where
pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
in iomap_adjust_read_range().
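
For instance, with hypothetical numbers where pos is unaligned (a standalone sketch):

```c
#include <stdio.h>

int main(void)
{
	unsigned block_size = 4096;
	unsigned long long pos = 5120;   /* 1kB past a block boundary */
	unsigned long copied = 512;      /* short copy, smaller than the tail bytes */
	unsigned last_blk_bytes = (pos + copied) & (block_size - 1);   /* 1536 */

	/* copied < last_blk_bytes, so min() is what avoids an underflow */
	copied -= copied < last_blk_bytes ? copied : last_blk_bytes;
	printf("copied after trim: %lu\n", copied);   /* 0 */
	return 0;
}
```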
thanks,
Jinliang Zheng. :)
>
> Trying to understand why you are using a min() here.
> --
> Pankaj
* Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state
2025-09-15 11:12 ` Jinliang Zheng
@ 2025-09-15 11:29 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 14+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-15 11:29 UTC (permalink / raw)
To: Jinliang Zheng
Cc: alexjlzheng, brauner, djwong, hch, linux-fsdevel, linux-kernel,
linux-xfs, yi.zhang
On Mon, Sep 15, 2025 at 07:12:28PM +0800, Jinliang Zheng wrote:
> On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > > + size_t copied, struct folio *folio)
> > > +{
> > > + struct iomap_folio_state *ifs = folio->private;
> > > + unsigned block_size, last_blk, last_blk_bytes;
> > > +
> > > + if (!ifs || !copied)
> > > + return 0;
> > > +
> > > + block_size = 1 << inode->i_blkbits;
> > > + last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > > + last_blk_bytes = (pos + copied) & (block_size - 1);
> > > +
> > > + if (!ifs_block_is_uptodate(ifs, last_blk))
> > > + copied -= min(copied, last_blk_bytes);
> >
> > If pos is aligned to block_size, is there a scenario where
> > copied < last_blk_bytes?
>
> I believe there is no other scenario. The min() here is specifically to handle cases where
> pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
> in iomap_adjust_read_range().
Ah, you are right. This is about write and not read. I got a bit
confused after reading both the patches back to back.
--
Pankaj
* Re: [PATCH v4 0/4] allow partial folio write with iomap_folio_state
2025-09-13 3:37 [PATCH v4 0/4] allow partial folio write with iomap_folio_state alexjlzheng
` (3 preceding siblings ...)
2025-09-13 3:37 ` [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state alexjlzheng
@ 2025-09-14 11:40 ` Pankaj Raghav (Samsung)
2025-09-14 13:30 ` Jinliang Zheng
4 siblings, 1 reply; 14+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-14 11:40 UTC (permalink / raw)
To: alexjlzheng
Cc: hch, brauner, djwong, yi.zhang, linux-xfs, linux-fsdevel,
linux-kernel, Jinliang Zheng
On Sat, Sep 13, 2025 at 11:37:14AM +0800, alexjlzheng@gmail.com wrote:
> This patchset has been tested with xfstests' generic and xfs groups, and
> there are no new failures compared to the latest upstream kernel.
Do you know if there is a specific test from generic/ or xfs/ in
xfstests that is testing this path?
As this is slightly changing the behaviour of a partial write, it would
be nice to either add a test or highlight which test is hitting this
path in the cover letter.
--
Pankaj
* Re: [PATCH v4 0/4] allow partial folio write with iomap_folio_state
2025-09-14 11:40 ` [PATCH v4 0/4] allow partial folio write with iomap_folio_state Pankaj Raghav (Samsung)
@ 2025-09-14 13:30 ` Jinliang Zheng
0 siblings, 0 replies; 14+ messages in thread
From: Jinliang Zheng @ 2025-09-14 13:30 UTC (permalink / raw)
To: kernel
Cc: alexjlzheng, alexjlzheng, brauner, djwong, hch, linux-fsdevel,
linux-kernel, linux-xfs, yi.zhang
On Sun, 14 Sep 2025 13:40:30 +0200, kernel@pankajraghav.com wrote:
> On Sat, Sep 13, 2025 at 11:37:14AM +0800, alexjlzheng@gmail.com wrote:
> > This patchset has been tested with xfstests' generic and xfs groups, and
> > there are no new failures compared to the latest upstream kernel.
>
> Do you know if there is a specific test from generic/ or xfs/ in
> xfstests that is testing this path?
It seems not. But there is a chance that existing buffered write tests will hit this path.
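
One way to provoke a short copy in the buffered write path from userspace (a sketch, not an existing xfstest) is to let the source buffer run into a PROT_NONE page so the in-kernel copy faults partway through; error handling is omitted for brevity:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(buf, 'a', psz);
	mprotect(buf + psz, psz, PROT_NONE);   /* tail of the source now faults */

	int fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	ssize_t ret = write(fd, buf, 2 * psz); /* expect a short write */
	printf("wrote %zd of %ld bytes\n", ret, 2 * psz);
	close(fd);
	return 0;
}
```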
thanks,
Jinliang Zheng. :)
>
> As this is slightly changing the behaviour of a partial write, it would
> be nice to either add a test or highlight which test is hitting this
> path in the cover letter.
>
> --
> Pankaj