[PATCH] ext4: avoid full buffer walks for large folio partial writes

* [PATCH] ext4: avoid full buffer walks for large folio partial writes
@ 2026-06-03 13:48 Jia Zhu
  2026-06-03 18:11 ` Matthew Wilcox
  2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
  0 siblings, 2 replies; 11+ messages in thread
From: Jia Zhu @ 2026-06-03 13:48 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Ext4 buffered writes into large folios still walk every buffer_head in the
folio in ext4_block_write_begin() and again in block_commit_write(). Before
regular files used large folios this was cheap, but a large folio can
contain hundreds of buffer_heads. Small overwrites of an existing large
folio therefore pay work proportional to the folio size instead of the
write size.

This is visible when the page cache is first populated with large folios
and then a small range is overwritten. The numbers below come from a local
libMicro-based microbenchmark. Each round first drops caches, writes a
10 MiB file with dd to instantiate large page-cache folios, and then runs
libMicro's write, pwrite, or writev benchmark for a small buffered
overwrite. The writev cases use libMicro's default vector count of 10.

A representative pwrite round is:

	sync
	echo 3 > /proc/sys/vm/drop_caches
	dd if=/dev/zero of=$file bs=1024k count=10
	taskset -c 0 ./bin/pwrite -H -C 50 -D 3 -S -N pwrite_u1k \
		-s 1k -f $file

To avoid comparing this change with an older kernel, the benchmark uses two
kernels built from the same master tree: one with this change and one with
only this change reverted. With THP=always and 10 dd-prefill rounds, median
latencies were:

			nofix		patched		improvement
	write_u1k	1.418 usec	0.342 usec	75.9%
	write_u10k	1.887 usec	0.409 usec	78.3%
	write_u100k	4.114 usec	2.554 usec	37.9%
	pwrite_u1k	1.677 usec	0.335 usec	80.1%
	pwrite_u10k	1.903 usec	0.410 usec	78.5%
	pwrite_u100k	4.101 usec	2.563 usec	37.5%
	writev_u1k	2.285 usec	0.756 usec	66.9%
	writev_u10k	4.655 usec	3.025 usec	35.0%

Start the ext4 write_begin walk at the first buffer that overlaps the
write. For already-uptodate large folio overwrites, add a partial commit
path which marks only the written buffers uptodate and dirty. Leave
non-uptodate folios on the old full-buffer commit path so BH_New cleanup
and folio-uptodate discovery are preserved.

Partially uptodate large folios remain described by per-buffer state, which
is what block_is_partially_uptodate() and read_folio use for later reads.

Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c | 21 ++++++++++----------
 2 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..e0c5868b088be 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2092,6 +2092,44 @@ int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
+static struct buffer_head *folio_buffer_seek(struct buffer_head *head,
+					     unsigned int blocksize,
+					     size_t offset,
+					     size_t *block_start)
+{
+	size_t nr = offset / blocksize;
+
+	*block_start = nr * blocksize;
+	while (nr--)
+		head = head->b_this_page;
+	return head;
+}
+
+static void block_commit_write_range(struct buffer_head *head,
+				     unsigned int blocksize, size_t from,
+				     size_t to)
+{
+	size_t block_start, block_end;
+	struct buffer_head *bh;
+
+	if (from == to)
+		return;
+	if (WARN_ON_ONCE(to > folio_size(head->b_folio)))
+		return;
+
+	bh = folio_buffer_seek(head, blocksize, from, &block_start);
+	do {
+		block_end = block_start + blocksize;
+		set_buffer_uptodate(bh);
+		mark_buffer_dirty(bh);
+		if (buffer_new(bh))
+			clear_buffer_new(bh);
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (block_start < to && bh != head);
+}
+
 void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
@@ -2104,6 +2142,19 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 		return;
 	blocksize = bh->b_size;
 
+	/*
+	 * Large folios can carry hundreds of buffer_heads.  For partial writes,
+	 * keep commit work local to the written range; partially uptodate
+	 * reads remain governed by the buffer state.
+	 */
+	if (folio_test_large(folio) && from < to &&
+	    folio_test_uptodate(folio) &&
+	    to <= folio_size(folio) &&
+	    (from != 0 || to != folio_size(folio))) {
+		block_commit_write_range(head, blocksize, from, to);
+		return;
+	}
+
 	block_start = 0;
 	do {
 		block_end = block_start + blocksize;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..e58bba0289eba 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1180,7 +1180,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	unsigned int blocksize = i_blocksize(inode);
 	struct buffer_head *bh, *head, *wait[2];
 	int nr_wait = 0;
-	int i;
+	unsigned int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
 
 	BUG_ON(!folio_test_locked(folio));
@@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	head = folio_buffers(folio);
 	if (!head)
 		head = create_empty_buffers(folio, blocksize, 0);
-	block = EXT4_PG_TO_LBLK(inode, folio->index);
+	if (from == to)
+		return 0;
+	block_start = round_down(from, blocksize);
+	block = EXT4_PG_TO_LBLK(inode, folio->index) +
+		(block_start >> inode->i_blkbits);
+	bh = head;
+	for (i = 0; i < block_start; i += blocksize)
+		bh = bh->b_this_page;
 
-	for (bh = head, block_start = 0; bh != head || !block_start;
-	    block++, block_start = block_end, bh = bh->b_this_page) {
+	for (; block_start < to;
+	     block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
-		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
-				set_buffer_uptodate(bh);
-			}
-			continue;
-		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.39.5 (Apple Git-154)

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
  2026-06-03 13:48 [PATCH] ext4: avoid full buffer walks for large folio partial writes Jia Zhu
@ 2026-06-03 18:11 ` Matthew Wilcox
  2026-06-05  9:02   ` Jia Zhu
  2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
  1 sibling, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2026-06-03 18:11 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Wed, Jun 03, 2026 at 09:48:00PM +0800, Jia Zhu wrote:
> Ext4 buffered writes into large folios still walk every buffer_head in the
> folio in ext4_block_write_begin() and again in block_commit_write(). Before
> regular files used large folios this was cheap, but a large folio can
> contain hundreds of buffer_heads. Small overwrites of an existing large
> folio therefore pay work proportional to the folio size instead of the
> write size.

Is this a common case for you, or is this something you noticed by
inspection?

> Start the ext4 write_begin walk at the first buffer that overlaps the
> write. For already-uptodate large folio overwrites, add a partial commit
> path which marks only the written buffers uptodate and dirty. Leave
> non-uptodate folios on the old full-buffer commit path so BH_New cleanup
> and folio-uptodate discovery are preserved.

Wouldn't you get just as much benefit from this?

+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
        size_t block_start, block_end;
        bool partial = false;
+       bool uptodate = folio_test_uptodate(folio);
        unsigned blocksize;
        struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
                        clear_buffer_new(bh);

                block_start = block_end;
+               if (uptodate && block_start >= to)
+                       break;
                bh = bh->b_this_page;
        } while (bh != head);

> @@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  	head = folio_buffers(folio);
>  	if (!head)
>  		head = create_empty_buffers(folio, blocksize, 0);
> -	block = EXT4_PG_TO_LBLK(inode, folio->index);
> +	if (from == to)
> +		return 0;
> +	block_start = round_down(from, blocksize);
> +	block = EXT4_PG_TO_LBLK(inode, folio->index) +
> +		(block_start >> inode->i_blkbits);
> +	bh = head;
> +	for (i = 0; i < block_start; i += blocksize)
> +		bh = bh->b_this_page;
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> -	    block++, block_start = block_end, bh = bh->b_this_page) {
> +	for (; block_start < to;
> +	     block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
> -		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> -				set_buffer_uptodate(bh);
> -			}
> -			continue;
> -		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
>  			clear_buffer_new(bh);
>  		if (!buffer_mapped(bh)) {
> 

I'm unconvinced that this is safe ... but all of this is a distraction
from what we should really be doing, which is converting ext4 to use
iomap instead of buffer heads.

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
  2026-06-03 18:11 ` Matthew Wilcox
@ 2026-06-05  9:02   ` Jia Zhu
  2026-06-05 14:24     ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Jia Zhu @ 2026-06-05  9:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> Is this a common case for you, or is this something you noticed by
> inspection?

This was found by our kernel release benchmark.  We run libMicro as part
of that test suite:

  https://github.com/rzezeski/libMicro

The regression shows up in buffered write/pwrite/writev overwrite tests
on ext4 large folios.

> Wouldn't you get just as much benefit from this?

Yes.  I tested this approach, and it gives almost the same result as my
original partial-commit helper.

I agree this is a better direction for block_commit_write().  It keeps the
existing buffer-head state handling and only stops the tail walk after an
already-uptodate folio has been committed through @to.  That removes the
main large-folio cost in our small-overwrite benchmark while keeping the
change much closer to the old code.

> I'm unconvinced that this is safe ...

Agreed.  The original ext4_block_write_begin() change was too aggressive.
Seeking directly to @from also skips the prefix buffers, which makes the
old side effects harder to prove.

For v2 I plan to drop that part and keep the existing walk from the head.
The ext4 change would only stop after @to when the folio was already
uptodate on entry, similar to your block_commit_write() suggestion:

+       bool folio_uptodate = folio_test_uptodate(folio);
+
        for (bh = head, block_start = 0;
-            bh != head || !block_start;
+            (bh != head || !block_start) &&
+            (!folio_uptodate || block_start < to);
             block++, block_start = block_end, bh = bh->b_this_page) {
                ...
        }

So the prefix path and all in-range handling stay unchanged.  The only
skipped work is the tail part after @to, and only for a folio that was
already uptodate before write_begin() started.
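
As a rough illustration of how much work that loop condition skips, here
is a user-space toy model (not kernel code).  The names toy_bh,
toy_link_ring() and toy_write_begin_visits() are hypothetical; only the
loop condition is taken from the snippet above, and the loop body is
reduced to a visit counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct buffer_head: only the ring link is modelled. */
struct toy_bh {
	struct toy_bh *b_this_page;
};

/* Link n toy buffers into a ring, the way a folio's buffers are chained. */
static void toy_link_ring(struct toy_bh *bhs, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		bhs[i].b_this_page = &bhs[(i + 1) % n];
}

/*
 * Count loop iterations for a write ending at byte offset `to`.
 * The walk always starts at the head, so prefix buffers are still visited;
 * the tail after `to` is skipped only when the folio was uptodate on entry.
 */
static size_t toy_write_begin_visits(struct toy_bh *head, size_t blocksize,
				     size_t to, bool folio_uptodate)
{
	size_t block_start = 0, block_end, visits = 0;
	struct toy_bh *bh;

	for (bh = head;
	     (bh != head || !block_start) &&
	     (!folio_uptodate || block_start < to);
	     block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		visits++;
	}
	return visits;
}
```

With 512 toy buffers of 4096 bytes (the shape of a 2 MiB folio with 4 KiB
blocks), a write ending at offset 1024 visits 1 buffer when the folio is
uptodate and all 512 otherwise.  A write ending at offset 9216 visits 3
buffers in the uptodate case, because the two prefix buffers are still
walked.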

> ... converting ext4 to use iomap instead of buffer heads.

I strongly agree that iomap is the right direction for ext4.  The iomap
buffered write path would make this particular buffer-head walk cost go
away.

The reason I am still looking at this path is that the regression is
visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
by the ext4 large-folio enablement in v6.16.  For example, in our
libMicro release benchmark with THP always enabled, usecs/call, lower is
better:

case        v6.12        v6.18        regression
write_u1k   0.609        4.659        +665.0%
write_u10k  1.408        4.869        +245.8%

The iomap conversion is the long-term fix, but it does not help kernels
which still use the buffer-head buffered write path.  I would like to keep
this as a small regression fix for that path, and make it minimal enough
to be suitable for stable/LTS backport.

Would this v2 direction look OK to you?

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
  2026-06-05  9:02   ` Jia Zhu
@ 2026-06-05 14:24     ` Matthew Wilcox
  2026-06-08 11:56       ` Jia Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2026-06-05 14:24 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Fri, Jun 05, 2026 at 05:02:53PM +0800, Jia Zhu wrote:
> On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> > Is this a common case for you, or is this something you noticed by
> > inspection?
> 
> This was found by our kernel release benchmark.  We run libMicro as part
> of that test suite:
> 
>   https://github.com/rzezeski/libMicro
> 
> The regression shows up in buffered write/pwrite/writev overwrite tests
> on ext4 large folios.

Makes sense.  I'll assume this can correspond to a reasonable workload.
It certainly seems like something that could exist.

> > Wouldn't you get just as much benefit from this?
> 
> Yes.  I tested this approach, and it gives almost the same result as my
> original partial-commit helper.

Excellent!  Obviously it'd be even better if we didn't have to walk the
leading buffer_heads ... but there's no way to do this with the data
structure we have.

> Agreed.  The original ext4_block_write_begin() change was too aggressive.
> Seeking directly to @from also skips the prefix buffers, which makes the
> old side effects harder to prove.
> 
> For v2 I plan to drop that part and keep the existing walk from the head.
> The ext4 change would only stop after @to when the folio was already
> uptodate on entry, similar to your block_commit_write() suggestion:
> 
> +       bool folio_uptodate = folio_test_uptodate(folio);
> +
>         for (bh = head, block_start = 0;
> -            bh != head || !block_start;
> +            (bh != head || !block_start) &&
> +            (!folio_uptodate || block_start < to);
>              block++, block_start = block_end, bh = bh->b_this_page) {
>                 ...
>         }

Yes, I think that's a good approach.

> So the prefix path and all in-range handling stay unchanged.  The only
> skipped work is the tail part after @to, and only for a folio that was
> already uptodate before write_begin() started.
> 
> > ... converting ext4 to use iomap instead of buffer heads.
> 
> I strongly agree that iomap is the right direction for ext4.  The iomap
> buffered write path would make this particular buffer-head walk cost go
> away.
> 
> The reason I am still looking at this path is that the regression is
> visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> by the ext4 large-folio enablement in v6.16.  For example, in our
> libMicro release benchmark with THP always enabled, usecs/call, lower is
> better:
> 
> case        v6.12        v6.18        regression
> write_u1k   0.609        4.659        +665.0%
> write_u10k  1.408        4.869        +245.8%

Ouch ;-)  No wonder you want to address this.  Do you recover all the
regression with this fix?

> The iomap conversion is the long-term fix, but it does not help kernels
> which still use the buffer-head buffered write path.  I would like to keep
> this as a small regression fix for that path, and make it minimal enough
> to be suitable for stable/LTS backport.

Is it that you're using some ext4 features that aren't supported by
iomap yet?  Could you say which ones?  That might motivate someone to
prioritise that support.

> Would this v2 direction look OK to you?

Absolutely.  Very happy with this approach.

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
  2026-06-05 14:24     ` Matthew Wilcox
@ 2026-06-08 11:56       ` Jia Zhu
  0 siblings, 0 replies; 11+ messages in thread
From: Jia Zhu @ 2026-06-08 11:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Fri, Jun 05, 2026 at 03:24:20PM +0100, Matthew Wilcox wrote:
> > The reason I am still looking at this path is that the regression is
> > visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> > by the ext4 large-folio enablement in v6.16.  For example, in our
> > libMicro release benchmark with THP always enabled, usecs/call, lower is
> > better:
> > 
> > case        v6.12        v6.18        regression
> > write_u1k   0.609        4.659        +665.0%
> > write_u10k  1.408        4.869        +245.8%
> 
> Ouch ;-)  No wonder you want to address this.  Do you recover all the
> regression with this fix?

With the full v2 series applied to v6.18, the small overwrite cases look
like this.  Results are usecs/call, lower is better:

case           v6.12    v6.18   v6.18 + series
write_u1k      0.609    4.659       0.528
write_u10k     1.408    4.869       0.809
pwrite_u1k     0.609    4.659       0.538
pwrite_u10k    1.399    4.889       0.819
writev_u1k     2.238    5.277       1.179
writev_u10k   11.057    8.029       4.219

This matches the regression I was trying to address.

> > The iomap conversion is the long-term fix, but it does not help kernels
> > which still use the buffer-head buffered write path.  I would like to keep
> > this as a small regression fix for that path, and make it minimal enough
> > to be suitable for stable/LTS backport.
> 
> Is it that you're using some ext4 features that aren't supported by
> iomap yet?  Could you say which ones?  That might motivate someone to
> prioritise that support.

No, this benchmark is not using a specific ext4 feature that prevents
iomap.  It is just the default ext4 buffered write path on a regular
file.

I agree that iomap looks like the better long-term direction for ext4
buffered writes.  This small fix is mainly motivated by current/LTS
kernels that still have the buffer-head path (from v6.16 through current
mainline, until ext4 buffered writes are converted to iomap upstream),
where the large-folio enablement made this tail-walk cost visible.

* [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes
  2026-06-03 13:48 [PATCH] ext4: avoid full buffer walks for large folio partial writes Jia Zhu
  2026-06-03 18:11 ` Matthew Wilcox
@ 2026-06-08 12:01 ` Jia Zhu
  2026-06-08 12:01   ` [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios Jia Zhu
  2026-06-08 12:01   ` [PATCH v2 2/2] ext4: avoid tail write_begin " Jia Zhu
  1 sibling, 2 replies; 11+ messages in thread
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Hi,

This series addresses a buffered-write regression we found during our
v6.12 -> v6.18 LTS upgrade testing on ext4.

The regression is in the remaining buffer_head path.  A small overwrite
of an already cached, uptodate large folio still walks every buffer_head
attached to the folio in both write_begin and write_end.  With order-0
folios this was bounded by the page size.  After ext4 enabled large
folios for regular files, the same loops became proportional to the
folio size.

I agree that converting ext4 buffered I/O to iomap is the right long-term
direction, and that would avoid this problem.  This series is meant as a
small fix for current and LTS kernels that still use the buffer_head path.

Patch 1 follows Willy's suggestion for block_commit_write(): if the folio
was already uptodate on entry, stop the commit walk once the copied range
has been processed.

Patch 2 applies the same conservative shape to ext4_block_write_begin().
It keeps walking from the first buffer, so prefix buffer state handling is
unchanged, and only skips the suffix for folios that were already
uptodate on entry.

The workload is from libMicro, which we use in kernel release testing:

  https://github.com/rzezeski/libMicro

The table below includes the v6.12 baseline from the same release
benchmark.  The v6.12 and v6.18 columns were run with THP=always.  The
last column is v6.18 with this series applied.  Results are usecs/call,
lower is better, and the improvement is relative to unpatched v6.18.

case           v6.12    v6.18   v6.18 + series   improvement
write_u1k      0.609    4.659       0.528           88.7%
write_u10k     1.408    4.869       0.809           83.4%
pwrite_u1k     0.609    4.659       0.538           88.5%
pwrite_u10k    1.399    4.889       0.819           83.2%
writev_u1k     2.238    5.277       1.179           77.7%
writev_u10k   11.057    8.029       4.219           47.5%

For the cases that regressed from v6.12 to v6.18 in this test, this
series brings the v6.18 numbers back below the v6.12 cost.
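
For readers who want to re-derive the percentage columns, the arithmetic
can be sketched as below.  The helper names improvement_pct() and
regression_pct() are hypothetical and exist only for this illustration;
the inputs are the usecs/call values from the tables in this thread.

```c
#include <assert.h>

/* Relative latency reduction, in percent, going from `before` to `after`. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Relative latency increase, in percent, going from `base` to `worse`. */
static double regression_pct(double base, double worse)
{
	assert(base > 0.0);
	return (worse - base) / base * 100.0;
}
```

For write_u1k, improvement_pct(4.659, 0.528) is about 88.67, which rounds
to the 88.7% in the table, and regression_pct(0.609, 4.659) is about
665.02, matching the +665.0% quoted earlier in the thread.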

Link: https://lore.kernel.org/all/20260603134800.25155-1-zhujia.zj@bytedance.com/

Changes since v1:
- replace the ext4 seek-to-@from optimization with a conservative tail
  break that preserves prefix buffer handling;
- add the block_commit_write() tail break suggested by Willy;
- add v6.12 and v6.18 benchmark results for the full series.

Jia Zhu (2):
  fs/buffer: avoid tail commit walk for uptodate folios
  ext4: avoid tail write_begin walk for uptodate folios

 fs/buffer.c     |  3 +++
 fs/ext4/inode.c | 12 +++++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)

-- 
2.20.1

* [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios
  2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
@ 2026-06-08 12:01   ` Jia Zhu
  2026-06-08 13:06     ` Jan Kara
  2026-06-08 12:01   ` [PATCH v2 2/2] ext4: avoid tail write_begin " Jia Zhu
  1 sibling, 1 reply; 11+ messages in thread
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

block_commit_write() always walks every buffer_head attached to the
folio.  That was cheap for order-0 folios, but large folios can contain
hundreds of buffer_heads.  For a small buffered overwrite of an
already-uptodate large folio, the commit work is therefore proportional
to the folio size rather than the copied range.

This became visible with ext4 regular-file large folios, where cached
small overwrites reach block_commit_write() through block_write_end().
Before ext4 enabled large folios for regular files, this path was only
hit with order-0 folios for normal ext4 buffered writes, so the full walk
was bounded.  The ext4 large-folio commit is therefore the regression
point for this generic helper cost.

The full walk is still needed when the folio is not uptodate, because
block_commit_write() uses per-buffer uptodate state to decide whether
the whole folio can be marked uptodate.  Keep those folios on the old
full-buffer path.

For a folio that was already uptodate on entry, the commit no longer
needs tail buffers for folio-uptodate discovery.  The copied range has
already been processed once block_start reaches @to, so stop there and
avoid the suffix walk.
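
The reasoning above can be sketched with a user-space toy model (not the
kernel function itself).  toy_bh, toy_folio, toy_folio_init() and
toy_commit_write() are hypothetical names, and the flags are plain bools
rather than real buffer or folio state bits.  The model keeps only two
things: the per-buffer check that decides whether the folio may be marked
uptodate, and the early break for folios that were uptodate on entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_bh {
	bool uptodate;
	bool dirty;
	struct toy_bh *b_this_page;
};

struct toy_folio {
	bool uptodate;
	struct toy_bh *head;
};

/* Build a ring of n buffers; every flag starts from the folio's state. */
static void toy_folio_init(struct toy_folio *folio, struct toy_bh *bhs,
			   size_t n, bool uptodate)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		bhs[i].uptodate = uptodate;
		bhs[i].dirty = false;
		bhs[i].b_this_page = &bhs[(i + 1) % n];
	}
	folio->uptodate = uptodate;
	folio->head = bhs;
}

/* Commit bytes [from, to) and return how many buffers were visited. */
static size_t toy_commit_write(struct toy_folio *folio, size_t blocksize,
			       size_t from, size_t to)
{
	bool was_uptodate = folio->uptodate;
	bool partial = false;
	size_t block_start = 0, block_end, visits = 0;
	struct toy_bh *bh = folio->head;

	assert(bh && blocksize > 0 && from <= to);
	do {
		block_end = block_start + blocksize;
		visits++;
		if (block_end <= from || block_start >= to) {
			/* Outside the write: only note stale buffers. */
			if (!bh->uptodate)
				partial = true;
		} else {
			bh->uptodate = true;
			bh->dirty = true;
		}
		block_start = block_end;
		/* Tail break: the folio state is already known. */
		if (was_uptodate && block_start >= to)
			break;
		bh = bh->b_this_page;
	} while (bh != folio->head);

	if (!partial)
		folio->uptodate = true;
	return visits;
}
```

In this model a 1 KiB commit at offset 0 of an uptodate 512-buffer folio
visits one buffer.  The same commit on a non-uptodate folio visits all
512, because any stale buffer in the tail must keep the folio from being
marked uptodate.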

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..c8c41c799030d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
 	bool partial = false;
+	bool uptodate = folio_test_uptodate(folio);
 	unsigned blocksize;
 	struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			clear_buffer_new(bh);

 		block_start = block_end;
+		if (uptodate && block_start >= to)
+			break;
 		bh = bh->b_this_page;
 	} while (bh != head);

-- 
2.20.1

* [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
  2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
  2026-06-08 12:01   ` [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios Jia Zhu
@ 2026-06-08 12:01   ` Jia Zhu
  2026-06-08 14:29     ` Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Ext4 buffered writes into large folios also pay a full buffer_head
walk in ext4_block_write_begin().  For a small overwrite of an existing
cached folio, the folio is already uptodate and the write only needs to
prepare the buffers through the written range.  Walking the suffix still
makes the write_begin cost proportional to the folio size.

Before ext4 enabled large folios for regular files, the same loop was
bounded by a single page of buffers.  That commit made the existing
full-folio walk visible as a regression for cached small overwrites.

The suffix walk is needed for non-uptodate folios, where ext4 may have
to submit reads for partial blocks, preserve new-buffer cleanup, and run
error zeroing.  Keep those folios on the old full walk.

For already-uptodate folios, keep the walk starting at the first buffer
rather than seeking directly to @from.  This preserves the existing prefix
buffer state handling.  Stop once block_start reaches the end of the
write range, because the skipped suffix would only repeat the
outside-range uptodate handling for buffers beyond @to.

On current master, the libMicro ext4 large-folio overwrite test shows
the following full-series result.  Results are median usecs/call over 10
runs, lower is better:

case        nofix     this series   improvement
write_u1k   1.418     0.3405        76.0%
write_u10k  1.887     0.4175        77.9%
pwrite_u1k  1.6775    0.3390        79.8%
pwrite_u10k 1.9035    0.4130        78.3%

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/ext4/inode.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..d63785fcd2acb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1182,6 +1182,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	int nr_wait = 0;
 	int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
+	bool folio_uptodate = folio_test_uptodate(folio);

 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(to > folio_size(folio));
@@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 		head = create_empty_buffers(folio, blocksize, 0);
 	block = EXT4_PG_TO_LBLK(inode, folio->index);

-	for (bh = head, block_start = 0; bh != head || !block_start;
+	for (bh = head, block_start = 0;
+	     (bh != head || !block_start) &&
+	     (!folio_uptodate || block_start < to);
 	    block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
 		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
+			if (folio_uptodate)
 				set_buffer_uptodate(bh);
-			}
 			continue;
 		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
@@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				if (should_journal_data)
 					do_journal_get_write_access(handle,
 								    inode, bh);
-				if (folio_test_uptodate(folio)) {
+				if (folio_uptodate) {
 					/*
 					 * Unlike __block_write_begin() we leave
 					 * dirtying of new uptodate buffers to
@@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				continue;
 			}
 		}
-		if (folio_test_uptodate(folio)) {
+		if (folio_uptodate) {
 			set_buffer_uptodate(bh);
 			continue;
 		}
-- 
2.20.1

* Re: [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios
  2026-06-08 12:01   ` [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios Jia Zhu
@ 2026-06-08 13:06     ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2026-06-08 13:06 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Mon 08-06-26 20:01:30, Jia Zhu wrote:
> block_commit_write() always walks every buffer_head attached to the
> folio.  That was cheap for order-0 folios, but large folios can contain
> hundreds of buffer_heads.  For a small buffered overwrite of an
> already-uptodate large folio, the commit work is therefore proportional
> to the folio size rather than the copied range.
> 
> This became visible with ext4 regular-file large folios, where cached
> small overwrites reach block_commit_write() through block_write_end().
> Before ext4 enabled large folios for regular files, this path was only
> hit with order-0 folios for normal ext4 buffered writes, so the full walk
> was bounded.  The ext4 large-folio commit is therefore the regression
> point for this generic helper cost.
> 
> The full walk is still needed when the folio is not uptodate, because
> block_commit_write() uses per-buffer uptodate state to decide whether
> the whole folio can be marked uptodate.  Keep those folios on the old
> full-buffer path.
> 
> For a folio that was already uptodate on entry, the commit no longer
> needs tail buffers for folio-uptodate discovery.  The copied range has
> already been processed once block_start reaches @to, so stop there and
> avoid the suffix walk.
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496e..c8c41c799030d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  {
>  	size_t block_start, block_end;
>  	bool partial = false;
> +	bool uptodate = folio_test_uptodate(folio);
>  	unsigned blocksize;
>  	struct buffer_head *bh, *head;
>  
> @@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  			clear_buffer_new(bh);
>  
>  		block_start = block_end;
> +		if (uptodate && block_start >= to)
> +			break;
>  		bh = bh->b_this_page;
>  	} while (bh != head);
>  
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
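
The effect of the two added lines can be modelled in userspace. The sketch below is not kernel code: it replaces the buffer_head ring with a plain counter, and the function name, the early_exit switch, and the folio geometry used in the note afterwards are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only: counts how many buffers a commit-style walk
 * visits.  nr_bufs and early_exit are hypothetical parameters; no kernel
 * structures are involved.
 */
static size_t commit_walk_visits(size_t nr_bufs, size_t blocksize,
				 size_t to, bool uptodate, bool early_exit)
{
	size_t block_start = 0, visited = 0, i = 0;

	assert(nr_bufs > 0 && blocksize > 0 && to <= nr_bufs * blocksize);

	do {
		size_t block_end = block_start + blocksize;

		visited++;	/* per-buffer work would happen here */
		block_start = block_end;
		/* Mirrors the check added by the patch. */
		if (early_exit && uptodate && block_start >= to)
			break;
		i++;		/* stands in for bh = bh->b_this_page */
	} while (i != nr_bufs);	/* stands in for bh != head */

	return visited;
}
```

Assuming a 2 MiB folio with 4 KiB blocks (512 buffers), a write ending at byte 1024 visits one buffer with the early exit and all 512 without it. A folio that was not uptodate on entry still visits all 512 in both cases, matching the "keep those folios on the old full-buffer path" rule in the commit message.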


* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
  2026-06-08 12:01   ` [PATCH v2 2/2] ext4: avoid tail write_begin " Jia Zhu
@ 2026-06-08 14:29     ` Jan Kara
  2026-06-09  3:54       ` Jia Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2026-06-08 14:29 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Mon 08-06-26 20:01:31, Jia Zhu wrote:
> Ext4 buffered writes into large folios also pay a full buffer_head
> walk in ext4_block_write_begin().  For a small overwrite of an existing
> cached folio, the folio is already uptodate and the write only needs to
> prepare the buffers through the written range.  Walking the suffix still
> makes the write_begin cost proportional to the folio size.
> 
> Before ext4 enabled large folios for regular files, the same loop was
> bounded by a single page of buffers.  That commit made the existing
> full-folio walk visible as a regression for cached small overwrites.
> 
> The suffix walk is needed for non-uptodate folios, where ext4 may have
> to submit reads for partial blocks, preserve new-buffer cleanup, and run
> error zeroing.  Keep those folios on the old full walk.
> 
> For already-uptodate folios, keep the walk starting at the first buffer
> rather than seeking directly to from.  This preserves the existing prefix
> buffer state handling.  Stop once block_start reaches the end of the
> write range, because the skipped suffix would only repeat the
> outside-range uptodate handling for buffers beyond @to.
> 
> On current master, the libMicro ext4 large-folio overwrite test shows
> the following full-series result.  Results are median usecs/call over 10
> runs, lower is better:
> 
> case        nofix     this series   improvement
> write_u1k   1.418     0.3405        76.0%
> write_u10k  1.887     0.4175        77.9%
> pwrite_u1k  1.6775    0.3390        79.8%
> pwrite_u10k 1.9035    0.4130        78.3%
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good, just one simplification suggestion:

> @@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  		head = create_empty_buffers(folio, blocksize, 0);
>  	block = EXT4_PG_TO_LBLK(inode, folio->index);
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> +	for (bh = head, block_start = 0;
> +	     (bh != head || !block_start) &&
> +	     (!folio_uptodate || block_start < to);

You can simplify this condition to:

  block_start < to || (!folio_uptodate && bh != head)

With this updated, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


>  	    block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
>  		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> +			if (folio_uptodate)
>  				set_buffer_uptodate(bh);
> -			}
>  			continue;
>  		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
> @@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				if (should_journal_data)
>  					do_journal_get_write_access(handle,
>  								    inode, bh);
> -				if (folio_test_uptodate(folio)) {
> +				if (folio_uptodate) {
>  					/*
>  					 * Unlike __block_write_begin() we leave
>  					 * dirtying of new uptodate buffers to
> @@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				continue;
>  			}
>  		}
> -		if (folio_test_uptodate(folio)) {
> +		if (folio_uptodate) {
>  			set_buffer_uptodate(bh);
>  			continue;
>  		}
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
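
The posted loop condition and the simplified form can be compared by replaying the for-loop header in a userspace model. This is a sketch under stated assumptions, not kernel code: at_head stands in for bh == head, walk_len() and conditions_agree() are hypothetical helpers, and the check assumes 0 < to <= folio size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Loop condition as posted in v2 2/2. */
static bool cond_v2(bool at_head, size_t block_start, size_t to,
		    bool folio_uptodate)
{
	return (!at_head || !block_start) &&
	       (!folio_uptodate || block_start < to);
}

/* Simplified form suggested in review. */
static bool cond_simplified(bool at_head, size_t block_start, size_t to,
			    bool folio_uptodate)
{
	return block_start < to || (!folio_uptodate && !at_head);
}

/* Replay the loop header over nr_bufs buffers; return iteration count. */
static size_t walk_len(size_t nr_bufs, size_t blocksize, size_t to,
		       bool uptodate, bool simplified)
{
	size_t i = 0, block_start = 0, iters = 0;

	assert(nr_bufs > 0 && blocksize > 0 && to <= nr_bufs * blocksize);

	for (;;) {
		bool at_head = (i % nr_bufs) == 0;	/* ring wraps to head */
		bool go = simplified ?
			cond_simplified(at_head, block_start, to, uptodate) :
			cond_v2(at_head, block_start, to, uptodate);

		if (!go)
			break;
		iters++;
		assert(iters <= nr_bufs);	/* guard against a runaway loop */
		block_start += blocksize;
		i++;
	}
	return iters;
}

/* Exhaustively compare both forms for small geometries, with to > 0. */
static bool conditions_agree(size_t max_bufs, size_t blocksize)
{
	for (size_t n = 1; n <= max_bufs; n++)
		for (size_t to = 1; to <= n * blocksize; to++)
			for (int u = 0; u <= 1; u++)
				if (walk_len(n, blocksize, to, u, false) !=
				    walk_len(n, blocksize, to, u, true))
					return false;
	return true;
}
```

In this model both forms run the same number of iterations for every geometry checked. They diverge only for to == 0 on a non-uptodate folio, where the posted form walks every buffer and the simplified form walks none; the check excludes that case, and the thread does not discuss whether it can occur.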


* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
  2026-06-08 14:29     ` Jan Kara
@ 2026-06-09  3:54       ` Jia Zhu
  0 siblings, 0 replies; 11+ messages in thread
From: Jia Zhu @ 2026-06-09  3:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel

On Mon, Jun 08, 2026 at 04:29:59PM +0200, Jan Kara wrote:
> You can simplify this condition to:
> 
>   block_start < to || (!folio_uptodate && bh != head)
> 
> With this updated, feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks Jan.  I've applied this simplification in v3 and added your
Reviewed-by tag to both patches.



Thread overview: 11+ messages
2026-06-03 13:48 [PATCH] ext4: avoid full buffer walks for large folio partial writes Jia Zhu
2026-06-03 18:11 ` Matthew Wilcox
2026-06-05  9:02   ` Jia Zhu
2026-06-05 14:24     ` Matthew Wilcox
2026-06-08 11:56       ` Jia Zhu
2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
2026-06-08 12:01   ` [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios Jia Zhu
2026-06-08 13:06     ` Jan Kara
2026-06-08 12:01   ` [PATCH v2 2/2] ext4: avoid tail write_begin " Jia Zhu
2026-06-08 14:29     ` Jan Kara
2026-06-09  3:54       ` Jia Zhu
