Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes

Linux EXT4 FS development

From: Matthew Wilcox <willy@infradead.org>
To: Jia Zhu <zhujia.zj@bytedance.com>
Cc: Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Baokun Li <libaokun@linux.alibaba.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Ritesh Harjani <ritesh.list@gmail.com>,
	Zhang Yi <yi.zhang@huawei.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
Date: Fri, 5 Jun 2026 15:24:20 +0100
Message-ID: <aiLcFP2drmHGjEL2@casper.infradead.org>
In-Reply-To: <20260605090253.32822-1-zhujia.zj@bytedance.com>

On Fri, Jun 05, 2026 at 05:02:53PM +0800, Jia Zhu wrote:
> On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> > Is this a common case for you, or is this something you noticed by
> > inspection?
> 
> This was found by our kernel release benchmark.  We run libMicro as part
> of that test suite:
> 
>   https://github.com/rzezeski/libMicro
> 
> The regression shows up in buffered write/pwrite/writev overwrite tests
> on ext4 large folios.

Makes sense.  I'll assume this can correspond to a reasonable workload.
It certainly seems like something that could exist.

> > Wouldn't you get just as much benefit from this?
> 
> Yes.  I tested this approach, and it gives almost the same result as my
> original partial-commit helper.

Excellent!  Obviously it'd be even better if we didn't have to walk the
leading buffer_heads ... but there's no way to do this with the data
structure we have.

> Agreed.  The original ext4_block_write_begin() change was too aggressive.
> Seeking directly to @from also skips the prefix buffers, which makes the
> old side effects harder to prove.
> 
> For v2 I plan to drop that part and keep the existing walk from the head.
> The ext4 change would only stop after @to when the folio was already
> uptodate on entry, similar to your block_commit_write() suggestion:
> 
> +       bool folio_uptodate = folio_test_uptodate(folio);
> +
>         for (bh = head, block_start = 0;
> -            bh != head || !block_start;
> +            (bh != head || !block_start) &&
> +            (!folio_uptodate || block_start < to);
>              block++, block_start = block_end, bh = bh->b_this_page) {
>                 ...
>         }

Yes, I think that's a good approach.
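
For illustration only, the effect of the loop condition quoted above can
be modelled in userspace.  This is a rough sketch, not kernel code:
struct fake_bh, make_ring(), count_visited() and MAX_BLOCKS are
hypothetical names, and the 4 KiB block size and 512-block (2 MiB) folio
used in the note below are assumptions chosen for the example.  Only the
two-part termination test is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 512	/* assumed: 2 MiB folio with 4 KiB blocks */

/* Hypothetical stand-in for struct buffer_head: only the ring link. */
struct fake_bh {
	struct fake_bh *b_this_page;
};

/* Link bhs[0..n-1] into a circular list and return the head. */
static struct fake_bh *make_ring(struct fake_bh *bhs, size_t n)
{
	assert(n > 0 && n <= MAX_BLOCKS);
	for (size_t i = 0; i < n; i++)
		bhs[i].b_this_page = &bhs[(i + 1) % n];
	return &bhs[0];
}

/*
 * Count loop iterations using the termination test from the quoted diff.
 * The walk always starts at the head, so prefix buffers are still
 * visited.  It stops early only once block_start has reached @to and
 * the folio was already uptodate on entry.
 */
static size_t count_visited(struct fake_bh *head, size_t blocksize,
			    size_t to, bool folio_uptodate)
{
	struct fake_bh *bh;
	size_t block_start, block_end = 0, visited = 0;

	for (bh = head, block_start = 0;
	     (bh != head || !block_start) &&
	     (!folio_uptodate || block_start < to);
	     block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		visited++;
	}
	return visited;
}
```

Under those assumptions, a 1 KiB overwrite at offset 0 of an uptodate
folio visits one buffer instead of all 512, while a write ending at
byte 9216 still visits the two prefix buffers plus the one it touches.
A folio that was not uptodate on entry is walked in full, as before.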

> So the prefix path and all in-range handling stay unchanged.  The only
> skipped work is the tail part after @to, and only for a folio that was
> already uptodate before write_begin() started.
> 
> > ... converting ext4 to use iomap instead of buffer heads.
> 
> I strongly agree that iomap is the right direction for ext4.  The iomap
> buffered write path would make this particular buffer-head walk cost go
> away.
> 
> The reason I am still looking at this path is that the regression is
> visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> by the ext4 large-folio enablement in v6.16.  For example, in our
> libMicro release benchmark with THP always enabled, usecs/call, lower is
> better:
> 
> case        v6.12        v6.18        regression
> write_u1k   0.609        4.659        +665.0%
> write_u10k  1.408        4.869        +245.8%

Ouch ;-)  No wonder you want to address this.  Do you recover all the
regression with this fix?

> The iomap conversion is the long-term fix, but it does not help kernels
> which still use the buffer-head buffered write path.  I would like to keep
> this as a small regression fix for that path, and make it minimal enough
> to be suitable for stable/LTS backport.

Is it that you're using some ext4 features that aren't supported by
iomap yet?  Could you say which ones?  That might motivate someone to
prioritise that support.

> Would this v2 direction look OK to you?

Absolutely.  Very happy with this approach.

Thread overview: 11+ messages
2026-06-03 13:48 [PATCH] ext4: avoid full buffer walks for large folio partial writes Jia Zhu
2026-06-03 18:11 ` Matthew Wilcox
2026-06-05  9:02   ` Jia Zhu
2026-06-05 14:24     ` Matthew Wilcox [this message]
2026-06-08 11:56       ` Jia Zhu
2026-06-08 12:01 ` [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes Jia Zhu
2026-06-08 12:01   ` [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios Jia Zhu
2026-06-08 13:06     ` Jan Kara
2026-06-08 12:01   ` [PATCH v2 2/2] ext4: avoid tail write_begin " Jia Zhu
2026-06-08 14:29     ` Jan Kara
2026-06-09  3:54       ` Jia Zhu
