From: Matt Fleming <matt@readmodwrite.com>
To: Jan Kara <jack@suse.cz>
Cc: adilger.kernel@dilger.ca, kernel-team@cloudflare.com,
libaokun1@huawei.com, linux-ext4@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
matt@readmodwrite.com, tytso@mit.edu, willy@infradead.org
Subject: Re: ext4 writeback performance issue in 6.12
Date: Thu, 9 Oct 2025 11:17:48 +0100 [thread overview]
Message-ID: <20251009101748.529277-1-matt@readmodwrite.com> (raw)
In-Reply-To: <2nuegl4wtmu3lkprcomfeluii77ofrmkn4ukvbx2gesnqlsflk@yx466sbd7bni>
On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> Hi Matt!
>
> Nice talking to you again :)
Same. It's been too long :)
> On Wed 08-10-25 16:07:05, Matt Fleming wrote:
>
[...]
> So this particular hang check warning will be silenced by [1]. That being
> said if the writeback is indeed taking longer than expected (depends on
> cgroup configuration etc.) these patches will obviously not fix it. Based
> on what you write below, are you saying that most of the time from these
> 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> most of the time waiting for IO to complete...
Yeah, you're right. Most of the time is spenting waiting for writeback
to complete.
> So I'm somewhat confused here. How big is the allocation request? Above you
> write that average size of order 9 bucket is < 1280 which is true and
> makes me assume the allocation is for 1 stripe which is 1280 blocks. But
> here you write about order 9 allocation.
Sorry, I muddled my words. The allocation request is for 1280 blocks.
> Anyway, stripe aligned allocations don't always play well with
> mb_optimize_scan logic, so you can try mounting the filesystem with
> mb_optimize_scan=0 mount option.
Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
worse performance. It looks like it's taking a long time to write out
even 1 page to disk. The flusher thread has been running for 20+hours
now non-stop and it's blocking tasks waiting on writeback.
[Thu Oct 9 09:49:59 2025] INFO: task dockerd:45649 blocked for more than 70565 seconds.
mfleming@node:~$ ps -p 50674 -o pid,etime,cputime,comm
PID ELAPSED TIME COMMAND
50674 20:18:25 20:14:15 kworker/u400:20+flush-9:127
A perf profile shows:
# Overhead Command Shared Object Symbol
# ........ ............... ................. ...................................
#
32.09% kworker/u400:20 [kernel.kallsyms] [k] ext4_get_group_info
|
|--11.91%--ext4_mb_prefetch
| ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
| __writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
|
|--7.23%--ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
| __writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
mfleming@node:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_regular_allocator -- sleep 10
# DURATION | COUNT | GRAPH |
0 - 1 us | 0 | |
1 - 2 us | 0 | |
2 - 4 us | 0 | |
4 - 8 us | 0 | |
8 - 16 us | 0 | |
16 - 32 us | 0 | |
32 - 64 us | 0 | |
64 - 128 us | 0 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - 2 ms | 0 | |
2 - 4 ms | 0 | |
4 - 8 ms | 0 | |
8 - 16 ms | 0 | |
16 - 32 ms | 0 | |
32 - 64 ms | 0 | |
64 - 128 ms | 85 | ############################################# |
128 - 256 ms | 1 | |
256 - 512 ms | 0 | |
512 - 1024 ms | 0 | |
1 - ... s | 0 | |
mfleming@node:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_prefetch -- sleep 10
# DURATION | COUNT | GRAPH |
0 - 1 us | 130 | |
1 - 2 us | 1962306 | #################################### |
2 - 4 us | 497793 | ######### |
4 - 8 us | 4598 | |
8 - 16 us | 277 | |
16 - 32 us | 21 | |
32 - 64 us | 10 | |
64 - 128 us | 1 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - 2 ms | 0 | |
2 - 4 ms | 0 | |
4 - 8 ms | 0 | |
8 - 16 ms | 0 | |
16 - 32 ms | 0 | |
32 - 64 ms | 0 | |
64 - 128 ms | 0 | |
128 - 256 ms | 0 | |
256 - 512 ms | 0 | |
512 - 1024 ms | 0 | |
1 - ... s | 0 | |
mfleming@node:~$ sudo bpftrace -e 'fentry:vmlinux:writeback_sb_inodes / tid==50674/ { @in = args.work->nr_pages; @start=nsecs;} fexit:vmlinux:writeback_sb_inodes /tid == 50674/ { $delta = (nsecs - @start) / 1000000; printf("IN: work->nr_pages=%d, OUT: work->nr_pages=%d, wrote=%d page(s) in %dms\n", @in, args.work->nr_pages, @in - args.work->nr_pages, $delta);} END{clear(@in);} interval:s:5 { exit();}'
Attaching 4 probes...
IN: work->nr_pages=6095831, OUT: work->nr_pages=6095830, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095830, OUT: work->nr_pages=6095829, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095829, OUT: work->nr_pages=6095828, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095828, OUT: work->nr_pages=6095827, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095827, OUT: work->nr_pages=6095826, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095826, OUT: work->nr_pages=6095825, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095825, OUT: work->nr_pages=6095824, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095824, OUT: work->nr_pages=6095823, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095823, OUT: work->nr_pages=6095822, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095822, OUT: work->nr_pages=6095821, wrote=1 page(s) in 106ms
Thanks,
Matt
next prev parent reply other threads:[~2025-10-09 10:17 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-06 11:56 ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 15:07 ` Matt Fleming
2025-10-08 16:26 ` Theodore Ts'o
2025-10-09 10:22 ` Matt Fleming
2025-10-09 17:52 ` Matt Fleming
2025-10-10 2:04 ` Theodore Ts'o
2025-10-10 12:42 ` Matt Fleming
2025-10-08 16:35 ` Jan Kara
2025-10-09 10:17 ` Matt Fleming [this message]
2025-10-09 12:29 ` Jan Kara
2025-10-09 17:21 ` Matt Fleming
2025-10-10 17:23 ` Jan Kara
2025-10-14 10:13 ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50 ` Matt Fleming
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251009101748.529277-1-matt@readmodwrite.com \
--to=matt@readmodwrite.com \
--cc=adilger.kernel@dilger.ca \
--cc=jack@suse.cz \
--cc=kernel-team@cloudflare.com \
--cc=libaokun1@huawei.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.