From: Matt Fleming <matt@readmodwrite.com>
To: Theodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
kernel-team@cloudflare.com, linux-fsdevel@vger.kernel.org,
Matthew Wilcox <willy@infradead.org>
Subject: ext4 writeback performance issue in 6.12
Date: Mon, 6 Oct 2025 12:56:15 +0100
Message-ID: <20251006115615.2289526-1-matt@readmodwrite.com>
Hi,
We're seeing writeback take a long time and trigger blocked task
warnings on some of our database nodes, e.g.
INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
Tainted: G O 6.12.41-cloudflare-2025.8.2 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/34:2 state:D stack:0 pid:243325 tgid:243325 ppid:2 task_flags:0x4208060 flags:0x00004000
Workqueue: cgroup_destroy css_free_rwork_fn
Call Trace:
<TASK>
__schedule+0x4fb/0xbf0
schedule+0x27/0xf0
wb_wait_for_completion+0x5d/0x90
? __pfx_autoremove_wake_function+0x10/0x10
mem_cgroup_css_free+0x19/0xb0
css_free_rwork_fn+0x4e/0x430
process_one_work+0x17e/0x330
worker_thread+0x2ce/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0xd2/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
A large chunk of system time (4.43%) is being spent in the following
code path:
ext4_get_group_info+9
ext4_mb_good_group+41
ext4_mb_find_good_group_avg_frag_lists+136
ext4_mb_regular_allocator+2748
ext4_mb_new_blocks+2373
ext4_ext_map_blocks+2149
ext4_map_blocks+294
ext4_do_writepages+2031
ext4_writepages+173
do_writepages+229
__writeback_single_inode+65
writeback_sb_inodes+544
__writeback_inodes_wb+76
wb_writeback+413
wb_workfn+196
process_one_work+382
worker_thread+718
kthread+210
ret_from_fork+52
ret_from_fork_asm+26
That's the path through the CR_GOAL_LEN_FAST allocator.
The primary reason for all these cycles looks to be that we're spending
a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The average
fragment size lists are quite big, and the function fails to find a
suitable group pretty much every time it's called, either because the
list is empty (orders 10-13) or because the average fragment size is
< 1280 (order 9). I'm assuming it falls back to a linear scan at that
point.
https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
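A rough Python model of the reasoning above follows. It is a sketch under stated assumptions, not kernel code: it assumes 4 KiB blocks (14 orders, 0-13), that the starting list index is computed as fls(len) - 2, that a group only counts as good when free / fragments is at least the goal length, and that the goal length equals the stripe size of 1280 blocks. The function names are made up for illustration.

```python
# Rough model of the list selection described above, not kernel code.
# Assumptions: 4 KiB blocks (so 14 orders, 0-13), order = fls(len) - 2,
# and a group is "good" only when free / fragments >= goal length.

NUM_ORDERS = 14  # assumed: blocksize bits (12) + 2


def avg_fragment_size_order(length):
    """Map a length in blocks to an average-fragment-size list index."""
    order = length.bit_length() - 2  # int.bit_length() behaves like fls()
    if order < 0:
        return 0
    return min(order, NUM_ORDERS - 1)


def group_is_good(free_blocks, fragments, goal_len):
    """A group qualifies only if its average free extent covers the goal."""
    if free_blocks == 0 or fragments == 0:
        return False
    return free_blocks // fragments >= goal_len


goal = 1280  # stripe=1280, taken here as the goal length (assumption)
start = avg_fragment_size_order(goal)
print("first list scanned:", start)
print("lists scanned in turn:", list(range(start, NUM_ORDERS)))

# Two hypothetical groups that both sit on the order 9 list:
print(group_is_good(20000, 16, goal))  # average 1250 blocks
print(group_is_good(20800, 16, goal))  # average 1300 blocks
```

Under these assumptions the order 9 list covers average fragment sizes from 1024 to 2047 blocks, so a group can be on that list and still be rejected for a 1280-block goal.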
$ sudo cat /proc/fs/ext4/md127/mb_structs_summary
optimize_scan: 1
max_free_order_lists:
list_order_0_groups: 0
list_order_1_groups: 1
list_order_2_groups: 6
list_order_3_groups: 42
list_order_4_groups: 513
list_order_5_groups: 62
list_order_6_groups: 434
list_order_7_groups: 2602
list_order_8_groups: 10951
list_order_9_groups: 44883
list_order_10_groups: 152357
list_order_11_groups: 24899
list_order_12_groups: 30461
list_order_13_groups: 18756
avg_fragment_size_lists:
list_order_0_groups: 108
list_order_1_groups: 411
list_order_2_groups: 1640
list_order_3_groups: 5809
list_order_4_groups: 14909
list_order_5_groups: 31345
list_order_6_groups: 54132
list_order_7_groups: 90294
list_order_8_groups: 77322
list_order_9_groups: 10096
list_order_10_groups: 0
list_order_11_groups: 0
list_order_12_groups: 0
list_order_13_groups: 0
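To put numbers on the avg_fragment_size_lists section above, the following Python sketch parses it and counts how many groups sit on lists of order 9 and higher. The helper name parse_lists and the cutoff of order 9 are illustrative choices, and the input is simply the output pasted from above.

```python
# Sketch: count the groups on the avg_fragment_size lists at or above order 9.
# SUMMARY is the avg_fragment_size_lists section pasted from the output above.

SUMMARY = """\
avg_fragment_size_lists:
list_order_0_groups: 108
list_order_1_groups: 411
list_order_2_groups: 1640
list_order_3_groups: 5809
list_order_4_groups: 14909
list_order_5_groups: 31345
list_order_6_groups: 54132
list_order_7_groups: 90294
list_order_8_groups: 77322
list_order_9_groups: 10096
list_order_10_groups: 0
list_order_11_groups: 0
list_order_12_groups: 0
list_order_13_groups: 0
"""


def parse_lists(text):
    """Return {order: group_count} for every list_order_N_groups line."""
    counts = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key.startswith("list_order_") and key.endswith("_groups"):
            order = int(key[len("list_order_"):-len("_groups")])
            counts[order] = int(value)
    return counts


counts = parse_lists(SUMMARY)
candidates = {order: n for order, n in counts.items() if order >= 9}
print(candidates)
print(sum(candidates.values()), "of", sum(counts.values()), "groups")
```

Only the order 9 list is non-empty at or above the starting order, so every candidate for a 1280-block goal comes from those 10096 groups.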
The filesystems on these machines are striped and mounted with noatime:
$ grep ext4 /proc/mounts
/dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
Is there a tunable or configuration option I'm missing that would
avoid wasting time in ext4_mb_find_good_group_avg_frag_lists() when
an order 9 allocation is most likely going to fail anyway?
I'm happy to provide any more details that might help.
Thanks,
Matt