From: Matt Fleming <matt@readmodwrite.com>
To: Matt Fleming <matt@readmodwrite.com>
Cc: adilger.kernel@dilger.ca, kernel-team@cloudflare.com,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, tytso@mit.edu, willy@infradead.org,
Baokun Li <libaokun1@huawei.com>, Jan Kara <jack@suse.cz>
Subject: Re: ext4 writeback performance issue in 6.12
Date: Wed, 8 Oct 2025 16:07:05 +0100
Message-ID: <20251008150705.4090434-1-matt@readmodwrite.com>
In-Reply-To: <20251006115615.2289526-1-matt@readmodwrite.com>

(Adding Baokun and Jan in case they have any ideas)

On Mon, Oct 06, 2025 at 12:56:15 +0100, Matt Fleming wrote:
> Hi,
>
> We're seeing writeback take a long time and trigger blocked task
> warnings on some of our database nodes, e.g.:
>
> INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
> Tainted: G O 6.12.41-cloudflare-2025.8.2 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/34:2 state:D stack:0 pid:243325 tgid:243325 ppid:2 task_flags:0x4208060 flags:0x00004000
> Workqueue: cgroup_destroy css_free_rwork_fn
> Call Trace:
> <TASK>
> __schedule+0x4fb/0xbf0
> schedule+0x27/0xf0
> wb_wait_for_completion+0x5d/0x90
> ? __pfx_autoremove_wake_function+0x10/0x10
> mem_cgroup_css_free+0x19/0xb0
> css_free_rwork_fn+0x4e/0x430
> process_one_work+0x17e/0x330
> worker_thread+0x2ce/0x3f0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xd2/0x100
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x34/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> A large chunk of system time (4.43%) is being spent in the following
> code path:
>
> ext4_get_group_info+9
> ext4_mb_good_group+41
> ext4_mb_find_good_group_avg_frag_lists+136
> ext4_mb_regular_allocator+2748
> ext4_mb_new_blocks+2373
> ext4_ext_map_blocks+2149
> ext4_map_blocks+294
> ext4_do_writepages+2031
> ext4_writepages+173
> do_writepages+229
> __writeback_single_inode+65
> writeback_sb_inodes+544
> __writeback_inodes_wb+76
> wb_writeback+413
> wb_workfn+196
> process_one_work+382
> worker_thread+718
> kthread+210
> ret_from_fork+52
> ret_from_fork_asm+26
>
> That's the path through the CR_GOAL_LEN_FAST allocator.
>
> The primary reason for all these cycles looks to be that we're spending
> a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
> lists seem quite big and the function fails to find a suitable group
> pretty much every time it's called either because the frag list is empty
> (orders 10-13) or the average size is < 1280 (order 9). I'm assuming it
> falls back to a linear scan at that point.
>
> https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
>
> $ sudo cat /proc/fs/ext4/md127/mb_structs_summary
> optimize_scan: 1
> max_free_order_lists:
> list_order_0_groups: 0
> list_order_1_groups: 1
> list_order_2_groups: 6
> list_order_3_groups: 42
> list_order_4_groups: 513
> list_order_5_groups: 62
> list_order_6_groups: 434
> list_order_7_groups: 2602
> list_order_8_groups: 10951
> list_order_9_groups: 44883
> list_order_10_groups: 152357
> list_order_11_groups: 24899
> list_order_12_groups: 30461
> list_order_13_groups: 18756
> avg_fragment_size_lists:
> list_order_0_groups: 108
> list_order_1_groups: 411
> list_order_2_groups: 1640
> list_order_3_groups: 5809
> list_order_4_groups: 14909
> list_order_5_groups: 31345
> list_order_6_groups: 54132
> list_order_7_groups: 90294
> list_order_8_groups: 77322
> list_order_9_groups: 10096
> list_order_10_groups: 0
> list_order_11_groups: 0
> list_order_12_groups: 0
> list_order_13_groups: 0
>
> These machines are striped and are using noatime:
>
> $ grep ext4 /proc/mounts
> /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
>
> Is there some tunable or configuration option that I'm missing that
> could help here to avoid wasting time in
> ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> fail an order 9 allocation anyway?
>
> I'm happy to provide any more details that might help.
>
> Thanks,
> Matt
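
For anyone wanting to repeat the analysis of the quoted mb_structs_summary
output, here is a minimal Python sketch that parses the
avg_fragment_size_lists section and counts how many groups sit on the lists
a stripe-sized (1280 block) request would look at. The helper names are
made up for illustration. The order mapping is an assumption: it takes the
list order to be fls(len) - 2 clamped at 0, which agrees with the "order 9"
figure quoted above but should be checked against fs/ext4/mballoc.c for the
kernel in question.

```python
import re

# avg_fragment_size_lists section copied from the quoted report.
SAMPLE = """\
avg_fragment_size_lists:
list_order_0_groups: 108
list_order_1_groups: 411
list_order_2_groups: 1640
list_order_3_groups: 5809
list_order_4_groups: 14909
list_order_5_groups: 31345
list_order_6_groups: 54132
list_order_7_groups: 90294
list_order_8_groups: 77322
list_order_9_groups: 10096
list_order_10_groups: 0
list_order_11_groups: 0
list_order_12_groups: 0
list_order_13_groups: 0
"""


def parse_mb_structs_summary(text):
    """Return {section: {order: group_count}} for each '*_lists:' section."""
    sections = {}
    current = None
    for raw in text.splitlines():
        # Tolerate email quoting ('> ') and indentation.
        line = raw.strip().lstrip(">").strip()
        if line.endswith("_lists:"):
            current = line[:-1]
            sections[current] = {}
        else:
            m = re.fullmatch(r"list_order_(\d+)_groups:\s*(\d+)", line)
            if m and current is not None:
                sections[current][int(m.group(1))] = int(m.group(2))
    return sections


def assumed_avg_frag_order(length):
    """ASSUMPTION: list order is fls(len) - 2, clamped at 0.

    int.bit_length() matches fls() for positive integers. The upper clamp
    to the highest list order is omitted in this sketch.
    """
    return max(length.bit_length() - 2, 0)


def groups_at_or_above(lists, min_order):
    """Count groups on lists whose order is >= min_order."""
    return sum(n for order, n in lists.items() if order >= min_order)


avg = parse_mb_structs_summary(SAMPLE)["avg_fragment_size_lists"]
goal_order = assumed_avg_frag_order(1280)  # stripe=1280 from /proc/mounts

print("order for a 1280-block request:", goal_order)
print("groups on that list:", avg[goal_order])
print("groups on higher lists:", groups_at_or_above(avg, goal_order + 1))
print("groups on all lists:", sum(avg.values()))
```

With the figures above this reports order 9, 10096 groups on the order 9
list, none on orders 10-13, and 286066 groups in total. Under the same
assumption the order 9 list covers average fragment sizes from 1024 up to
2047 blocks, so a group can be on that list and still have an average
below 1280.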

Thread overview: 15+ messages
2025-10-06 11:56 ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 15:07 ` Matt Fleming [this message]
2025-10-08 16:26 ` Theodore Ts'o
2025-10-09 10:22 ` Matt Fleming
2025-10-09 17:52 ` Matt Fleming
2025-10-10 2:04 ` Theodore Ts'o
2025-10-10 12:42 ` Matt Fleming
2025-10-08 16:35 ` Jan Kara
2025-10-09 10:17 ` Matt Fleming
2025-10-09 12:29 ` Jan Kara
2025-10-09 17:21 ` Matt Fleming
2025-10-10 17:23 ` Jan Kara
2025-10-14 10:13 ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50 ` Matt Fleming