* ext4 writeback performance issue in 6.12
@ 2025-10-06 11:56 Matt Fleming
2025-10-08 15:07 ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
0 siblings, 2 replies; 15+ messages in thread
From: Matt Fleming @ 2025-10-06 11:56 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger
Cc: linux-ext4, linux-kernel, kernel-team, linux-fsdevel,
Matthew Wilcox
Hi,
We're seeing writeback take a long time and triggering blocked task
warnings on some of our database nodes, e.g.
INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
Tainted: G O 6.12.41-cloudflare-2025.8.2 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/34:2 state:D stack:0 pid:243325 tgid:243325 ppid:2 task_flags:0x4208060 flags:0x00004000
Workqueue: cgroup_destroy css_free_rwork_fn
Call Trace:
<TASK>
__schedule+0x4fb/0xbf0
schedule+0x27/0xf0
wb_wait_for_completion+0x5d/0x90
? __pfx_autoremove_wake_function+0x10/0x10
mem_cgroup_css_free+0x19/0xb0
css_free_rwork_fn+0x4e/0x430
process_one_work+0x17e/0x330
worker_thread+0x2ce/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0xd2/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
A large chunk of system time (4.43%) is being spent in the following
code path:
ext4_get_group_info+9
ext4_mb_good_group+41
ext4_mb_find_good_group_avg_frag_lists+136
ext4_mb_regular_allocator+2748
ext4_mb_new_blocks+2373
ext4_ext_map_blocks+2149
ext4_map_blocks+294
ext4_do_writepages+2031
ext4_writepages+173
do_writepages+229
__writeback_single_inode+65
writeback_sb_inodes+544
__writeback_inodes_wb+76
wb_writeback+413
wb_workfn+196
process_one_work+382
worker_thread+718
kthread+210
ret_from_fork+52
ret_from_fork_asm+26
That's the path through the CR_GOAL_LEN_FAST allocator.
The primary reason for all these cycles looks to be that we're spending
a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
lists are quite large, and the function fails to find a suitable group
pretty much every time it's called, either because the frag list is empty
(orders 10-13) or because the average fragment size is < 1280 blocks
(order 9). I'm assuming the allocator falls back to a linear scan at that
point.
https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
$ sudo cat /proc/fs/ext4/md127/mb_structs_summary
optimize_scan: 1
max_free_order_lists:
list_order_0_groups: 0
list_order_1_groups: 1
list_order_2_groups: 6
list_order_3_groups: 42
list_order_4_groups: 513
list_order_5_groups: 62
list_order_6_groups: 434
list_order_7_groups: 2602
list_order_8_groups: 10951
list_order_9_groups: 44883
list_order_10_groups: 152357
list_order_11_groups: 24899
list_order_12_groups: 30461
list_order_13_groups: 18756
avg_fragment_size_lists:
list_order_0_groups: 108
list_order_1_groups: 411
list_order_2_groups: 1640
list_order_3_groups: 5809
list_order_4_groups: 14909
list_order_5_groups: 31345
list_order_6_groups: 54132
list_order_7_groups: 90294
list_order_8_groups: 77322
list_order_9_groups: 10096
list_order_10_groups: 0
list_order_11_groups: 0
list_order_12_groups: 0
list_order_13_groups: 0
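
For reference, the bucketing works roughly like this (paraphrased from
6.12's fs/ext4/mballoc.c and simplified rather than copied verbatim, so
treat the details as approximate):

    /*
     * Paraphrased from fs/ext4/mballoc.c (6.12), simplified: groups sit on
     * avg_fragment_size lists indexed by floor(log2(average free extent
     * size)), and an allocation goal is mapped to a list index the same way.
     */
    static int mb_avg_fragment_size_order(int len)
    {
    	int order = fls(len) - 2;	/* fls(1280) == 11, so 1280 blocks -> order 9 */

    	return order < 0 ? 0 : order;
    }

So a 1280-block (one-stripe) goal starts at the order-9 list, whose groups
have average fragment sizes anywhere in [1024, 2047] blocks. Every group
whose average is below 1280 fails the CR_GOAL_LEN_FAST check (roughly,
average free extent size >= goal length), and since the order 10-13 lists
are empty here, that pass walks a lot of list entries without ever
returning a group, which matches what we see in the profiles.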
These machines are striped and are using noatime:
$ grep ext4 /proc/mounts
/dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
Is there some tunable or configuration option that I'm missing that
could help here to avoid wasting time in
ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
fail an order 9 allocation anyway?
I'm happy to provide any more details that might help.
Thanks,
Matt
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-08 15:07 UTC (permalink / raw)
To: Matt Fleming
Cc: adilger.kernel, kernel-team, linux-ext4, linux-fsdevel,
    linux-kernel, tytso, willy, Baokun Li, Jan Kara

(Adding Baokun and Jan in case they have any ideas)

On Mon, Oct 06, 2025 at 12:56:15 +0100, Matt Fleming wrote:
> Hi,
>
> We're seeing writeback take a long time and triggering blocked task
> warnings on some of our database nodes, e.g.
>
> [... full report quoted above: hung task warning, allocation call path,
>  mb_structs_summary dump and mount options ...]
>
> Is there some tunable or configuration option that I'm missing that
> could help here to avoid wasting time in
> ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> fail an order 9 allocation anyway?
>
> I'm happy to provide any more details that might help.
>
> Thanks,
> Matt
* Re: ext4 writeback performance issue in 6.12
From: Theodore Ts'o @ 2025-10-08 16:26 UTC (permalink / raw)
To: Matt Fleming
Cc: adilger.kernel, kernel-team, linux-ext4, linux-fsdevel,
    linux-kernel, willy, Baokun Li, Jan Kara

On Wed, Oct 08, 2025 at 04:07:05PM +0100, Matt Fleming wrote:
> >
> > These machines are striped and are using noatime:
> >
> > $ grep ext4 /proc/mounts
> > /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
> >
> > Is there some tunable or configuration option that I'm missing that
> > could help here to avoid wasting time in
> > ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> > fail an order 9 allocation anyway?

Can you try disabling the stripe parameter? If you are willing to try the
latest mainline kernel, there are some changes that *might* make a
difference, but RAID stripe alignment has been causing problems.

In fact, in the latest e2fsprogs release, we have added this change:

commit b61f182b2de1ea75cff935037883ba1a8c7db623
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sun May 4 14:07:14 2025 -0400

    mke2fs: don't set the raid stripe for non-rotational devices by default

    The ext4 block allocator is not at all efficient when it is asked to
    enforce RAID alignment. It is especially bad for flash-based devices,
    or when the file system is highly fragmented.

    For non-rotational devices, it's fine to set the stride parameter
    (which controls spreading the allocation bitmaps across the RAID
    component devices, which always makes sense); but for the stripe
    parameter (which asks the ext4 block allocator to try _very_ hard to
    find RAID stripe aligned allocations) it's probably not a good idea.

    Add new mke2fs.conf parameters with the defaults:

    [defaults]
        set_raid_stride = always
        set_raid_stripe = disk

    Even for RAID arrays based on HDD's, we can still have problems for
    highly fragmented file systems. This will need to be solved in the
    kernel, probably by having some kind of wall clock or CPU time
    limitation for each block allocation, or by adding some kind of
    optimization which is faster than using our current buddy bitmap
    implementation, especially if the stripe size is not a multiple of a
    power of two.

    But for SSD's, it's much less likely to make sense even if we have an
    optimized block allocator, because if you've paid $$$ for a
    flash-based RAID array, the cost/benefit tradeoff of doing less
    optimized stripe RMW cycles versus the block allocator time and CPU
    overhead is harder to justify without a lot of optimization effort.

    If and when we can improve the ext4 kernel implementation (and it
    gets rolled out to users using LTS kernels), we can change the
    defaults. And of course, system administrators can always change
    /etc/mke2fs.conf settings.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

					- Ted
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-09 10:22 UTC (permalink / raw)
To: Theodore Ts'o
Cc: adilger.kernel, jack, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, willy

On Wed, Oct 08, 2025 at 12:26:55PM -0400, Theodore Ts'o wrote:
> On Wed, Oct 08, 2025 at 04:07:05PM +0100, Matt Fleming wrote:
> > > These machines are striped and are using noatime:
> > >
> > > $ grep ext4 /proc/mounts
> > > /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
> > [...]
>
> Can you try disabling the stripe parameter? If you are willing to try the
> latest mainline kernel, there are some changes that *might* make a
> difference, but RAID stripe alignment has been causing problems.

Thanks Ted. I'm going to try disabling the stripe parameter now. I'll report
back shortly.
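
For reference, the two ways I know of to clear the stripe setting. Neither
is verified on this kernel/e2fsprogs combination, and exactly which value
the kernel ends up using for s_stripe (mount option vs. superblock fields)
differs between versions, so I'll check /proc/mounts and dumpe2fs -h
afterwards:

    # Sketch only -- verify against your kernel/e2fsprogs before relying on it.

    # Option 1: override the stripe at (re)mount time.
    mount -o remount,stripe=0 /state

    # Option 2: clear the RAID geometry hints in the superblock. Stride alone
    # is harmless to keep, but some kernels fall back to it when stripe_width
    # is 0, so clear both if the goal is no stripe-aligned allocation at all.
    tune2fs -E stripe_width=0,stride=0 /dev/md127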
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-09 17:52 UTC (permalink / raw)
To: Theodore Ts'o
Cc: adilger.kernel, jack, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, willy

On Thu, Oct 09, 2025 at 11:22:59AM +0100, Matt Fleming wrote:
>
> Thanks Ted. I'm going to try disabling the stripe parameter now. I'll report
> back shortly.

Initial results look very good. No blocked tasks so far and the mb
allocator latency is much improved.

mfleming@node:~$ sudo perf ftrace latency -b -a -T ext4_mb_regular_allocator -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |          0 |                                                |
     1 - 2    us |          0 |                                                |
     2 - 4    us |         41 |                                                |
     4 - 8    us |        499 | ###########                                    |
     8 - 16   us |        246 | #####                                          |
    16 - 32   us |        126 | ##                                             |
    32 - 64   us |        103 | ##                                             |
    64 - 128  us |         74 | #                                              |
   128 - 256  us |        109 | ##                                             |
   256 - 512  us |        293 | ######                                         |
   512 - 1024 us |        448 | ##########                                     |
     1 - 2    ms |         36 |                                                |
     2 - 4    ms |         11 |                                                |
     4 - 8    ms |          1 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |          0 |                                                |
   128 - 256  ms |          0 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

Thanks,
Matt
* Re: ext4 writeback performance issue in 6.12
From: Theodore Ts'o @ 2025-10-10 2:04 UTC (permalink / raw)
To: Matt Fleming
Cc: adilger.kernel, jack, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, willy

On Thu, Oct 09, 2025 at 06:52:54PM +0100, Matt Fleming wrote:
> On Thu, Oct 09, 2025 at 11:22:59AM +0100, Matt Fleming wrote:
> >
> > Thanks Ted. I'm going to try disabling the stripe parameter now. I'll report
> > back shortly.
>
> Initial results look very good. No blocked tasks so far and the mb
> allocator latency is much improved.

OK, so that definitely confirms the theory of what's going on. There
have been some changes in the latest kernel that *might* address what
you're seeing. The challenge is that we don't have an easy reproducer
that doesn't involve using a large file system running a production
workload. If you can only run this on a production server, it's
probably not fair to ask you to try running 6.17.1 and see if it shows
up there.

I do think in the long term, we need to augment the buddy bitmap in
fs/ext4/mballoc.c with some data structure which tracks free space in
units of stripe blocks, so we can do block allocation in a much more
efficient way for RAID systems. The simplest way would be to add a
counter of the number of aligned free stripes in the group info
structure, plus a bit array which indicates which aligned stripes are
free. This is not just to improve stripe allocation: when doing
sub-stripe allocation, we could also preferentially allocate out of
stripes which are already partially in use.

Out of curiosity, are you using the stride parameter because you're
using an SSD-based RAID array, or an HDD-based RAID array?

					- Ted
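
To sketch what I mean (none of these names or fields exist in
fs/ext4/mballoc.c today; they are hypothetical and the sketch ignores how
the structure would be kept in sync with the buddy bitmap):

    #include <linux/bitops.h>
    #include <linux/types.h>

    /*
     * Hypothetical per-group stripe accounting; illustration only, these
     * fields are NOT part of today's struct ext4_group_info.
     */
    struct ext4_group_stripe_info {
    	unsigned int	gsi_free_stripes;   /* # of fully free, aligned stripes */
    	unsigned long	*gsi_stripe_bitmap; /* bit N set => stripe N entirely free */
    };

    /* A stripe-aligned allocation only needs to consult the counter... */
    static inline bool group_has_free_stripe(const struct ext4_group_stripe_info *gsi)
    {
    	return gsi->gsi_free_stripes > 0;
    }

    /* ...and a sub-stripe allocation can prefer stripes that are already
     * partially used, i.e. stripes whose "fully free" bit is clear. */
    static inline bool stripe_fully_free(const struct ext4_group_stripe_info *gsi,
    				     unsigned int stripe)
    {
    	return test_bit(stripe, gsi->gsi_stripe_bitmap);
    }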
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-10 12:42 UTC (permalink / raw)
To: Theodore Ts'o
Cc: adilger.kernel, jack, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, willy

On Thu, Oct 09, 2025 at 10:04:10PM -0400, Theodore Ts'o wrote:
>
> OK, so that definitely confirms the theory of what's going on. There
> have been some changes in the latest kernel that *might* address what
> you're seeing. The challenge is that we don't have an easy reproducer
> that doesn't involve using a large file system running a production
> workload. If you can only run this on a production server, it's
> probably not fair to ask you to try running 6.17.1 and see if it shows
> up there.

FWIW we will likely pick up the next LTS so I can get you an answer but
it might take a few months :)

> I do think in the long term, we need to augment the buddy bitmap in
> fs/ext4/mballoc.c with some data structure which tracks free space in
> units of stripe blocks, so we can do block allocation in a much more
> efficient way for RAID systems. [...]
>
> Out of curiosity, are you using the stride parameter because you're
> using an SSD-based RAID array, or an HDD-based RAID array?

We're using SSD-based RAID 0 with 10 disks.

$ sudo dumpe2fs -h /dev/md127 | grep -E "stride|stripe"
dumpe2fs 1.47.0 (5-Feb-2023)
RAID stride:              128
RAID stripe width:        1280
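
For reference, assuming the default 4 KiB block size, that geometry works
out as:

    stride       = 128 blocks x 4 KiB/block          = 512 KiB per-disk chunk
    stripe width = 128 blocks/chunk x 10 data disks  = 1280 blocks = 5 MiB full stripe

which is why every stripe-aligned allocation goal in the profiles above is
exactly 1280 blocks.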
* Re: ext4 writeback performance issue in 6.12
From: Jan Kara @ 2025-10-08 16:35 UTC (permalink / raw)
To: Matt Fleming
Cc: adilger.kernel, kernel-team, linux-ext4, linux-fsdevel,
    linux-kernel, tytso, willy, Baokun Li, Jan Kara

Hi Matt!

Nice talking to you again :)

On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> (Adding Baokun and Jan in case they have any ideas)
> On Mon, Oct 06, 2025 at 12:56:15 +0100, Matt Fleming wrote:
> > Hi,
> >
> > We're seeing writeback take a long time and triggering blocked task
> > warnings on some of our database nodes, e.g.
> >
> > INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
> > Workqueue: cgroup_destroy css_free_rwork_fn
> > Call Trace:
> >  <TASK>
> >  __schedule+0x4fb/0xbf0
> >  schedule+0x27/0xf0
> >  wb_wait_for_completion+0x5d/0x90
> > [...]
> >  </TASK>

So this particular hang check warning will be silenced by [1]. That being
said, if the writeback is indeed taking longer than expected (depends on
cgroup configuration etc.) these patches will obviously not fix it. Based
on what you write below, are you saying that most of the time from these
225s is spent in the filesystem allocating blocks? I'd expect we'd spend
most of the time waiting for IO to complete...

[1] https://lore.kernel.org/linux-fsdevel/20250930065637.1876707-1-sunjunchao@bytedance.com/

> > The primary reason for all these cycles looks to be that we're spending
> > a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
> > lists are quite large, and the function fails to find a suitable group
> > pretty much every time it's called, either because the frag list is empty
> > (orders 10-13) or because the average fragment size is < 1280 blocks
> > (order 9). I'm assuming the allocator falls back to a linear scan at that
> > point.
> > [...]
> > These machines are striped and are using noatime:
> >
> > $ grep ext4 /proc/mounts
> > /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
> >
> > Is there some tunable or configuration option that I'm missing that
> > could help here to avoid wasting time in
> > ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> > fail an order 9 allocation anyway?

So I'm somewhat confused here. How big is the allocation request? Above you
write that the average size of the order 9 bucket is < 1280, which is true
and makes me assume the allocation is for 1 stripe, which is 1280 blocks.
But here you write about an order 9 allocation.

Anyway, stripe aligned allocations don't always play well with
mb_optimize_scan logic, so you can try mounting the filesystem with the
mb_optimize_scan=0 mount option.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-09 10:17 UTC (permalink / raw)
To: Jan Kara
Cc: adilger.kernel, kernel-team, libaokun1, linux-ext4, linux-fsdevel,
    linux-kernel, matt, tytso, willy

On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> Hi Matt!
>
> Nice talking to you again :)

Same. It's been too long :)

> On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> [...]
> So this particular hang check warning will be silenced by [1]. That being
> said, if the writeback is indeed taking longer than expected (depends on
> cgroup configuration etc.) these patches will obviously not fix it. Based
> on what you write below, are you saying that most of the time from these
> 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> most of the time waiting for IO to complete...

Yeah, you're right. Most of the time is spent waiting for writeback
to complete.

> So I'm somewhat confused here. How big is the allocation request? Above you
> write that the average size of the order 9 bucket is < 1280, which is true
> and makes me assume the allocation is for 1 stripe, which is 1280 blocks.
> But here you write about an order 9 allocation.

Sorry, I muddled my words. The allocation request is for 1280 blocks.

> Anyway, stripe aligned allocations don't always play well with
> mb_optimize_scan logic, so you can try mounting the filesystem with the
> mb_optimize_scan=0 mount option.

Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
worse performance. It looks like it's taking a long time to write out
even 1 page to disk. The flusher thread has been running for 20+ hours
now non-stop and it's blocking tasks waiting on writeback.

[Thu Oct  9 09:49:59 2025] INFO: task dockerd:45649 blocked for more than 70565 seconds.

mfleming@node:~$ ps -p 50674 -o pid,etime,cputime,comm
    PID     ELAPSED     TIME COMMAND
  50674    20:18:25 20:14:15 kworker/u400:20+flush-9:127

A perf profile shows:

# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...................................
#
    32.09%  kworker/u400:20  [kernel.kallsyms]  [k] ext4_get_group_info
            |
            |--11.91%--ext4_mb_prefetch
            |          ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm
            |
            |--7.23%--ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm

mfleming@node:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_regular_allocator -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |          0 |                                                |
     1 - 2    us |          0 |                                                |
     2 - 4    us |          0 |                                                |
     4 - 8    us |          0 |                                                |
     8 - 16   us |          0 |                                                |
    16 - 32   us |          0 |                                                |
    32 - 64   us |          0 |                                                |
    64 - 128  us |          0 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |         85 | #############################################  |
   128 - 256  ms |          1 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_prefetch -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |        130 |                                                |
     1 - 2    us |    1962306 | ####################################           |
     2 - 4    us |     497793 | #########                                      |
     4 - 8    us |       4598 |                                                |
     8 - 16   us |        277 |                                                |
    16 - 32   us |         21 |                                                |
    32 - 64   us |         10 |                                                |
    64 - 128  us |          1 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |          0 |                                                |
   128 - 256  ms |          0 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo bpftrace -e '
	fentry:vmlinux:writeback_sb_inodes /tid == 50674/ {
		@in = args.work->nr_pages;
		@start = nsecs;
	}
	fexit:vmlinux:writeback_sb_inodes /tid == 50674/ {
		$delta = (nsecs - @start) / 1000000;
		printf("IN: work->nr_pages=%d, OUT: work->nr_pages=%d, wrote=%d page(s) in %dms\n",
		       @in, args.work->nr_pages, @in - args.work->nr_pages, $delta);
	}
	END { clear(@in); }
	interval:s:5 { exit(); }'
Attaching 4 probes...
IN: work->nr_pages=6095831, OUT: work->nr_pages=6095830, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095830, OUT: work->nr_pages=6095829, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095829, OUT: work->nr_pages=6095828, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095828, OUT: work->nr_pages=6095827, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095827, OUT: work->nr_pages=6095826, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095826, OUT: work->nr_pages=6095825, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095825, OUT: work->nr_pages=6095824, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095824, OUT: work->nr_pages=6095823, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095823, OUT: work->nr_pages=6095822, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095822, OUT: work->nr_pages=6095821, wrote=1 page(s) in 106ms

Thanks,
Matt
* Re: ext4 writeback performance issue in 6.12
From: Jan Kara @ 2025-10-09 12:29 UTC (permalink / raw)
To: Matt Fleming
Cc: Jan Kara, adilger.kernel, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, tytso, willy

On Thu 09-10-25 11:17:48, Matt Fleming wrote:
> On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> > On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> > So this particular hang check warning will be silenced by [1]. That being
> > said, if the writeback is indeed taking longer than expected (depends on
> > cgroup configuration etc.) these patches will obviously not fix it. Based
> > on what you write below, are you saying that most of the time from these
> > 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> > most of the time waiting for IO to complete...
>
> Yeah, you're right. Most of the time is spent waiting for writeback
> to complete.

OK, so even if we reduce the somewhat pointless CPU load in the allocator
you aren't going to see a substantial increase in your writeback throughput.
Reducing the CPU load is obviously a worthy goal but I'm not sure if that's
your motivation or something else that I'm missing :).

> > So I'm somewhat confused here. How big is the allocation request? Above you
> > write that the average size of the order 9 bucket is < 1280, which is true
> > and makes me assume the allocation is for 1 stripe, which is 1280 blocks.
> > But here you write about an order 9 allocation.
>
> Sorry, I muddled my words. The allocation request is for 1280 blocks.

OK, thanks for confirmation.

> > Anyway, stripe aligned allocations don't always play well with
> > mb_optimize_scan logic, so you can try mounting the filesystem with the
> > mb_optimize_scan=0 mount option.
>
> Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
> worse performance. It looks like it's taking a long time to write out
> even 1 page to disk. The flusher thread has been running for 20+ hours
> now non-stop and it's blocking tasks waiting on writeback.

OK, so clearly (based on the perf results you've posted) mb_optimize_scan
does significantly reduce the pointless scanning for free space (in the
past we had some pathological cases where it was making things worse).
It's just that there's still some pointless scanning left. Then, as Ted
writes, removing the stripe mount option might be another way to reduce
the scanning.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-09 17:21 UTC (permalink / raw)
To: Jan Kara
Cc: adilger.kernel, kernel-team, libaokun1, linux-ext4, linux-fsdevel,
    linux-kernel, tytso, willy

On Thu, Oct 09, 2025 at 02:29:07PM +0200, Jan Kara wrote:
>
> OK, so even if we reduce the somewhat pointless CPU load in the allocator
> you aren't going to see a substantial increase in your writeback throughput.
> Reducing the CPU load is obviously a worthy goal but I'm not sure if that's
> your motivation or something else that I'm missing :).

I'm not following. If you reduce the time it takes to allocate blocks
during writeback, why will that not improve writeback throughput?

Thanks,
Matt
* Re: ext4 writeback performance issue in 6.12
From: Jan Kara @ 2025-10-10 17:23 UTC (permalink / raw)
To: Matt Fleming
Cc: Jan Kara, adilger.kernel, kernel-team, libaokun1, linux-ext4,
    linux-fsdevel, linux-kernel, tytso, willy

On Thu 09-10-25 18:21:53, Matt Fleming wrote:
> On Thu, Oct 09, 2025 at 02:29:07PM +0200, Jan Kara wrote:
> >
> > OK, so even if we reduce the somewhat pointless CPU load in the allocator
> > you aren't going to see a substantial increase in your writeback
> > throughput. Reducing the CPU load is obviously a worthy goal but I'm not
> > sure if that's your motivation or something else that I'm missing :).
>
> I'm not following. If you reduce the time it takes to allocate blocks
> during writeback, why will that not improve writeback throughput?

Maybe I misunderstood what you wrote about your profiles but you wrote that
we were spending about 4% of CPU time in the block allocation code. Even if
we get that close to 0%, you'd still gain only 4%. Or am I misunderstanding
something?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-14 10:13 UTC (permalink / raw)
To: Jan Kara
Cc: adilger.kernel, kernel-team, libaokun1, linux-ext4, linux-fsdevel,
    linux-kernel, tytso, willy

On Fri, Oct 10, 2025 at 07:23:54PM +0200, Jan Kara wrote:
>
> Maybe I misunderstood what you wrote about your profiles but you wrote that
> we were spending about 4% of CPU time in the block allocation code. Even if
> we get that close to 0%, you'd still gain only 4%. Or am I misunderstanding
> something?

Ah, I see. Yeah that's true but that's 4% of CPU cycles that could be put
to better use elsewhere :D
* Re: ext4 writeback performance issue in 6.12
From: Ojaswin Mujoo @ 2025-10-09 12:36 UTC (permalink / raw)
To: Matt Fleming
Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, linux-kernel,
    kernel-team, linux-fsdevel, Matthew Wilcox

On Mon, Oct 06, 2025 at 12:56:15PM +0100, Matt Fleming wrote:
> Hi,
>
> We're seeing writeback take a long time and triggering blocked task
> warnings on some of our database nodes, e.g.
> [...]
> That's the path through the CR_GOAL_LEN_FAST allocator.
>
> The primary reason for all these cycles looks to be that we're spending
> a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
> lists are quite large, and the function fails to find a suitable group
> pretty much every time it's called, either because the frag list is empty
> (orders 10-13) or because the average fragment size is < 1280 blocks
> (order 9). I'm assuming the allocator falls back to a linear scan at that
> point.
> [...]
> These machines are striped and are using noatime:

Hi Matt,

Thanks for the details, we have had issues in the past where the allocator
gets stuck in a loop trying too hard to find blocks that are aligned to
the stripe size [1], but this particular issue was patched in a pre-6.12
kernel.

Coming to the above details, ext4_mb_find_good_group_avg_frag_lists()
exits early if there are no groups of the needed order, so if we do have
many order 9+ allocations we shouldn't have been spending more time
there. The issue, I think, is the order 9 allocations, which the
allocator thinks it can satisfy but then cannot easily find space for.

If ext4_mb_find_good_group_avg_frag_lists() is indeed a bottleneck, there
are 2 places where it could be getting called from:

- ext4_mb_choose_next_group_goal_fast (criteria = EXT4_MB_CR_GOAL_LEN_FAST)
- ext4_mb_choose_next_group_best_avail (criteria = EXT4_MB_CR_BEST_AVAIL_LEN)

Will it be possible for you to use bpf to try to figure out which one of
the callers is actually the one bottlenecking (might be tricky since
they will mostly get inlined), and to grab a sample of values for
ac_g_ex->fe_len and ac_b_ex->fe_len if possible?

Also, can you share the ext4 mb stats by enabling them via:

echo 1 > /sys/fs/ext4/vda2/mb_stats

And then once you are able to replicate it for a few mins:

cat /proc/fs/ext4/vda2/mb_stats

This will also give some idea of where the allocator is spending more
time.

Also, as Ted suggested, switching stripe off might also help here.

Regards,
Ojaswin
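
Something like the following bpftrace sketch might work for gathering both
(untested; it assumes BTF is available for these ext4 symbols so the args
syntax works, and that the parameter/field names -- ac, order, ac_criteria,
ac_g_ex, ac_b_ex, fe_len -- match your 6.12 build; the criteria keys will
print as raw enum values):

    sudo bpftrace -e '
    fentry:ext4_mb_find_good_group_avg_frag_lists {
    	/* which allocation criteria keeps hitting the frag lists, and at what order */
    	@calls[args.ac->ac_criteria, args.order] = count();
    }
    fentry:ext4_mb_regular_allocator { @goal_len = hist(args.ac->ac_g_ex.fe_len); }
    fexit:ext4_mb_regular_allocator  { @best_len = hist(args.ac->ac_b_ex.fe_len); }
    interval:s:30 { exit(); }'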
* Re: ext4 writeback performance issue in 6.12
From: Matt Fleming @ 2025-10-09 17:50 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, linux-kernel,
    kernel-team, linux-fsdevel, Matthew Wilcox

On Thu, Oct 09, 2025 at 06:06:27PM +0530, Ojaswin Mujoo wrote:
>
> Hi Matt,
>
> Thanks for the details, we have had issues in the past where the allocator
> gets stuck in a loop trying too hard to find blocks that are aligned to
> the stripe size [1], but this particular issue was patched in a pre-6.12
> kernel.

Yeah, we (Cloudflare) hit this exact issue last year.

> Coming to the above details, ext4_mb_find_good_group_avg_frag_lists()
> exits early if there are no groups of the needed order, so if we do have
> many order 9+ allocations we shouldn't have been spending more time
> there. The issue, I think, is the order 9 allocations, which the
> allocator thinks it can satisfy but then cannot easily find space for.
>
> If ext4_mb_find_good_group_avg_frag_lists() is indeed a bottleneck, there
> are 2 places where it could be getting called from:
>
> - ext4_mb_choose_next_group_goal_fast (criteria = EXT4_MB_CR_GOAL_LEN_FAST)
> - ext4_mb_choose_next_group_best_avail (criteria = EXT4_MB_CR_BEST_AVAIL_LEN)
>
> Will it be possible for you to use bpf to try to figure out which one of
> the callers is actually the one bottlenecking (might be tricky since
> they will mostly get inlined), and to grab a sample of values for
> ac_g_ex->fe_len and ac_b_ex->fe_len if possible?

Mostly we go through ext4_mb_choose_next_group_goal_fast() but we also
go through ext4_mb_choose_next_group_best_avail().

> Also, can you share the ext4 mb stats by enabling them via:
>
> echo 1 > /sys/fs/ext4/vda2/mb_stats
>
> And then once you are able to replicate it for a few mins:
>
> cat /proc/fs/ext4/vda2/mb_stats
>
> This will also give some idea of where the allocator is spending more
> time.
>
> Also, as Ted suggested, switching stripe off might also help here.

Preliminary results look very promising with stripe disabled.
end of thread, other threads: [~2025-10-14 10:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-06 11:56 ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 15:07 ` Matt Fleming
2025-10-08 16:26   ` Theodore Ts'o
2025-10-09 10:22     ` Matt Fleming
2025-10-09 17:52       ` Matt Fleming
2025-10-10  2:04         ` Theodore Ts'o
2025-10-10 12:42           ` Matt Fleming
2025-10-08 16:35   ` Jan Kara
2025-10-09 10:17     ` Matt Fleming
2025-10-09 12:29       ` Jan Kara
2025-10-09 17:21         ` Matt Fleming
2025-10-10 17:23           ` Jan Kara
2025-10-14 10:13             ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50   ` Matt Fleming