From: Matt Fleming <matt@readmodwrite.com>
To: Theodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@cloudflare.com, linux-fsdevel@vger.kernel.org,
	Matthew Wilcox <willy@infradead.org>
Subject: ext4 writeback performance issue in 6.12
Date: Mon,  6 Oct 2025 12:56:15 +0100
Message-ID: <20251006115615.2289526-1-matt@readmodwrite.com>

Hi,

We're seeing writeback take a long time and trigger blocked task
warnings on some of our database nodes, e.g.

  INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
        Tainted: G           O       6.12.41-cloudflare-2025.8.2 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/34:2    state:D stack:0     pid:243325 tgid:243325 ppid:2      task_flags:0x4208060 flags:0x00004000
  Workqueue: cgroup_destroy css_free_rwork_fn
  Call Trace:
   <TASK>
   __schedule+0x4fb/0xbf0
   schedule+0x27/0xf0
   wb_wait_for_completion+0x5d/0x90
   ? __pfx_autoremove_wake_function+0x10/0x10
   mem_cgroup_css_free+0x19/0xb0
   css_free_rwork_fn+0x4e/0x430
   process_one_work+0x17e/0x330
   worker_thread+0x2ce/0x3f0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xd2/0x100
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x34/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

A large chunk of system time (4.43%) is being spent in the following
code path:

   ext4_get_group_info+9
   ext4_mb_good_group+41
   ext4_mb_find_good_group_avg_frag_lists+136
   ext4_mb_regular_allocator+2748
   ext4_mb_new_blocks+2373
   ext4_ext_map_blocks+2149
   ext4_map_blocks+294
   ext4_do_writepages+2031
   ext4_writepages+173
   do_writepages+229
   __writeback_single_inode+65
   writeback_sb_inodes+544
   __writeback_inodes_wb+76
   wb_writeback+413
   wb_workfn+196
   process_one_work+382
   worker_thread+718
   kthread+210
   ret_from_fork+52
   ret_from_fork_asm+26

That's the path through the CR_GOAL_LEN_FAST allocator.

The primary reason for all these cycles looks to be that we're spending
a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
lists seem quite big, and the function fails to find a suitable group
pretty much every time it's called, either because the frag list is
empty (orders 10-13) or because the average size is < 1280 (order 9).
I'm assuming it falls back to a linear scan at that point.

  https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
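To make the order numbers above concrete, here is a minimal sketch of how
a fragment length in blocks could map to an avg_fragment_size list index.
It assumes the index is fls(len) - 2, clamped to the 14 lists (orders
0-13) shown in mb_structs_summary; that formula is an assumption for
illustration and is not copied from the 6.12 source. The function name
and the num_orders parameter are hypothetical.

```python
def avg_fragment_size_order(length: int, num_orders: int = 14) -> int:
    """Map a fragment length in blocks to a list index.

    Assumption: index = fls(length) - 2, clamped to [0, num_orders - 1].
    int.bit_length() plays the role of fls() for positive integers.
    """
    order = length.bit_length() - 2
    if order < 0:
        return 0
    return min(order, num_orders - 1)


# Under this assumption, stripe=1280 maps to order 9, and that list also
# holds groups whose average fragment size is 1024..1279, which are too
# small for a 1280-block goal.
stripe = 1280
stripe_order = avg_fragment_size_order(stripe)
smallest_len_in_same_list = 1 << (stripe_order + 1)
print(stripe_order, smallest_len_in_same_list)
```

If the mapping holds, a group can sit on the order 9 list and still be
rejected, which would match the "average size is < 1280" failures.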

$ sudo cat /proc/fs/ext4/md127/mb_structs_summary
optimize_scan: 1
max_free_order_lists:
	list_order_0_groups: 0
	list_order_1_groups: 1
	list_order_2_groups: 6
	list_order_3_groups: 42
	list_order_4_groups: 513
	list_order_5_groups: 62
	list_order_6_groups: 434
	list_order_7_groups: 2602
	list_order_8_groups: 10951
	list_order_9_groups: 44883
	list_order_10_groups: 152357
	list_order_11_groups: 24899
	list_order_12_groups: 30461
	list_order_13_groups: 18756
avg_fragment_size_lists:
	list_order_0_groups: 108
	list_order_1_groups: 411
	list_order_2_groups: 1640
	list_order_3_groups: 5809
	list_order_4_groups: 14909
	list_order_5_groups: 31345
	list_order_6_groups: 54132
	list_order_7_groups: 90294
	list_order_8_groups: 77322
	list_order_9_groups: 10096
	list_order_10_groups: 0
	list_order_11_groups: 0
	list_order_12_groups: 0
	list_order_13_groups: 0
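
For anyone who wants to check the same thing on their own machines, the
following is a rough sketch that parses mb_structs_summary text and
reports which avg_fragment_size lists at or above a given order contain
any groups. The helper names are hypothetical, and SAMPLE is a trimmed
copy of the output above rather than a live read of /proc.

```python
import re

# Trimmed copy of the output above (orders 8-13 of the avg fragment lists).
SAMPLE = """\
optimize_scan: 1
avg_fragment_size_lists:
	list_order_8_groups: 77322
	list_order_9_groups: 10096
	list_order_10_groups: 0
	list_order_11_groups: 0
	list_order_12_groups: 0
	list_order_13_groups: 0
"""

LIST_LINE = re.compile(r"list_order_(\d+)_groups:\s*(\d+)")


def parse_mb_structs_summary(text: str) -> dict[str, dict[int, int]]:
    """Return {section name: {order: group count}} for each *_lists section."""
    sections: dict[str, dict[int, int]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = LIST_LINE.fullmatch(line)
        if match and current is not None:
            sections[current][int(match.group(1))] = int(match.group(2))
        elif line.endswith("_lists:"):
            current = line[:-1]  # drop the trailing colon
            sections[current] = {}
    return sections


def nonempty_orders_at_or_above(lists: dict[int, int], start: int) -> list[int]:
    """Orders >= start whose list holds at least one group."""
    return [order for order in sorted(lists) if order >= start and lists[order] > 0]


avg = parse_mb_structs_summary(SAMPLE)["avg_fragment_size_lists"]
print(nonempty_orders_at_or_above(avg, 9))
print(nonempty_orders_at_or_above(avg, 10))
```

On the numbers above, only order 9 has any groups at or above the order
a 1280-block request would start from, and orders 10-13 are all empty.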

These machines are striped and are using noatime:

$ grep ext4 /proc/mounts
/dev/md127 /state ext4 rw,noatime,stripe=1280 0 0

Is there some tunable or configuration option that I'm missing that
could help here to avoid wasting time in
ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
fail an order 9 allocation anyway?

I'm happy to provide any more details that might help.

Thanks,
Matt

Thread overview: 15+ messages
2025-10-06 11:56 Matt Fleming [this message]
2025-10-08 15:07 ` ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 16:26   ` Theodore Ts'o
2025-10-09 10:22     ` Matt Fleming
2025-10-09 17:52       ` Matt Fleming
2025-10-10  2:04         ` Theodore Ts'o
2025-10-10 12:42           ` Matt Fleming
2025-10-08 16:35   ` Jan Kara
2025-10-09 10:17     ` Matt Fleming
2025-10-09 12:29       ` Jan Kara
2025-10-09 17:21         ` Matt Fleming
2025-10-10 17:23           ` Jan Kara
2025-10-14 10:13             ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50   ` Matt Fleming
