linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matt Fleming <matt@readmodwrite.com>
To: Jan Kara <jack@suse.cz>
Cc: adilger.kernel@dilger.ca, kernel-team@cloudflare.com,
	libaokun1@huawei.com, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	matt@readmodwrite.com, tytso@mit.edu, willy@infradead.org
Subject: Re: ext4 writeback performance issue in 6.12
Date: Thu,  9 Oct 2025 11:17:48 +0100	[thread overview]
Message-ID: <20251009101748.529277-1-matt@readmodwrite.com> (raw)
In-Reply-To: <2nuegl4wtmu3lkprcomfeluii77ofrmkn4ukvbx2gesnqlsflk@yx466sbd7bni>

On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> Hi Matt!
> 
> Nice talking to you again :)
 
Same. It's been too long :)

> On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> 

[...]

> So this particular hang check warning will be silenced by [1]. That being
> said if the writeback is indeed taking longer than expected (depends on
> cgroup configuration etc.) these patches will obviously not fix it. Based
> on what you write below, are you saying that most of the time from these
> 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> most of the time waiting for IO to complete...
 
Yeah, you're right. Most of the time is spenting waiting for writeback
to complete.

> So I'm somewhat confused here. How big is the allocation request? Above you
> write that average size of order 9 bucket is < 1280 which is true and
> makes me assume the allocation is for 1 stripe which is 1280 blocks. But
> here you write about order 9 allocation.
 
Sorry, I muddled my words. The allocation request is for 1280 blocks.

> Anyway, stripe aligned allocations don't always play well with
> mb_optimize_scan logic, so you can try mounting the filesystem with
> mb_optimize_scan=0 mount option.

Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
worse performance. It looks like it's taking a long time to write out
even 1 page to disk. The flusher thread has been running for 20+hours
now non-stop and it's blocking tasks waiting on writeback.

[Thu Oct  9 09:49:59 2025] INFO: task dockerd:45649 blocked for more than 70565 seconds.

mfleming@node:~$ ps -p 50674 -o pid,etime,cputime,comm
    PID     ELAPSED     TIME COMMAND
  50674    20:18:25 20:14:15 kworker/u400:20+flush-9:127

A perf profile shows:

# Overhead  Command          Shared Object      Symbol                             
# ........  ...............  .................  ...................................
#
    32.09%  kworker/u400:20  [kernel.kallsyms]  [k] ext4_get_group_info
            |          
            |--11.91%--ext4_mb_prefetch
            |          ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm
            |          
            |--7.23%--ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm

mfleming@node:~$ sudo perf ftrace latency -b  -p 50674 -T ext4_mb_regular_allocator -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |          0 |                                                |
     1 - 2    us |          0 |                                                |
     2 - 4    us |          0 |                                                |
     4 - 8    us |          0 |                                                |
     8 - 16   us |          0 |                                                |
    16 - 32   us |          0 |                                                |
    32 - 64   us |          0 |                                                |
    64 - 128  us |          0 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |         85 | #############################################  |
   128 - 256  ms |          1 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo perf ftrace latency -b  -p 50674 -T ext4_mb_prefetch -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |        130 |                                                |
     1 - 2    us |    1962306 | ####################################           |
     2 - 4    us |     497793 | #########                                      |
     4 - 8    us |       4598 |                                                |
     8 - 16   us |        277 |                                                |
    16 - 32   us |         21 |                                                |
    32 - 64   us |         10 |                                                |
    64 - 128  us |          1 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |          0 |                                                |
   128 - 256  ms |          0 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo bpftrace -e 'fentry:vmlinux:writeback_sb_inodes / tid==50674/ { @in = args.work->nr_pages; @start=nsecs;} fexit:vmlinux:writeback_sb_inodes /tid == 50674/ { $delta = (nsecs - @start) / 1000000; printf("IN: work->nr_pages=%d, OUT: work->nr_pages=%d, wrote=%d page(s) in %dms\n", @in, args.work->nr_pages, @in - args.work->nr_pages, $delta);} END{clear(@in);} interval:s:5 { exit();}'
Attaching 4 probes...
IN: work->nr_pages=6095831, OUT: work->nr_pages=6095830, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095830, OUT: work->nr_pages=6095829, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095829, OUT: work->nr_pages=6095828, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095828, OUT: work->nr_pages=6095827, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095827, OUT: work->nr_pages=6095826, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095826, OUT: work->nr_pages=6095825, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095825, OUT: work->nr_pages=6095824, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095824, OUT: work->nr_pages=6095823, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095823, OUT: work->nr_pages=6095822, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095822, OUT: work->nr_pages=6095821, wrote=1 page(s) in 106ms

Thanks,
Matt

  reply	other threads:[~2025-10-09 10:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-06 11:56 ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 15:07 ` Matt Fleming
2025-10-08 16:26   ` Theodore Ts'o
2025-10-09 10:22     ` Matt Fleming
2025-10-09 17:52       ` Matt Fleming
2025-10-10  2:04         ` Theodore Ts'o
2025-10-10 12:42           ` Matt Fleming
2025-10-08 16:35   ` Jan Kara
2025-10-09 10:17     ` Matt Fleming [this message]
2025-10-09 12:29       ` Jan Kara
2025-10-09 17:21         ` Matt Fleming
2025-10-10 17:23           ` Jan Kara
2025-10-14 10:13             ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50   ` Matt Fleming

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251009101748.529277-1-matt@readmodwrite.com \
    --to=matt@readmodwrite.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=jack@suse.cz \
    --cc=kernel-team@cloudflare.com \
    --cc=libaokun1@huawei.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).