All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matt Fleming <matt@readmodwrite.com>
To: Jan Kara <jack@suse.cz>
Cc: adilger.kernel@dilger.ca, kernel-team@cloudflare.com,
	libaokun1@huawei.com, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	matt@readmodwrite.com, tytso@mit.edu, willy@infradead.org
Subject: Re: ext4 writeback performance issue in 6.12
Date: Thu,  9 Oct 2025 11:17:48 +0100	[thread overview]
Message-ID: <20251009101748.529277-1-matt@readmodwrite.com> (raw)
In-Reply-To: <2nuegl4wtmu3lkprcomfeluii77ofrmkn4ukvbx2gesnqlsflk@yx466sbd7bni>

On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> Hi Matt!
> 
> Nice talking to you again :)
 
Same. It's been too long :)

> On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> 

[...]

> So this particular hang check warning will be silenced by [1]. That being
> said if the writeback is indeed taking longer than expected (depends on
> cgroup configuration etc.) these patches will obviously not fix it. Based
> on what you write below, are you saying that most of the time from these
> 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> most of the time waiting for IO to complete...
 
Yeah, you're right. Most of the time is spenting waiting for writeback
to complete.

> So I'm somewhat confused here. How big is the allocation request? Above you
> write that average size of order 9 bucket is < 1280 which is true and
> makes me assume the allocation is for 1 stripe which is 1280 blocks. But
> here you write about order 9 allocation.
 
Sorry, I muddled my words. The allocation request is for 1280 blocks.

> Anyway, stripe aligned allocations don't always play well with
> mb_optimize_scan logic, so you can try mounting the filesystem with
> mb_optimize_scan=0 mount option.

Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
worse performance. It looks like it's taking a long time to write out
even 1 page to disk. The flusher thread has been running for 20+hours
now non-stop and it's blocking tasks waiting on writeback.

[Thu Oct  9 09:49:59 2025] INFO: task dockerd:45649 blocked for more than 70565 seconds.

mfleming@node:~$ ps -p 50674 -o pid,etime,cputime,comm
    PID     ELAPSED     TIME COMMAND
  50674    20:18:25 20:14:15 kworker/u400:20+flush-9:127

A perf profile shows:

# Overhead  Command          Shared Object      Symbol                             
# ........  ...............  .................  ...................................
#
    32.09%  kworker/u400:20  [kernel.kallsyms]  [k] ext4_get_group_info
            |          
            |--11.91%--ext4_mb_prefetch
            |          ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm
            |          
            |--7.23%--ext4_mb_regular_allocator
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          ext4_do_writepages
            |          ext4_writepages
            |          do_writepages
            |          __writeback_single_inode
            |          writeback_sb_inodes
            |          __writeback_inodes_wb
            |          wb_writeback
            |          wb_workfn
            |          process_one_work
            |          worker_thread
            |          kthread
            |          ret_from_fork
            |          ret_from_fork_asm

mfleming@node:~$ sudo perf ftrace latency -b  -p 50674 -T ext4_mb_regular_allocator -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |          0 |                                                |
     1 - 2    us |          0 |                                                |
     2 - 4    us |          0 |                                                |
     4 - 8    us |          0 |                                                |
     8 - 16   us |          0 |                                                |
    16 - 32   us |          0 |                                                |
    32 - 64   us |          0 |                                                |
    64 - 128  us |          0 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |         85 | #############################################  |
   128 - 256  ms |          1 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo perf ftrace latency -b  -p 50674 -T ext4_mb_prefetch -- sleep 10
#   DURATION     |      COUNT | GRAPH                                          |
     0 - 1    us |        130 |                                                |
     1 - 2    us |    1962306 | ####################################           |
     2 - 4    us |     497793 | #########                                      |
     4 - 8    us |       4598 |                                                |
     8 - 16   us |        277 |                                                |
    16 - 32   us |         21 |                                                |
    32 - 64   us |         10 |                                                |
    64 - 128  us |          1 |                                                |
   128 - 256  us |          0 |                                                |
   256 - 512  us |          0 |                                                |
   512 - 1024 us |          0 |                                                |
     1 - 2    ms |          0 |                                                |
     2 - 4    ms |          0 |                                                |
     4 - 8    ms |          0 |                                                |
     8 - 16   ms |          0 |                                                |
    16 - 32   ms |          0 |                                                |
    32 - 64   ms |          0 |                                                |
    64 - 128  ms |          0 |                                                |
   128 - 256  ms |          0 |                                                |
   256 - 512  ms |          0 |                                                |
   512 - 1024 ms |          0 |                                                |
     1 - ...   s |          0 |                                                |

mfleming@node:~$ sudo bpftrace -e 'fentry:vmlinux:writeback_sb_inodes / tid==50674/ { @in = args.work->nr_pages; @start=nsecs;} fexit:vmlinux:writeback_sb_inodes /tid == 50674/ { $delta = (nsecs - @start) / 1000000; printf("IN: work->nr_pages=%d, OUT: work->nr_pages=%d, wrote=%d page(s) in %dms\n", @in, args.work->nr_pages, @in - args.work->nr_pages, $delta);} END{clear(@in);} interval:s:5 { exit();}'
Attaching 4 probes...
IN: work->nr_pages=6095831, OUT: work->nr_pages=6095830, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095830, OUT: work->nr_pages=6095829, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095829, OUT: work->nr_pages=6095828, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095828, OUT: work->nr_pages=6095827, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095827, OUT: work->nr_pages=6095826, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095826, OUT: work->nr_pages=6095825, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095825, OUT: work->nr_pages=6095824, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095824, OUT: work->nr_pages=6095823, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095823, OUT: work->nr_pages=6095822, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095822, OUT: work->nr_pages=6095821, wrote=1 page(s) in 106ms

Thanks,
Matt

  reply	other threads:[~2025-10-09 10:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-06 11:56 ext4 writeback performance issue in 6.12 Matt Fleming
2025-10-08 15:07 ` Matt Fleming
2025-10-08 16:26   ` Theodore Ts'o
2025-10-09 10:22     ` Matt Fleming
2025-10-09 17:52       ` Matt Fleming
2025-10-10  2:04         ` Theodore Ts'o
2025-10-10 12:42           ` Matt Fleming
2025-10-08 16:35   ` Jan Kara
2025-10-09 10:17     ` Matt Fleming [this message]
2025-10-09 12:29       ` Jan Kara
2025-10-09 17:21         ` Matt Fleming
2025-10-10 17:23           ` Jan Kara
2025-10-14 10:13             ` Matt Fleming
2025-10-09 12:36 ` Ojaswin Mujoo
2025-10-09 17:50   ` Matt Fleming

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251009101748.529277-1-matt@readmodwrite.com \
    --to=matt@readmodwrite.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=jack@suse.cz \
    --cc=kernel-team@cloudflare.com \
    --cc=libaokun1@huawei.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.