Re: performance regression between 6.1.x and 5.15.x

From: Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: performance regression between 6.1.x and 5.15.x
Date: Tue, 09 May 2023 20:37:52 +0800
Message-ID: <20230509203751.E6D2.409509F4@e16-tech.com>
In-Reply-To: <20230509013625.GS3223426@dread.disaster.area>

Hi,

> On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > > Hi,
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I noticed a performance regression in xfs on 6.1.27/6.1.23
> > > > > compared with xfs on 5.15.110.
> > > > > 
> > > > > It is not yet clear whether the problem is in xfs or lvm2.
> > > > > 
> > > > > Is there any guidance on how to troubleshoot it?
> > > > > 
> > > > > test case:
> > > > >   disk: NVMe PCIe3 SSD *4 
> > > > >   LVM: raid0, default stripe size 64K.
> > > > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > > >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > > >    -directory=/mnt/test
> > > > > 
> > > > > 
> > > > > 6.1.27/6.1.23
> > > > > fio bw=2623MiB/s (2750MB/s)
> > > > > perf report:
> > > > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > > > Overhead  Command  Shared Object        Symbol
> > > > >   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
> > > > >    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
> > > > >    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
> > > > >    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
> > > > >    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
> > > > >    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
> > > > >    2.11%  fio      [kernel.kallsyms]    [k] xas_load
> > > > >    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> > > > > 
> > > > > 5.15.110
> > > > > fio bw=6796MiB/s (7126MB/s)
> > > > > perf report:
> > > > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > > > Overhead  Command  Shared Object       Symbol
> > > > >   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> > > > >    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
> > > > >    4.40%  fio      [kernel.kallsyms]   [k] xas_load
> > > > >    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
> > > > >    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
> > > > >    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
> > > > >    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
> > > > >    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
> > > > >    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
> > > > >    1.41%  fio      [kernel.kallsyms]   [k] xas_start
> > > 
> > > Because you are testing buffered IO, you need to run perf across all
> > > CPUs and tasks, not just the fio process so that it captures the
> > > profile of memory reclaim and writeback that is being performed by
> > > the kernel.
> > 
> > 'perf report' across all CPUs:
> > Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> > Overhead  Command          Shared Object            Symbol
> >   16.29%  fio              [kernel.kallsyms]        [k] rep_movs_alternative
> >    3.38%  kworker/u98:1+f  [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> >    3.11%  fio              [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> >    3.05%  swapper          [kernel.kallsyms]        [k] intel_idle
> >    2.63%  fio              [kernel.kallsyms]        [k] get_page_from_freelist
> >    2.33%  fio              [kernel.kallsyms]        [k] asm_exc_nmi
> >    2.26%  kworker/u98:1+f  [kernel.kallsyms]        [k] __folio_start_writeback
> >    1.40%  fio              [kernel.kallsyms]        [k] __filemap_add_folio
> >    1.37%  fio              [kernel.kallsyms]        [k] lru_add_fn
> >    1.35%  fio              [kernel.kallsyms]        [k] xas_load
> >    1.33%  fio              [kernel.kallsyms]        [k] iomap_write_begin
> >    1.31%  fio              [kernel.kallsyms]        [k] xas_descend
> >    1.19%  kworker/u98:1+f  [kernel.kallsyms]        [k] folio_clear_dirty_for_io
> >    1.07%  fio              [kernel.kallsyms]        [k] folio_add_lru
> >    1.01%  fio              [kernel.kallsyms]        [k] __folio_mark_dirty
> >    1.00%  kworker/u98:1+f  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
> > 
> > and 'top' shows that 'kworker/u98:1' has over 80% CPU usage.
> 
> Can you provide an expanded callgraph profile for both the good and
> bad kernels showing the CPU used in the fio write() path and the
> kworker-based writeback path?

Sorry, could you please give me a more detailed guide on how to gather
this information for the test?

The test machine here is already reserved.
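
For reference, a minimal sketch of what such a capture might look like, assuming the standard `perf record` options `-a` (all CPUs), `-g` (call graphs) and `-o`, and the `perf report` options `-i`, `--stdio`, `--no-children` and `--sort`. The helper names and the 30-second window (chosen to match the fio `-runtime=30`) are illustrative assumptions, not something specified in this thread. The script only prints the command lines; it does not run perf.

```python
import shlex


def perf_record_cmd(seconds, output="perf.data"):
    """Build a system-wide (-a) call-graph (-g) capture lasting `seconds`.

    Hypothetical helper: sampling all CPUs is what lets the profile include
    the kworker writeback threads as well as the fio processes.
    """
    return ["perf", "record", "-a", "-g", "-o", output,
            "--", "sleep", str(seconds)]


def perf_report_cmd(data="perf.data"):
    """Build a text report grouped by task name (comm) and symbol."""
    return ["perf", "report", "-i", data, "--stdio",
            "--no-children", "--sort", "comm,symbol"]


if __name__ == "__main__":
    # Start the capture while fio is running, then generate the report.
    print(shlex.join(perf_record_cmd(30)))
    print(shlex.join(perf_report_cmd()))
```

The same two commands would be run once on the good kernel and once on the bad kernel so that the fio write() path and the kworker writeback path can be compared side by side.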

> [ The test machine I have that I could reproduce this sort of
> performance anomaly went bad a month ago, so I have no hardware
> available to me right now to reproduce this behaviour locally.
> Hence I'll need you to do the profiling I need to understand the
> regression for me. ]
> 
> > > I suspect that the likely culprit is mm-level changes - the
> > > page reclaim algorithm was completely replaced in 6.1 with a
> > > multi-generation LRU that will have different cache footprint
> > > behaviour in exactly this sort of "repeatedly over-write same files
> > > in a set that are significantly larger than memory" micro-benchmark.
> > > 
> > > i.e. these commits:
> > > 
> > > 07017acb0601 mm: multi-gen LRU: admin guide
> > > d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> > > 1332a809d95a mm: multi-gen LRU: thrashing prevention
> > > 354ed5974429 mm: multi-gen LRU: kill switch
> > > f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> > > bd74fdaea146 mm: multi-gen LRU: support page table walks
> > > 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> > > ac35a4902374 mm: multi-gen LRU: minimal implementation
> > > ec1c86b25f4b mm: multi-gen LRU: groundwork
> > > 
> > > If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> > > same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> > > as you've reported.
> > > I tested 6.4.0-rc1; the performance became a little worse.
> 
> Thanks, that's as I expected.
> 
> Which means that the interesting kernel versions to check now are a
> 6.0.x kernel, and then if it has the same perf as 5.15.x, then the
> commit before the multi-gen LRU was introduced vs the commit after
> the multi-gen LRU was introduced to see if that is the functionality
> that introduced the regression....

More performance test results:

linux 6.0.18
	fio WRITE: bw=2565MiB/s (2689MB/s)
linux 5.17.0
	fio WRITE: bw=2602MiB/s (2729MB/s)
linux 5.16.20
	fio WRITE: bw=7666MiB/s (8039MB/s)

So was the regression introduced between 5.16.20 and 5.17.0?
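
As a quick sanity check on that reading, here is a small sketch that puts the fio bandwidth numbers from this thread in kernel-version order and picks the adjacent pair with the largest relative drop. The function name is illustrative, and the 6.1.x figure is the one reported for "6.1.27/6.1.23"; no number was given for 6.4.0-rc1, so it is left out.

```python
# fio WRITE bandwidth in MiB/s as reported in this thread, in version order.
results = [
    ("5.15.110", 6796),
    ("5.16.20", 7666),
    ("5.17.0", 2602),
    ("6.0.18", 2565),
    ("6.1.27", 2623),  # reported for 6.1.27/6.1.23
]


def largest_drop(samples):
    """Return (good, bad, ratio) for the adjacent pair with the biggest drop.

    `ratio` is bad bandwidth divided by good bandwidth, so a smaller value
    means a larger regression.
    """
    pairs = zip(samples, samples[1:])
    (good, good_bw), (bad, bad_bw) = min(pairs, key=lambda p: p[1][1] / p[0][1])
    return good, bad, bad_bw / good_bw


good, bad, ratio = largest_drop(results)
print(f"last good: {good}, first bad: {bad}, bandwidth ratio: {ratio:.2f}")
```

If this holds up on repeated runs, a `git bisect` between the v5.16 and v5.17 mainline tags would probably be the next step, since 5.16.20 is a stable update on top of v5.16.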

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/09
