From: Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: performance regression between 6.1.x and 5.15.x
Date: Tue, 09 May 2023 07:25:53 +0800 [thread overview]
Message-ID: <20230509072552.7E8B.409509F4@e16-tech.com> (raw)
In-Reply-To: <20230508223243.GR3223426@dread.disaster.area>
Hi,
> On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > Hi,
> > >
> > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > with the compare to xfs 5.15.110.
> > >
> > > It is yet not clear whether it is a problem of xfs or lvm2.
> > >
> > > any guide to troubleshoot it?
> > >
> > > test case:
> > > disk: NVMe PCIe3 SSD *4
> > > LVM: raid0 default strip size 64K.
> > > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > -directory=/mnt/test
> > >
> > >
> > > 6.1.27/6.1.23
> > > fio bw=2623MiB/s (2750MB/s)
> > > perf report:
> > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > Overhead Command Shared Object Symbol
> > > 31.07% fio [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> > > 5.11% fio [kernel.kallsyms] [k] iomap_set_range_uptodate.part.24
> > > 3.36% fio [kernel.kallsyms] [k] asm_exc_nmi
> > > 3.29% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> > > 2.27% fio [kernel.kallsyms] [k] iomap_write_begin
> > > 2.18% fio [kernel.kallsyms] [k] get_page_from_freelist
> > > 2.11% fio [kernel.kallsyms] [k] xas_load
> > > 2.10% fio [kernel.kallsyms] [k] xas_descend
> > >
> > > 5.15.110
> > > fio bw=6796MiB/s (7126MB/s)
> > > perf report:
> > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > Overhead Command Shared Object Symbol
> > > 38.09% fio [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> > > 6.76% fio [kernel.kallsyms] [k] iomap_set_range_uptodate
> > > 4.40% fio [kernel.kallsyms] [k] xas_load
> > > 3.94% fio [kernel.kallsyms] [k] get_page_from_freelist
> > > 3.04% fio [kernel.kallsyms] [k] asm_exc_nmi
> > > 1.97% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> > > 1.88% fio [kernel.kallsyms] [k] __pagevec_lru_add
> > > 1.53% fio [kernel.kallsyms] [k] iomap_write_begin
> > > 1.53% fio [kernel.kallsyms] [k] __add_to_page_cache_locked
> > > 1.41% fio [kernel.kallsyms] [k] xas_start
>
> Because you are testing buffered IO, you need to run perf across all
> CPUs and tasks, not just the fio process so that it captures the
> profile of memory reclaim and writeback that is being performed by
> the kernel.
'perf report' of all CPU.
Samples: 211K of event 'cycles', Event count (approx.): 56590727219
Overhead Command Shared Object Symbol
16.29% fio [kernel.kallsyms] [k] rep_movs_alternative
3.38% kworker/u98:1+f [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
3.11% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
3.05% swapper [kernel.kallsyms] [k] intel_idle
2.63% fio [kernel.kallsyms] [k] get_page_from_freelist
2.33% fio [kernel.kallsyms] [k] asm_exc_nmi
2.26% kworker/u98:1+f [kernel.kallsyms] [k] __folio_start_writeback
1.40% fio [kernel.kallsyms] [k] __filemap_add_folio
1.37% fio [kernel.kallsyms] [k] lru_add_fn
1.35% fio [kernel.kallsyms] [k] xas_load
1.33% fio [kernel.kallsyms] [k] iomap_write_begin
1.31% fio [kernel.kallsyms] [k] xas_descend
1.19% kworker/u98:1+f [kernel.kallsyms] [k] folio_clear_dirty_for_io
1.07% fio [kernel.kallsyms] [k] folio_add_lru
1.01% fio [kernel.kallsyms] [k] __folio_mark_dirty
1.00% kworker/u98:1+f [kernel.kallsyms] [k] _raw_spin_lock_irqsave
and 'top' show that 'kworker/u98:1' have over 80% CPU usage.
> > more info:
> >
> > 1, 6.2.14 have same performance as 6.1.x
> >
> > 2, 6.1.x fio performance detail:
> > Jobs: 4 (f=4): [W(4)][16.7%][w=10.2GiB/s][w=10.4k IOPS][eta 00m:15s]
> > Jobs: 4 (f=4): [W(4)][25.0%][w=9949MiB/s][w=9949 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][31.2%][w=9618MiB/s][w=9618 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][37.5%][w=7970MiB/s][w=7970 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][41.2%][w=5048MiB/s][w=5047 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][42.1%][w=2489MiB/s][w=2488 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][42.9%][w=3227MiB/s][w=3226 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][45.5%][w=3622MiB/s][w=3622 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][47.8%][w=3651MiB/s][w=3650 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][52.2%][w=3435MiB/s][w=3435 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=2464MiB/s][w=2463 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][53.8%][w=2438MiB/s][w=2438 IOPS][eta 00m:12s]
>
> Looks like it's throttled on dirty pages at this point.
>
> How much memory does your test system have, and what does changing
> the writeback throttling thresholds do? What's the numa layout?
This is a DELL server with two E5-2680 v2 configured as NUMA.
There is 192G memory in this server,
128G memory is installed in CPU1,
64G memory is installed in CPU2.
and sysctl config:
vm.dirty_background_bytes = 1073741824 # 1G
vm.dirty_background_ratio = 0
vm.dirty_bytes = 8589934592 # 8G
vm.dirty_ratio = 0
and NVMe SSD *4 are connected to one NVMe hba / CPU1.
> Watching the stats in /proc/meminfo would be useful. I tend to use
> Performance Co-Pilot (PCP) to collect these sorts of stats and plot
> them in real time so I can see how the state of the machine is
> changing as the test is running....
>
>
> > Jobs: 4 (f=4): [W(4)][55.6%][w=2435MiB/s][w=2434 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][57.1%][w=2449MiB/s][w=2448 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][60.7%][w=2422MiB/s][w=2421 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.1%][w=2457MiB/s][w=2457 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][63.3%][w=2436MiB/s][w=2436 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][64.5%][w=2432MiB/s][w=2431 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][67.7%][w=2440MiB/s][w=2440 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][71.0%][w=2437MiB/s][w=2437 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][74.2%][w=2442MiB/s][w=2442 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][77.4%][w=2425MiB/s][w=2424 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][80.6%][w=2459MiB/s][w=2459 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(4)][86.7%][w=2428MiB/s][w=2427 IOPS][eta 00m:04s]
> > Jobs: 4 (f=4): [W(4)][90.0%][w=2441MiB/s][w=2440 IOPS][eta 00m:03s]
> > Jobs: 4 (f=4): [W(4)][93.3%][w=2438MiB/s][w=2437 IOPS][eta 00m:02s]
> > Jobs: 4 (f=4): [W(4)][96.7%][w=2450MiB/s][w=2449 IOPS][eta 00m:01s]
> > Jobs: 4 (f=4): [W(4)][100.0%][w=2430MiB/s][w=2430 IOPS][eta 00m:00s]
> > Jobs: 4 (f=4): [F(4)][100.0%][w=2372MiB/s][w=2372 IOPS][eta 00m:00s]
>
> fsync at the end is instant, which indicates that writing into the
> kernel is almost certainly being throttled.
>
> > 5.15 fio performance detail:
> > Jobs: 4 (f=4): [W(4)][14.3%][w=8563MiB/s][w=8563 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][18.2%][w=6376MiB/s][w=6375 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][20.8%][w=4566MiB/s][w=4565 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][23.1%][w=3947MiB/s][w=3947 IOPS][eta 00m:20s]
>
> So throttling starts and perf drops...
>
> > Jobs: 4 (f=4): [W(4)][25.9%][w=4601MiB/s][w=4601 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][28.6%][w=5797MiB/s][w=5796 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][32.1%][w=6802MiB/s][w=6801 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][35.7%][w=7411MiB/s][w=7411 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][40.7%][w=8445MiB/s][w=8444 IOPS][eta 00m:16s]
>
> .... and then it picks back up....
>
> > Jobs: 4 (f=4): [W(4)][46.2%][w=7992MiB/s][w=7992 IOPS][eta 00m:14s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=8118MiB/s][w=8117 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][56.0%][w=7742MiB/s][w=7741 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.5%][w=7497MiB/s][w=7496 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][66.7%][w=7248MiB/s][w=7248 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][70.8%][w=7461MiB/s][w=7460 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][75.0%][w=7959MiB/s][w=7959 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(3),F(1)][79.2%][w=6982MiB/s][w=6982 IOPS][eta 00m:05s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][87.0%][w=2809MiB/s][w=2808 IOPS][eta 00m:03s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][95.5%][w=2669MiB/s][w=2668 IOPS][eta 00m:01s]
> > Jobs: 1 (f=1): [_(2),F(1),_(1)][100.0%][w=2552MiB/s][w=2552 IOPS][eta 00m:00s]
>
> I suspect that the likely culprit is mm-level changes - the
> page reclaim algorithm was completely replaced in 6.1 with a
> multi-generation LRU that will have different cache footprint
> behaviour in exactly this sort of "repeatedly over-write same files
> in a set that are significantly larger than memory" micro-benchmark.
>
> i.e. these commits:
>
> 07017acb0601 mm: multi-gen LRU: admin guide
> d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> 1332a809d95a mm: multi-gen LRU: thrashing prevention
> 354ed5974429 mm: multi-gen LRU: kill switch
> f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> bd74fdaea146 mm: multi-gen LRU: support page table walks
> 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> ac35a4902374 mm: multi-gen LRU: minimal implementation
> ec1c86b25f4b mm: multi-gen LRU: groundwork
>
> If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> as you've reported.
I tested 6.4.0-rc1. the performance become a little worse.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/09
> So I'm thinking that the best thing to do is confirm that the change
> of behaviour is a result of the multi-gen LRU changes. If it is,
> then it's up to the multi-gen LRU developers to determine how to fix
> it. If it's not the multi-gen LRU, we'll have to kep digging.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2023-05-08 23:26 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-08 9:24 performance regression between 6.1.x and 5.15.x Wang Yugui
2023-05-08 14:46 ` Wang Yugui
2023-05-08 22:32 ` Dave Chinner
2023-05-08 23:25 ` Wang Yugui [this message]
2023-05-09 1:36 ` Dave Chinner
2023-05-09 12:37 ` Wang Yugui
2023-05-09 22:14 ` Dave Chinner
2023-05-10 5:46 ` Wang Yugui
2023-05-10 7:27 ` Dave Chinner
2023-05-10 8:50 ` Wang Yugui
2023-05-11 1:34 ` Dave Chinner
2023-05-17 13:07 ` Wang Yugui
2023-05-17 22:11 ` Dave Chinner
2023-05-18 18:36 ` Creating large folios in iomap buffered write path Matthew Wilcox
2023-05-18 21:46 ` Matthew Wilcox
2023-05-18 22:03 ` Matthew Wilcox
2023-05-19 2:55 ` Wang Yugui
2023-05-19 15:38 ` Matthew Wilcox
2023-05-20 13:35 ` Wang Yugui
2023-05-20 16:35 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230509072552.7E8B.409509F4@e16-tech.com \
--to=wangyugui@e16-tech.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.