From: Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner <david@fromorbit.com>, Matthew Wilcox <willy@infradead.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: performance regression between 6.1.x and 5.15.x
Date: Wed, 17 May 2023 21:07:41 +0800
Message-ID: <20230517210740.6464.409509F4@e16-tech.com>
In-Reply-To: <20230511013410.GY3223426@dread.disaster.area>
Hi,
> On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> > Hi,
> >
> >
> > > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > > there are only two commits between 5.16..5.17 that might impact
> > > > > performance:
> > > > >
> > > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > > >
> > > > > and
> > > > >
> > > > > 6795801366da ("xfs: Support large folios")
> > > > >
> > > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > > fs/iomap/buffered-io.c and change:
> > > > >
> > > > > -#define IOEND_BATCH_SIZE 4096
> > > > > +#define IOEND_BATCH_SIZE 1048576
> > > > >
> > > > > This will increase the IO submission chain lengths to at least 4GB
> > > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > > >
> > > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > > ensure the page cache only instantiates single page folios the same
> > > > > as 5.16 would have.
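> > > > >
> > > > > As a minimal sketch against 6.1 (the two call sites should be in
> > > > > xfs_inode_alloc() and xfs_reinit_inode()), that change looks like:
> > > > >
> > > > > -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > > > +	/* mapping_set_large_folios(VFS_I(ip)->i_mapping); */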
> > > >
> > > > 6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
> > > > fio WRITE: bw=6451MiB/s (6764MB/s)
> > > >
> > > > There is still a performance regression compared to linux 5.16.20:
> > > > fio WRITE: bw=7666MiB/s (8039MB/s),
> > > >
> > > > but the performance regression is not too big, so it is difficult
> > > > to bisect. We noticed a similar level of performance regression on
> > > > btrfs too, so maybe it is a problem in code that is used by both
> > > > btrfs and xfs, such as iomap and mm/folio.
> > >
> > > Yup, that's quite possibly something like the multi-gen LRU changes,
> > > but that's not the regression we need to find. :/
> > >
> > > > 6.1.x with 'mapping_set_large_folios removed' only:
> > > > fio WRITE: bw=2676MiB/s (2806MB/s)
> > > >
> > > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
> > > > fio WRITE: bw=5092MiB/s (5339MB/s),
> > > > fio WRITE: bw=6076MiB/s (6371MB/s)
> > > >
> > > > maybe we need a further fix on top of ebb7fb1557b1 ("xfs, iomap:
> > > > limit individual ioend chain lengths in writeback").
> > >
> > > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > > fast) and record the output of `iostat -dxm 1` whilst the
> > > fio test is running? I want to see what the overall differences in
> > > the IO load on the devices are between the two runs. This will tell
> > > us how the IO sizes and queue depths change between the two kernels,
> > > etc.
> >
> > The `iostat -dxm 1` results are saved in the attached files:
> > good.txt - good performance
> > bad.txt  - bad performance
>
> Thanks!
>
> What I see here is that neither the good nor the bad config is able
> to drive the hardware to 100% utilisation, but the way the IO stack
> is behaving is identical. The only difference is that
> the good config is driving much more IO to the devices, such that
> the top level RAID0 stripe reports ~90% utilisation vs 50%
> utilisation.
>
> What this says to me is that the throughput limitation is the
> single-threaded background IO submission (the bdi-flush thread) being
> CPU bound in both cases, and that the difference is in how much CPU
> each IO submission consumes.
>
> From some tests here at lower bandwidth (1-2GB/s) with a batch size
> of 4096, I'm seeing the vast majority of submission CPU time being
> spent in folio_start_writeback(), and the vast majority of CPU time
> in IO completion being spent in folio_end_writeback(). There's an
> order of magnitude more CPU time in these functions than in any of
> the XFS or iomap writeback functions.
>
> A typical 5 second expanded snapshot profile (from `perf top -g -U`)
> of the bdi-flusher thread looks like this:
>
> 99.22% 3.68% [kernel] [k] write_cache_pages
> - 65.13% write_cache_pages
> - 46.84% iomap_do_writepage
> - 35.50% __folio_start_writeback
> - 7.94% _raw_spin_lock_irqsave
> - 11.35% do_raw_spin_lock
> __pv_queued_spin_lock_slowpath
> - 5.37% _raw_spin_unlock_irqrestore
> - 5.32% do_raw_spin_unlock
> __raw_callee_save___pv_queued_spin_unlock
> - 0.92% asm_common_interrupt
> common_interrupt
> __common_interrupt
> handle_edge_irq
> handle_irq_event
> __handle_irq_event_percpu
> vring_interrupt
> virtblk_done
> - 4.18% __mod_lruvec_page_state
> - 2.18% __mod_lruvec_state
> 1.16% __mod_node_page_state
> 0.68% __mod_memcg_lruvec_state
> 0.90% __mod_memcg_lruvec_state
> 2.88% xas_descend
> 1.63% percpu_counter_add_batch
> 1.63% mod_zone_page_state
> 1.15% xas_load
> 1.11% xas_start
> 0.93% __rcu_read_unlock
> - 0.89% folio_memcg_lock
> 0.63% asm_common_interrupt
> common_interrupt
> __common_interrupt
> handle_edge_irq
> handle_irq_event
> __handle_irq_event_percpu
> vring_interrupt
> virtblk_done
> virtblk_complete_batch
> blk_mq_end_request_batch
> bio_endio
> iomap_writepage_end_bio
> iomap_finish_ioend
> - 2.75% xfs_map_blocks
> - 1.55% __might_sleep
> 1.26% __might_resched
> - 1.90% bio_add_folio
> 1.13% __bio_try_merge_page
> - 1.82% submit_bio
> - submit_bio_noacct
> - 1.82% submit_bio_noacct_nocheck
> - __submit_bio
> 1.77% blk_mq_submit_bio
> 1.27% inode_to_bdi
> 1.19% xas_clear_mark
> 0.65% xas_set_mark
> 0.57% iomap_page_create.isra.0
> - 12.91% folio_clear_dirty_for_io
> - 2.72% __mod_lruvec_page_state
> - 1.84% __mod_lruvec_state
> 0.98% __mod_node_page_state
> 0.58% __mod_memcg_lruvec_state
> 1.55% mod_zone_page_state
> 1.49% percpu_counter_add_batch
> - 0.72% asm_common_interrupt
> common_interrupt
> __common_interrupt
> handle_edge_irq
> handle_irq_event
> __handle_irq_event_percpu
> vring_interrupt
> virtblk_done
> virtblk_complete_batch
> blk_mq_end_request_batch
> bio_endio
> iomap_writepage_end_bio
> iomap_finish_ioend
> 0.55% folio_mkclean
> - 8.08% filemap_get_folios_tag
> 1.84% xas_find_marked
> - 1.89% __pagevec_release
> 1.87% release_pages
> - 1.65% __might_sleep
> 1.33% __might_resched
> 1.22% folio_unlock
> - 3.68% ret_from_fork
> kthread
> worker_thread
> process_one_work
> wb_workfn
> wb_writeback
> __writeback_inodes_wb
> writeback_sb_inodes
> __writeback_single_inode
> do_writepages
> xfs_vm_writepages
> iomap_writepages
> write_cache_pages
>
> This indicates that 35% of writeback submission CPU is in
> __folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
> is in filemap_get_folios_tag() and only ~8% of CPU time is in the
> rest of the iomap/XFS code building and submitting bios from the
> folios passed to it. i.e. it looks a lot like writeback is
> contending with the incoming write(), IO completion and memory
> reclaim contexts for access to the page cache mapping and mm
> accounting structures.
>
> Unfortunately, I don't have access to hardware that I can use to
> confirm this is the cause, but it doesn't look like it's directly an
> XFS/iomap issue at this point. The larger batch sizes reduce both
> memory reclaim and IO completion competition with submission, so it
> kinda points in this direction.
>
> I suspect we need to start using high order folios in the write path
> where we have large user IOs for streaming writes, but I also wonder
> if there isn't some sort of batched accounting/mapping tree updates
> we could do for all the adjacent folios in a single bio....
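>
> Purely as an illustration of the latter idea (an untested sketch; the
> two helpers marked "hypothetical" below do not exist), the completion
> path in iomap_finish_ioend() currently ends writeback one folio at a
> time:
>
> 	bio_for_each_folio_all(fi, bio)
> 		iomap_finish_folio_write(inode, fi.folio, fi.length, error);
>
> Each of those calls ends up in folio_end_writeback(), taking the
> mapping/memcg locks per folio. A batched variant might accumulate the
> page counts across the bio and update the accounting once:
>
> 	long nr = 0;
>
> 	bio_for_each_folio_all(fi, bio) {
> 		nr += folio_nr_pages(fi.folio);
> 		/* hypothetical: end writeback without per-folio stats */
> 		iomap_finish_folio_write_nostat(inode, fi.folio,
> 				fi.length, error);
> 	}
> 	/* hypothetical: one batched accounting update per bio */
> 	mod_mapping_writeback_stats(mapping, -nr);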
Is there any comment from Matthew Wilcox,
since it seems to be a folio problem?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/17