Re: performance regression between 6.1.x and 5.15.x

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner <david@fromorbit.com>, Matthew Wilcox <willy@infradead.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: performance regression between 6.1.x and 5.15.x
Date: Wed, 17 May 2023 21:07:41 +0800	[thread overview]
Message-ID: <20230517210740.6464.409509F4@e16-tech.com> (raw)
In-Reply-To: <20230511013410.GY3223426@dread.disaster.area>

Hi,

> On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > 
> > > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > > there are only two commits between 5.16..5.17 that might impact
> > > > > performance:
> > > > > 
> > > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > > > 
> > > > > and
> > > > > 
> > > > > 6795801366da ("xfs: Support large folios")
> > > > > 
> > > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > > fs/iomap/buffered-io.c and change:
> > > > > 
> > > > > -#define IOEND_BATCH_SIZE        4096
> > > > > +#define IOEND_BATCH_SIZE        1048576
> > > > > This will increase the IO submission chain lengths to at least 4GB
> > > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > > > 
> > > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > > ensure the page cache only instantiates single page folios the same
> > > > > as 5.16 would have.
> > > > 
> > > > 6.1.x with 'mapping_set_large_folios remove' and 'IOEND_BATCH_SIZE=1048576'
> > > > 	fio WRITE: bw=6451MiB/s (6764MB/s)
> > > > 
> > > > still  performance regression when compare to linux 5.16.20
> > > > 	fio WRITE: bw=7666MiB/s (8039MB/s),
> > > > 
> > > > but the performance regression is not too big, then difficult to bisect.
> > > > We noticed samle level  performance regression  on btrfs too.
> > > > so maby some problem of some code that is  used by both btrfs and xfs
> > > > such as iomap and mm/folio.
> > > 
> > > Yup, that's quite possibly something like the multi-gen LRU changes,
> > > but that's not the regression we need to find. :/
> > > 
> > > > 6.1.x  with 'mapping_set_large_folios remove' only'
> > > > 	fio   WRITE: bw=2676MiB/s (2806MB/s)
> > > > 
> > > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only'
> > > > 	fio WRITE: bw=5092MiB/s (5339MB/s),
> > > > 	fio  WRITE: bw=6076MiB/s (6371MB/s)
> > > > 
> > > > maybe we need more fix or ' ebb7fb1557b1 ("xfs, iomap: limit
> > > > individual ioend chain lengths in writeback")'.
> > > 
> > > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > > fast) and record the output of `iostat -dxm 1` whilst the
> > > fio test is running? I want to see what the overall differences in
> > > the IO load on the devices are between the two runs. This will tell
> > > us how the IO sizes and queue depths change between the two kernels,
> > > etc.
> > 
> > `iostat -dxm 1` result saved in attachment file.
> > good.txt	good performance
> > bad.txt		bad performance
> 
> Thanks!
> 
> What I see here is that neither the good or the bad config are able
> to drive the hardware to 100% utilisation, but the way the IO stack
> is behaving is identical. The only difference is that
> the good config is driving much more IO to the devices, such that
> the top level RAID0 stripe reports ~90% utilisation vs 50%
> utilisation.
> 
> What this says to me is that the limitation in throughput is the
> single threaded background IO submission (the bdi-flush thread) is
> CPU bound in both cases, and that the difference is in how much CPU
> each IO submission is consuming.
> 
> From some tests here at lower bandwidth (1-2GB/s) with a batch size
> of 4096, I'm seeing the vast majority of submission CPU time being
> spent in folio_start_writeback(), and the vast majority of CPU time
> in IO completion being spent in folio_end_writeback. There's an
> order of magnitude more CPU time in these functions than in any of
> the XFS or iomap writeback functions.
> 
> A typical 5 second expanded snapshot profile (from `perf top -g -U`)
> of the bdi-flusher thread looks like this:
> 
>    99.22%     3.68%  [kernel]  [k] write_cache_pages
>    - 65.13% write_cache_pages
>       - 46.84% iomap_do_writepage
>          - 35.50% __folio_start_writeback
>             - 7.94% _raw_spin_lock_irqsave
>                - 11.35% do_raw_spin_lock
>                     __pv_queued_spin_lock_slowpath
>             - 5.37% _raw_spin_unlock_irqrestore
>                - 5.32% do_raw_spin_unlock
>                     __raw_callee_save___pv_queued_spin_unlock
>                - 0.92% asm_common_interrupt
>                     common_interrupt
>                     __common_interrupt
>                     handle_edge_irq
>                     handle_irq_event
>                     __handle_irq_event_percpu
>                     vring_interrupt
>                     virtblk_done
>             - 4.18% __mod_lruvec_page_state
>                - 2.18% __mod_lruvec_state
>                     1.16% __mod_node_page_state
>                     0.68% __mod_memcg_lruvec_state
>                  0.90% __mod_memcg_lruvec_state
>               2.88% xas_descend
>               1.63% percpu_counter_add_batch
>               1.63% mod_zone_page_state
>               1.15% xas_load
>               1.11% xas_start
>               0.93% __rcu_read_unlock
>             - 0.89% folio_memcg_lock
>               0.63% asm_common_interrupt
>                  common_interrupt
>                  __common_interrupt
>                  handle_edge_irq
>                  handle_irq_event
>                  __handle_irq_event_percpu
>                  vring_interrupt
>                  virtblk_done
>                  virtblk_complete_batch
>                  blk_mq_end_request_batch
>                  bio_endio
>                  iomap_writepage_end_bio
>                  iomap_finish_ioend
>          - 2.75% xfs_map_blocks
>             - 1.55% __might_sleep
>                  1.26% __might_resched
>          - 1.90% bio_add_folio
>               1.13% __bio_try_merge_page
>          - 1.82% submit_bio
>             - submit_bio_noacct
>                - 1.82% submit_bio_noacct_nocheck
>                   - __submit_bio
>                        1.77% blk_mq_submit_bio
>            1.27% inode_to_bdi
>            1.19% xas_clear_mark
>            0.65% xas_set_mark
>            0.57% iomap_page_create.isra.0
>       - 12.91% folio_clear_dirty_for_io
>          - 2.72% __mod_lruvec_page_state
>             - 1.84% __mod_lruvec_state
>                  0.98% __mod_node_page_state
>                  0.58% __mod_memcg_lruvec_state
>            1.55% mod_zone_page_state
>            1.49% percpu_counter_add_batch
>          - 0.72% asm_common_interrupt
>               common_interrupt
>               __common_interrupt
>               handle_edge_irq
>               handle_irq_event
>               __handle_irq_event_percpu
>               vring_interrupt
>               virtblk_done
>               virtblk_complete_batch
>               blk_mq_end_request_batch
>               bio_endio
>               iomap_writepage_end_bio
>               iomap_finish_ioend
>            0.55% folio_mkclean
>       - 8.08% filemap_get_folios_tag
>            1.84% xas_find_marked
>       - 1.89% __pagevec_release
>            1.87% release_pages
>       - 1.65% __might_sleep
>            1.33% __might_resched
>         1.22% folio_unlock
>    - 3.68% ret_from_fork
>         kthread
>         worker_thread
>         process_one_work
>         wb_workfn
>         wb_writeback
>         __writeback_inodes_wb
>         writeback_sb_inodes
>         __writeback_single_inode
>         do_writepages
>         xfs_vm_writepages
>         iomap_writepages
>         write_cache_pages
> 
> This indicates that 35% of writeback submission CPU is in
> __folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
> is in filemap_get_folios_tag() and only ~8% of CPU time is in the
> rest of the iomap/XFS code building and submitting bios from the
> folios passed to it.  i.e.  it looks a lot like writeback is is
> contending with the incoming write(), IO completion and memory
> reclaim contexts for access to the page cache mapping and mm
> accounting structures.
> 
> Unfortunately, I don't have access to hardware that I can use to
> confirm this is the cause, but it doesn't look like it's directly an
> XFS/iomap issue at this point. The larger batch sizes reduce both
> memory reclaim and IO completion competition with submission, so it
> kinda points in this direction.
> 
> I suspect we need to start using high order folios in the write path
> where we have large user IOs for streaming writes, but I also wonder
> if there isn't some sort of batched accounting/mapping tree updates
> we could do for all the adjacent folios in a single bio....


Is there some comment from Matthew Wilcox?
since it seems a folios problem?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/17

next prev parent reply	other threads:[~2023-05-17 13:07 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-08  9:24 performance regression between 6.1.x and 5.15.x Wang Yugui
2023-05-08 14:46 ` Wang Yugui
2023-05-08 22:32   ` Dave Chinner
2023-05-08 23:25     ` Wang Yugui
2023-05-09  1:36       ` Dave Chinner
2023-05-09 12:37         ` Wang Yugui
2023-05-09 22:14           ` Dave Chinner
2023-05-10  5:46             ` Wang Yugui
2023-05-10  7:27               ` Dave Chinner
2023-05-10  8:50                 ` Wang Yugui
2023-05-11  1:34                   ` Dave Chinner
2023-05-17 13:07                     ` Wang Yugui [this message]
2023-05-17 22:11                       ` Dave Chinner
2023-05-18 18:36                       ` Creating large folios in iomap buffered write path Matthew Wilcox
2023-05-18 21:46                         ` Matthew Wilcox
2023-05-18 22:03                           ` Matthew Wilcox
2023-05-19  2:55                             ` Wang Yugui
2023-05-19 15:38                               ` Matthew Wilcox
2023-05-20 13:35                                 ` Wang Yugui
2023-05-20 16:35                                   ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230517210740.6464.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.