From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 00/14 v6] xfs: improve CIL scalability
Date: Fri, 19 Nov 2021 10:15:54 +1100 [thread overview]
Message-ID: <20211118231554.GY449541@dread.disaster.area> (raw)
In-Reply-To: <20211109015240.1547991-1-david@fromorbit.com>
FYI: I've just rebased the git tree branch containing this code
on the V7 version of the xlog_write() rework patch set I just
posted.
No changes to this series were made in the rebase.
Cheers,
Dave.
On Tue, Nov 09, 2021 at 12:52:26PM +1100, Dave Chinner wrote:
> Time to try again to get this code merged.
>
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
>
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.
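The per-cpu counter part of this design can be sketched in user space. All names here (cilpcp, cil_account_space, cil_gather_space) are illustrative stand-ins, not the actual kernel structures; the point is that the commit fast path only touches CPU-local state and the counters are summed only at push time:

```c
/*
 * Illustrative sketch of per-cpu CIL space accounting. Each CPU
 * accumulates the bytes it has added to the CIL in a private counter,
 * so the commit fast path touches no shared cachelines. The counters
 * are only summed (and reset) when the CIL push gathers everything up.
 */
#include <assert.h>

#define NR_CPUS 4

struct cilpcp {
	long space_used;	/* bytes this CPU has added to the CIL */
};

static struct cilpcp pcp[NR_CPUS];

/* Fast path: called on every transaction commit, CPU-local only. */
static void cil_account_space(int cpu, long bytes)
{
	pcp[cpu].space_used += bytes;
}

/* Slow path: only the CIL push work sums the per-cpu counters. */
static long cil_gather_space(void)
{
	long total = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		total += pcp[cpu].space_used;
		pcp[cpu].space_used = 0;
	}
	return total;
}
```

In the kernel the fast path would use this_cpu_add() style operations rather than an explicit cpu index, but the aggregation structure is the same.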
>
> This results in transaction commit rates exceeding 1.6 million
> commits/s on certain unlink workloads, and while the log lock
> contention is largely gone there is still significant lock
> contention at the VFS at 600,000 transactions/s:
>
> 19.39% [kernel] [k] __pv_queued_spin_lock_slowpath
> 6.40% [kernel] [k] do_raw_spin_lock
> 4.07% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
> 3.08% [kernel] [k] memcpy_erms
> 1.93% [kernel] [k] xfs_buf_find
> 1.69% [kernel] [k] xlog_cil_commit
> 1.50% [kernel] [k] syscall_exit_to_user_mode
> 1.18% [kernel] [k] memset_erms
>
>
> - 64.23% 0.22% [kernel] [k] path_openat
> - 64.01% path_openat
> - 48.69% xfs_vn_create
> - 48.60% xfs_generic_create
> - 40.96% xfs_create
> - 20.39% xfs_dir_ialloc
> - 7.05% xfs_setup_inode
> >>>>> - 6.87% inode_sb_list_add
> - 6.54% _raw_spin_lock
> - 6.53% do_raw_spin_lock
> 6.08% __pv_queued_spin_lock_slowpath
> .....
> - 11.27% xfs_trans_commit
> - 11.23% __xfs_trans_commit
> - 10.85% xlog_cil_commit
> 2.47% memcpy_erms
> - 1.77% xfs_buf_item_committing
> - 1.70% xfs_buf_item_release
> - 0.79% xfs_buf_unlock
> 0.68% up
> 0.61% xfs_buf_rele
> 0.80% xfs_buf_item_format
> 0.73% xfs_inode_item_format
> 0.68% xfs_buf_item_size
> - 0.55% kmem_alloc_large
> - 0.55% kmem_alloc
> 0.52% __kmalloc
> .....
> - 7.08% d_instantiate
> - 6.66% security_d_instantiate
> >>>>>> - 6.63% selinux_d_instantiate
> - 6.48% inode_doinit_with_dentry
> - 6.11% _raw_spin_lock
> - 6.09% do_raw_spin_lock
> 5.60% __pv_queued_spin_lock_slowpath
> ....
> - 1.77% terminate_walk
> >>>>>> - 1.69% dput
> - 1.55% _raw_spin_lock
> - do_raw_spin_lock
> 1.19% __pv_queued_spin_lock_slowpath
>
>
> But when we extend out to 1.5M commits/s we see that the contention
> starts to shift to the atomics in the lockless log reservation path:
>
> 14.81% [kernel] [k] __pv_queued_spin_lock_slowpath
> 7.88% [kernel] [k] xlog_grant_add_space
> 7.18% [kernel] [k] xfs_log_ticket_ungrant
> 4.82% [kernel] [k] do_raw_spin_lock
> 3.58% [kernel] [k] xlog_space_left
> 3.51% [kernel] [k] xlog_cil_commit
>
> There's still substantial spin lock contention occurring at the VFS,
> too, but this profile indicates that the multiple atomic variable
> updates per transaction reservation/commit pair are starting to reach
> scalability limits here.
>
> This is largely a re-implementation of past RFC patchsets. While
> those were good enough as proofs of concept for perf testing, they
> did not preserve transaction order correctly and failed shutdown tests all
> the time. The changes to the CIL accounting and behaviour, combined
> with the structural changes to xlog_write() in prior patchsets make
> the per-cpu restructuring possible and sane.
>
> Instead of trying to account for continuation log opheaders on a
> "growth" basis, we pre-calculate how many iclogs we'll need to write
> out a maximally sized CIL checkpoint and just reserve that space
> incrementally, one iclog header per commit, until the CIL has a full
> reservation. If we ever run a commit when we are already at the hard
> limit (because of post-throttling) we simply take an extra reservation
> from each commit that is run while over the limit. Hence we don't need to do
> space usage math in the fast path and so never need to sum the
> per-cpu counters in this path.
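A minimal user-space sketch of this reservation scheme follows; the names and sizes (ICLOG_HDR_SIZE, MAX_CKPT_ICLOGS, cil_hdr_reserve) are made up for illustration and do not match the kernel code:

```c
/*
 * Sketch: pre-calculated iclog header reservation. We know up front how
 * many iclog headers a maximally sized checkpoint can need, and each
 * commit donates one header's worth of reservation until the CIL holds
 * the full amount. After that, the fast path does no space math at all.
 */
#include <assert.h>

#define ICLOG_HDR_SIZE		512	/* illustrative header size */
#define MAX_CKPT_ICLOGS		8	/* precomputed: iclogs per max checkpoint */

struct cil_res {
	int	hdrs_reserved;		/* iclog headers already covered */
};

/*
 * Per-commit hook: if the CIL does not yet hold a full checkpoint's
 * worth of header reservation, take one header's worth from this
 * commit's ticket. Returns the bytes taken from the ticket.
 */
static int cil_hdr_reserve(struct cil_res *res)
{
	if (res->hdrs_reserved >= MAX_CKPT_ICLOGS)
		return 0;	/* fully reserved: fast path does nothing */
	res->hdrs_reserved++;
	return ICLOG_HDR_SIZE;
}
```

Once MAX_CKPT_ICLOGS commits have run, every subsequent commit pays nothing, which is what keeps the summing of per-cpu counters out of the fast path.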
>
> Similarly, per-cpu lists have the problem of ordering - we can't
> remove an item from a per-cpu list if we want to move it forward in
> the CIL. We solve this problem by using an atomic counter to give
> every commit a sequence number that is copied into the log items in
> that transaction. Hence relogging items just overwrites the sequence
> number in the log item, and does not move it in the per-cpu lists.
> Once we reaggregate the per-cpu lists back into a single list in the
> CIL push work, we can run it through list_sort() and reorder it back
> into a globally ordered list. This costs a bit of CPU time, but now
> that the CIL can run multiple works and pipelines properly, this is
> not a limiting factor for performance. It does increase fsync
> latency when the CIL is full, but workloads issuing large numbers of
> fsync()s or sync transactions end up with very small CILs and so the
> latency impact of sorting is not measurable for such workloads.
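The order-ID scheme above can be illustrated with a small user-space sketch. The kernel sorts linked lists with list_sort(); here an array plus qsort() shows the same idea, and all names are hypothetical:

```c
/*
 * Sketch: restoring global commit order after splicing per-cpu lists.
 * A log item carries the order ID of the last transaction that touched
 * it; relogging just overwrites the ID in place and never moves the
 * item between per-cpu lists. The push work sorts on the IDs.
 */
#include <assert.h>
#include <stdlib.h>

struct log_item {
	unsigned long	order_id;	/* sequence number of last commit */
};

static int item_cmp(const void *a, const void *b)
{
	const struct log_item *ia = a, *ib = b;

	if (ia->order_id < ib->order_id)
		return -1;
	if (ia->order_id > ib->order_id)
		return 1;
	return 0;
}

/*
 * Push work: after aggregating the unordered per-cpu lists into one
 * list, restore global commit order by sorting on the order IDs.
 */
static void cil_reorder(struct log_item *items, size_t n)
{
	qsort(items, n, sizeof(items[0]), item_cmp);
}
```

Because the ID is assigned from a single atomic counter at commit time, sorting on it reproduces the original global commit order regardless of which per-cpu list each item landed on.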
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-cil-scale-3
>
> Version 6:
> - split out from aggregated patchset
> - rebase on linux-xfs/for-next + dgc/xlog-write-rework
>
> Version 5:
> - https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/
>
>
--
Dave Chinner
david@fromorbit.com
Thread overview: 16+ messages
2021-11-09 1:52 [PATCH 00/14 v6] xfs: improve CIL scalability Dave Chinner
2021-11-09 1:52 ` [PATCH 01/14] xfs: use the CIL space used counter for emptiness checks Dave Chinner
2021-11-09 1:52 ` [PATCH 02/14] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
2021-11-09 1:52 ` [PATCH 03/14] xfs: rework per-iclog header CIL reservation Dave Chinner
2021-11-09 1:52 ` [PATCH 04/14] xfs: introduce per-cpu CIL tracking structure Dave Chinner
2021-11-09 1:52 ` [PATCH 05/14] xfs: implement percpu cil space used calculation Dave Chinner
2021-11-09 1:52 ` [PATCH 06/14] xfs: track CIL ticket reservation in percpu structure Dave Chinner
2021-11-09 1:52 ` [PATCH 07/14] xfs: convert CIL busy extents to per-cpu Dave Chinner
2021-11-09 1:52 ` [PATCH 08/14] xfs: Add order IDs to log items in CIL Dave Chinner
2021-11-09 1:52 ` [PATCH 09/14] xfs: convert CIL to unordered per cpu lists Dave Chinner
2021-11-09 1:52 ` [PATCH 10/14] xfs: convert log vector chain to use list heads Dave Chinner
2021-11-09 1:52 ` [PATCH 11/14] xfs: move CIL ordering to the logvec chain Dave Chinner
2021-11-09 1:52 ` [PATCH 12/14] xfs: avoid cil push lock if possible Dave Chinner
2021-11-09 1:52 ` [PATCH 13/14] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
2021-11-09 1:52 ` [PATCH 14/14] xfs: expanding delayed logging design with background material Dave Chinner
2021-11-18 23:15 ` Dave Chinner [this message]