From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [GIT PULL] xfs: Improve CIL scalability
Date: Thu, 11 May 2023 18:28:01 -0700
Message-ID: <20230512012801.GI858799@frogsfrogsfrogs>
In-Reply-To: <20220707233347.GO227878@dread.disaster.area>
On Fri, Jul 08, 2022 at 09:33:47AM +1000, Dave Chinner wrote:
> Hi Darrick,
>
> Can you please pull the CIL scalability improvements for 5.20 from
> the tag below? This branch is based on the linux-xfs/for-next branch
> as of 2 days ago, so should apply without any merge issues at all.
>
> Cheers,
>
> Dave.
>
> The following changes since commit 7561cea5dbb97fecb952548a0fb74fb105bf4664:
>
> xfs: prevent a UAF when log IO errors race with unmount (2022-07-01 09:09:52 -0700)
>
> are available in the Git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-cil-scale-5.20
>
> for you to fetch changes up to 51a117edff133a1ea8cb0fcbc599b8d5a34414e9:
>
> xfs: expanding delayed logging design with background material (2022-07-07 18:56:09 +1000)
>
> ----------------------------------------------------------------
> xfs: improve CIL scalability
>
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
>
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.
FWIW, I (rather infrequently) see things like this in the 10 months or
so that this has been in mainline:
run fstests generic/650 at 2023-05-10 19:17:09
XFS (sda3): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
XFS (sda3): Mounting V5 Filesystem 75c42b12-8a39-4ecd-aac4-6b6ab0e384bd
XFS (sda3): Ending clean mount
smpboot: CPU 1 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: CPU 1 is now offline
smpboot: CPU 3 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 3 is now offline
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 2 is now offline
smpboot: CPU 3 is now offline
XFS (sda3): ctx ticket reservation ran out. Need to up reservation
XFS (sda3): ticket reservation summary:
XFS (sda3): unit res = 9268 bytes
XFS (sda3): current res = -40 bytes
XFS (sda3): original count = 1
XFS (sda3): remaining count = 1
XFS (sda3): Filesystem has been shut down due to log error (0x2).
XFS (sda3): Please unmount the filesystem and rectify the problem(s).
Not sure what that's about, but given the recent discussions about
percpu counters not quite working correctly when racing with cpu
hotremove, I figured this would be a good time to capture one of the
failures and report it to the list.
--D
> This results in transaction commit rates exceeding 1.4 million
> commits/s under certain unlink workloads, and while the log lock
> contention is largely gone there is still significant lock
> contention in the VFS (dentry cache, inode cache and security layers)
> at >600,000 transactions/s that still limits scalability.
>
> The changes to the CIL accounting and behaviour, combined with the
> structural changes to xlog_write() in prior patchsets make the
> per-cpu restructuring possible and sane. This allows us to move to
> precalculated reservation requirements that allow for reservation
> stealing to be accounted across multiple CPUs accurately.
>
> That is, instead of trying to account for continuation log opheaders
> on a "growth" basis, we pre-calculate how many iclogs we'll need to
> write out a maximally sized CIL checkpoint and steal that reserved
> space one commit at a time until the CIL has a full
> reservation. If we ever run a commit when we are already at the hard
> limit (because post-throttling) we simply take an extra reservation
> from each commit that is run when over the limit. Hence we don't
> need to do space usage math in the fast path and so never need to
> sum the per-cpu counters in this fast path.
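To make sure I'm reading the paragraph above correctly, here is a minimal
userspace sketch of the stealing scheme as described. It is not the XFS
code: struct cil_model, cil_model_init() and cil_model_commit() are
hypothetical names, the header cost is an assumed constant, and each commit
is assumed to give up one header's worth of space at a time.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names come from XFS. */
struct cil_model {
	atomic_int  hdrs_needed;   /* iclog headers still to be reserved */
	atomic_long stolen_bytes;  /* reservation taken from commits so far */
	int         hdr_bytes;     /* assumed cost of one iclog header */
};

/* Precalculate how many iclogs a maximally sized checkpoint would span. */
static void cil_model_init(struct cil_model *cil, long max_cil_bytes,
			   long iclog_bytes, int hdr_bytes)
{
	int n = (int)((max_cil_bytes + iclog_bytes - 1) / iclog_bytes);

	atomic_init(&cil->hdrs_needed, n);
	atomic_init(&cil->stolen_bytes, 0);
	cil->hdr_bytes = hdr_bytes;
}

/*
 * Per-commit fast path. Steal one header's worth of space from the
 * committing transaction while the checkpoint is still short of its
 * precalculated total, or unconditionally when the caller reports that
 * the CIL is over its hard limit. No space usage is summed here.
 * The check and the decrement are separate atomics, so concurrent
 * callers may over-steal slightly; this model tolerates that.
 * Returns the number of bytes taken from the transaction.
 */
static long cil_model_commit(struct cil_model *cil, bool over_hard_limit)
{
	if (atomic_load(&cil->hdrs_needed) <= 0 && !over_hard_limit)
		return 0;

	atomic_fetch_sub(&cil->hdrs_needed, 1);
	atomic_fetch_add(&cil->stolen_bytes, cil->hdr_bytes);
	return cil->hdr_bytes;
}
```

The only shared state touched per commit in this model is one countdown,
which is the property the description relies on to avoid summing per-cpu
counters in the fast path.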
>
> Similarly, per-cpu lists have the problem of ordering - we can't
> remove an item from a per-cpu list if we want to move it forward in
> the CIL. We solve this problem by using an atomic counter to give
> every commit a sequence number that is copied into the log items in
> that transaction. Hence relogging items just overwrites the sequence
> number in the log item, and does not move it in the per-cpu lists.
> Once we reaggregate the per-cpu lists back into a single list in the
> CIL push work, we can run it through list_sort() and reorder it back
> into a globally ordered list. This costs a bit of CPU time, but now
> that the CIL can run multiple works and pipelines properly, this is
> not a limiting factor for performance. It does increase fsync
> latency when the CIL is full, but workloads issuing large numbers of
> fsync()s or sync transactions end up with very small CILs and so the
> latency impact of sorting is not measurable for such workloads.
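The ordering scheme above can be sketched in userspace C roughly as
follows. This is only an illustration under stated assumptions: the names
(log_item, percpu_list, commit_new_item, relog_item, push_and_sort) are
hypothetical, each commit is assumed to log a single item, the per-cpu
lists are plain arrays indexed by a caller-supplied CPU number, and qsort()
stands in for the kernel's list sort.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS   4	/* illustrative values, not kernel constants */
#define MAX_ITEMS 16

struct log_item {
	int      id;
	uint64_t order_id;	/* sequence number of the last commit */
};

struct percpu_list {
	struct log_item *items[MAX_ITEMS];
	int              count;
};

static atomic_uint_fast64_t commit_seq;	/* global commit counter */
static struct percpu_list cil_lists[NR_CPUS];

/* First commit of an item: stamp it, append to this CPU's unordered list. */
static int commit_new_item(int cpu, struct log_item *lip)
{
	struct percpu_list *l = &cil_lists[cpu];

	if (l->count == MAX_ITEMS)
		return -1;
	lip->order_id = atomic_fetch_add(&commit_seq, 1) + 1;
	l->items[l->count++] = lip;
	return 0;
}

/* Relog: overwrite the stamp only; the item stays on whatever list it is on. */
static void relog_item(struct log_item *lip)
{
	lip->order_id = atomic_fetch_add(&commit_seq, 1) + 1;
}

static int cmp_order(const void *a, const void *b)
{
	const struct log_item *x = *(struct log_item *const *)a;
	const struct log_item *y = *(struct log_item *const *)b;

	return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

/* Push: splice every per-cpu list into one array, then restore global order. */
static int push_and_sort(struct log_item **out)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (int i = 0; i < cil_lists[cpu].count; i++)
			out[n++] = cil_lists[cpu].items[i];
		cil_lists[cpu].count = 0;
	}
	qsort(out, n, sizeof(out[0]), cmp_order);
	return n;
}
```

In this model an item that was committed first and relogged last sorts to
the end after the push, without ever having been moved between lists.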
>
> Overall, this pushes the transaction commit bottleneck out to the
> lockless reservation grant head updates. These atomic updates don't
> start to be a limiting factor until > 1.5 million transactions/s are
> being run, at which point the accounting functions start to show up
> in profiles as the highest CPU users. Still, this series doubles
> transaction throughput without increasing CPU usage before we get
> to that cacheline contention breakdown point...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> ----------------------------------------------------------------
> Dave Chinner (14):
> xfs: use the CIL space used counter for emptiness checks
> xfs: lift init CIL reservation out of xc_cil_lock
> xfs: rework per-iclog header CIL reservation
> xfs: introduce per-cpu CIL tracking structure
> xfs: implement percpu cil space used calculation
> xfs: track CIL ticket reservation in percpu structure
> xfs: convert CIL busy extents to per-cpu
> xfs: Add order IDs to log items in CIL
> xfs: convert CIL to unordered per cpu lists
> xfs: convert log vector chain to use list heads
> xfs: move CIL ordering to the logvec chain
> xfs: avoid cil push lock if possible
> xfs: xlog_sync() manually adjusts grant head space
> xfs: expanding delayed logging design with background material
>
> Documentation/filesystems/xfs-delayed-logging-design.rst | 361 +++++++++++++++++++++++++++++++++++++++++++++++------
> fs/xfs/xfs_log.c | 55 ++++++---
> fs/xfs/xfs_log.h | 3 +-
> fs/xfs/xfs_log_cil.c | 472 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
> fs/xfs/xfs_log_priv.h | 58 ++++++---
> fs/xfs/xfs_super.c | 1 +
> fs/xfs/xfs_trans.c | 4 +-
> fs/xfs/xfs_trans.h | 1 +
> fs/xfs/xfs_trans_priv.h | 3 +-
> 9 files changed, 768 insertions(+), 190 deletions(-)
>
> --
> Dave Chinner
> david@fromorbit.com