From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
linux-xfs@vger.kernel.org, Dave Chinner <dchinner@redhat.com>,
"Darrick J. Wong" <djwong@kernel.org>
Subject: Re: Performance regression with async inode inactivation
Date: Wed, 13 Oct 2021 11:51:39 +0200
Message-ID: <20211013095139.GC19200@quack2.suse.cz>
In-Reply-To: <20211012212339.GQ2361455@dread.disaster.area>
On Wed 13-10-21 08:23:39, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote:
> > On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> > > Remove the fsync and we do have a real world workload - temporary
> > > files for compiles, etc. Then everything should mostly get logged
> > > in the same CIL context because all transactions are run
> > > asynchronously and aggregate in memory until the CIL is full and
> > > checkpoints itself. Hence log writes shouldn't change very much at
> > > all for such workloads.
> >
> > OK, that makes sense. Thanks for the explanation. To verify your theory,
> > I've removed fsync(2) from the test program, so now it is a pure create,
> > write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
> > loops of create-unlink to increase runtime (the workload does barely any
> > writes, so it should not matter wrt the log):
> >
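(For reference, each of the 48 workers boils down to the loop below. This
is a minimal sketch of the workload as described above, not the actual
stress-unlink source; the file naming and the 4k write size are made up
for illustration:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * One worker process: create, write, unlink, with no fsync(2),
     * so nothing here forces a synchronous log write.
     */
    int main(int argc, char **argv)
    {
            const char *dir;
            char path[4096];
            char buf[4096] = { 0 };
            int i, fd;

            if (argc < 2)
                    exit(1);
            dir = argv[1];                          /* e.g. /mnt */
            snprintf(path, sizeof(path), "%s/file-%d", dir, (int)getpid());
            for (i = 0; i < 5000; i++) {            /* 5000 create-unlink loops */
                    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                    if (fd < 0)
                            exit(1);
                    write(fd, buf, sizeof(buf));    /* tiny write, barely any IO */
                    close(fd);
                    unlink(path);                   /* inode now needs inactivation */
            }
            return 0;
    }
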
> > default mkfs params:
> > AVG STDDEV
> > before 2.0380 0.1597
> > after 2.7356 0.4712
> >
> > agcount=96, log size 512M
> > AVG STDDEV
> > before 1.0610 0.0227
> > after 1.2508 0.0218
> >
> > So still a notable regression with async inactivation. With default mkfs
> > params we'd need more runs to get reliable results (note the rather high
> > standard deviation), but with the high AG count the results show a pretty
> > stable ~20% regression ((1.2508 - 1.0610) / 1.0610 ≈ 17.9%) - so let's
> > have a look at that.
> >
> > Looking at xfs stats, there are barely any differences between before and
> > after - 'after' writes a bit more to the log, but it is ~1.5 MB over the
> > whole benchmark run, altogether spending some 8ms doing IO, so that's not
> > it. Generally the workload seems to be CPU / memory bound now (it does
> > barely any IO). Perf shows we are heavily contending on some spinlock in
> > xfs_cil_commit() - I presume this is the xc_cil_lock.
>
> Yes, and I have patches that fix this. It got reverted before release
> because it exposed a bunch of underlying zero-day bugs in the log
> code, and I haven't had time to run it through the review cycle
> again, even though it's pretty much unchanged from commits 26-39 in
> this series:
>
> https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/
>
> The profiles in this patch demonstrate the problem and the fix:
>
> https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/
>
> I did all my perf testing of inode inactivation with the CIL
> scalability patches also installed, because deferred inode
> inactivation only made contention on the CIL lock worse in my perf
> tests. We simply can't evaluate the benefit of a change when the
> system is easily driven into catastrophic lock breakdown by user
> level operational concurrency.
>
> IOWs, the CIL lock is the global limiting factor for async
> transaction commit rates on large CPU count machines, and things
> that remove bottlenecks in higher layers often just increase
> contention on this lock and drive it into breakdown. That makes perf
> go backwards, not forwards, and it's not the fault of the high level
> change being made. That doesn't make the high level change wrong, it
> just means we need to peel the onion further before the improvements
> are fully realised.
OK, understood.
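
Just to check I understand the contention: conceptually, every async
transaction commit inserts its log items into the CIL under a single
per-mount spinlock, so all the CPUs serialise there, and moving the
aggregation to per-CPU structures (as I gather your series does) takes
the global lock off the commit fast path. A simplified illustration
only, not the actual kernel code:

    #include <linux/list.h>
    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    /* Today: every commit serialises on one per-mount spinlock. */
    static DEFINE_SPINLOCK(xc_cil_lock);
    static LIST_HEAD(xc_cil);               /* items awaiting checkpoint */

    static void cil_insert_items(struct list_head *items)
    {
            spin_lock(&xc_cil_lock);        /* the hot lock in the profiles */
            list_splice_tail_init(items, &xc_cil);
            spin_unlock(&xc_cil_lock);
    }

    /*
     * Per-CPU aggregation (roughly the shape of the fix, as I
     * understand it): commits stay CPU-local, and the global lock is
     * only needed when a checkpoint gathers everything up. The
     * per-CPU list heads are assumed to be initialised elsewhere.
     */
    static DEFINE_PER_CPU(struct list_head, cil_pcp);

    static void cil_insert_items_pcp(struct list_head *items)
    {
            struct list_head *local = get_cpu_ptr(&cil_pcp);

            list_splice_tail_init(items, local);    /* no global lock */
            put_cpu_ptr(&cil_pcp);
    }
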
> > This actually happens
> > both before and after, but we seem to spend some more time there with async
> > inactivation. Likely this is related to work being done from worker
> > threads. Perf stats for comparison:
> >
> > before
> > 51,135.08 msec cpu-clock # 47.894 CPUs utilized
> > 4,699 context-switches # 0.092 K/sec
> > 382 cpu-migrations # 0.007 K/sec
> > 1,228 page-faults # 0.024 K/sec
> > 128,884,972,351 cycles # 2.520 GHz
> > 38,517,767,839 instructions # 0.30 insn per cycle
> > 8,337,611,468 branches # 163.051 M/sec
> > 39,749,736 branch-misses # 0.48% of all branches
> > 25,225,109 cache-misses
> >
> > 1.067666170 seconds time elapsed
> >
> > after
> > 65,353.43 msec cpu-clock # 47.894 CPUs utilized
> > 43,737 context-switches # 0.669 K/sec
> > 1,824 cpu-migrations # 0.028 K/sec
> > 1,953 page-faults # 0.030 K/sec
> > 155,144,150,867 cycles # 2.374 GHz
> > 45,280,145,337 instructions # 0.29 insn per cycle
> > 10,027,567,384 branches # 153.436 M/sec
> > 39,554,691 branch-misses # 0.39% of all branches
> > 30,203,567 cache-misses
> >
> > 1.364539400 seconds time elapsed
> >
> > So we can see a huge increase in context switches, a notable increase in
> > cache misses, and a decrease in cycles/s, so perhaps we are bouncing cache
> > lines more? Anyway, I guess this is kind of expected given the nature of
> > async inactivation; I just wanted to highlight that there are regressions
> > without fsync in the game as well.
>
> Context switches are largely noise - they are most likely just AGI
> locks being bounced a bit more. It's the spinlock contention that is
> the likely issue here. For example, on my 32p machine running an
> fsync-less, 5000-iteration test on vanilla 5.15-rc4:
>
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 1.290
>
> Performance counter stats for './stress-unlink 32 /mnt/scratch':
>
> 16,856.61 msec task-clock # 12.595 CPUs utilized
> 48,297 context-switches # 0.003 M/sec
> 4,219 cpu-migrations # 0.250 K/sec
> 1,373 page-faults # 0.081 K/sec
> 39,254,798,526 cycles # 2.329 GHz
> 16,460,808,349 instructions # 0.42 insn per cycle
> 3,475,251,228 branches # 206.166 M/sec
> 12,129,889 branch-misses # 0.35% of all branches
>
> 1.338312347 seconds time elapsed
>
> 0.186554000 seconds user
> 17.247176000 seconds sys
>
> And with 5.15-rc4 + CIL scalability:
>
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 0.894
>
> Performance counter stats for './stress-unlink 32 /mnt/scratch':
>
> 12,917.93 msec task-clock # 13.805 CPUs utilized
> 39,680 context-switches # 0.003 M/sec
> 2,737 cpu-migrations # 0.212 K/sec
> 1,402 page-faults # 0.109 K/sec
> 30,920,293,752 cycles # 2.394 GHz
> 14,472,067,501 instructions # 0.47 insn per cycle
> 2,700,978,247 branches # 209.087 M/sec
> 9,287,754 branch-misses # 0.34% of all branches
>
> 0.935710173 seconds time elapsed
>
> 0.192467000 seconds user
> 13.245977000 seconds sys
>
> Runtime of the fsync-less, 5,000 iteration version drops from 1.29s
> to 0.89s, IPC goes up, branches and branch-misses go down, context
> switches only go down slightly, etc. IOWs, when you take away the
> CIL lock contention, we get back all that perf loss and then some...
Nice results!
> FWIW, let's really hammer it for a long while. Vanilla 5.15-rc4:
>
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 38.881
>
> Performance counter stats for './stress-unlink 1000 /mnt/scratch':
>
> 733,741.06 msec task-clock # 16.004 CPUs utilized
> 13,131,968 context-switches # 0.018 M/sec
> 1,302,636 cpu-migrations # 0.002 M/sec
> 40,720 page-faults # 0.055 K/sec
> 1,195,192,185,398 cycles # 1.629 GHz
> 643,382,890,656 instructions # 0.54 insn per cycle
> 129,065,409,600 branches # 175.900 M/sec
> 768,146,988 branch-misses # 0.60% of all branches
>
> 45.847750477 seconds time elapsed
>
> 11.194020000 seconds user
> 758.762376000 seconds sys
>
> And the transaction rate is pinned at 800,000/s for the entire test.
> We're running at the limit of the CIL lock here.
>
> With the CIL scalability patchset:
>
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 28.263
>
> Performance counter stats for './stress-unlink 1000 /mnt/scratch':
>
> 450,547.80 msec task-clock # 15.038 CPUs utilized
> 5,949,268 context-switches # 0.013 M/sec
> 868,887 cpu-migrations # 0.002 M/sec
> 41,570 page-faults # 0.092 K/sec
> 836,933,822,425 cycles # 1.858 GHz
> 536,132,522,275 instructions # 0.64 insn per cycle
> 99,264,579,180 branches # 220.320 M/sec
> 506,921,132 branch-misses # 0.51% of all branches
>
> 29.961492616 seconds time elapsed
>
> 7.796758000 seconds user
> 471.990545000 seconds sys
>
> A ~30% reduction in runtime, because the transaction rate is now
> running at 1.1M/s. Improvements in code execution across the board
> here, so the problem clearly isn't the deferred inode inactivation.
>
> IOWs, I'm largely not concerned about the high-CPU-count perf
> regressions that we are seeing from the log code these days - the fix
> is essentially ready, it's just lacking the available engineering
> time to get it over the line and merged right now...
OK, thanks for the explanation!
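
BTW, in case it is useful for reproducing the transaction rates you
quote: they can be sampled from /proc/fs/xfs/stat, whose "trans" line
holds the sync, async and empty transaction counts. A quick sketch (my
own tooling, nothing official):

    #include <stdio.h>
    #include <unistd.h>

    /* Sum the three counters on the "trans" line of /proc/fs/xfs/stat. */
    static unsigned long long read_trans(void)
    {
            unsigned long long s = 0, a = 0, e = 0;
            char line[256];
            FILE *f = fopen("/proc/fs/xfs/stat", "r");

            if (!f)
                    return 0;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "trans %llu %llu %llu", &s, &a, &e) == 3)
                            break;
            fclose(f);
            return s + a + e;
    }

    int main(void)
    {
            unsigned long long prev = read_trans(), cur;

            for (;;) {              /* print transactions/s once a second */
                    sleep(1);
                    cur = read_trans();
                    printf("%llu transactions/s\n", cur - prev);
                    prev = cur;
            }
            return 0;
    }
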
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR