From: Ingo Molnar <mingo@kernel.org>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@intel.com>,
x86@kernel.org, linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Linus Torvalds <torvalds@linuxfoundation.org>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns
Date: Wed, 5 Mar 2025 21:20:05 +0100
Message-ID: <Z8ix9YQEIdyAopCw@gmail.com>
In-Reply-To: <Z8isNxBxC9pcG4KL@gmail.com>
* Ingo Molnar <mingo@kernel.org> wrote:
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > On Sat, Mar 1, 2025 at 1:38 PM Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > On Sat, Mar 01, 2025 at 10:05:56AM +0100, Uros Bizjak wrote:
> > > > OTOH, -Os, where different code size/performance heuristics are used, now
> > > > performs better w.r.t code size.
> > >
> > > Did anything change since:
> > >
> > > 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE")
> > > 3a55fb0d9fe8 ("Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE")
> > >
> > > wrt -Os?
> > >
> > > Because if not, we still don't love -Os and you can drop the -Os argument.
> >
> > The -Os argument was to show the effect of the patch when the compiler
> > is instructed to take care of the overall size. Giving the compiler
> > -O2 and then looking at the overall size of the produced binary is
> > just wrong.
> >
> > > And without any perf data showing any improvement, this patch does nothing but
> > > enlarge -O2 size...
> >
> > Even to my surprise, the patch has some noticeable effects on the
> > performance, please see the attachment in [1] for LMBench data or [2]
> > for some excerpts from the data. So, I think the patch has potential
> > to improve the performance.
> >
> > [1] https://lore.kernel.org/lkml/CAFULd4YBcG45bigHBox2pu+To+Y5BzbRxG+pUr42AVOWSnfKsg@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAFULd4ZsSKwJ4Dz3cCAgaVsa4ypbb0e2savO-3_Ltbs=1wzgKQ@mail.gmail.com/
>
> If you are measuring micro-costs, please make sure you pin the
> workload to a single CPU (via 'taskset' for example) and run 'perf
> stat --null --repeat 5' or so to measure the run-over-run noise of
> the benchmark.

And if the benchmark is context-switching heavy, you'll want to use the
'perf stat -a' option to avoid PMU context-switching costs, and the
-C option to measure only on the pinned CPU.

For example, to measure pipe handling overhead, the naive measurement is:
starship:~> perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 6.939 [sec]
6.939128 usecs/op
144110 ops/sec
starship:~> perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 6.879 [sec]
6.879282 usecs/op
145364 ops/sec
See how the run-to-run noise is 0.9%?
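
To put a number on that, here is a minimal sketch of the arithmetic. The helper name pct_diff is made up for illustration, and it takes the second value as the baseline, which is just one reasonable convention:

```shell
# pct_diff A B: print how much larger A is than B, as a percentage of B.
# (Hypothetical helper, not part of perf.)
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) / b * 100 }'
}

# usecs/op of the two naive runs above
pct_diff 6.939128 6.879282
```

This prints 0.87, i.e. the roughly 0.9% quoted above.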
If we measure it naively with perf stat, we get:
starship:~> perf stat perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 11.870 [sec]
11.870403 usecs/op
84243 ops/sec
Performance counter stats for 'perf bench sched pipe':
10,722.04 msec task-clock # 0.903 CPUs utilized
2,000,093 context-switches # 186.540 K/sec
499 cpu-migrations # 46.540 /sec
1,482 page-faults # 138.220 /sec
27,853,380,218 cycles # 2.598 GHz
18,434,409,889 stalled-cycles-frontend # 66.18% frontend cycles idle
24,277,227,239 instructions # 0.87 insn per cycle
# 0.76 stalled cycles per insn
5,001,727,980 branches # 466.490 M/sec
572,756,283 branch-misses # 11.45% of all branches
11.875458968 seconds time elapsed
0.271152000 seconds user
11.272766000 seconds sys
See how the usecs/op increased by +70% due to PMU switching overhead?
With --null we can reduce the PMU switching overhead by only measuring
elapsed time:
starship:~> perf stat --null perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 6.916 [sec]
6.916700 usecs/op
144577 ops/sec
Performance counter stats for 'perf bench sched pipe':
6.921547909 seconds time elapsed
0.341734000 seconds user
6.215287000 seconds sys
But noise is still high:
starship:~> perf stat --null --repeat 5 perf bench sched pipe
6.854731 usecs/op
7.082047 usecs/op
7.087193 usecs/op
6.934439 usecs/op
7.056695 usecs/op
...
Performance counter stats for 'perf bench sched pipe' (5 runs):
7.0093 +- 0.0463 seconds time elapsed ( +- 0.66% )
Likely due to the tasks migrating semi-randomly among cores.
We can pin them down to a single CPU (CPU2 in this case) via taskset:
starship:~> taskset 4 perf stat --null --repeat 5 perf bench sched pipe
5.575906 usecs/op
5.637112 usecs/op
5.532060 usecs/op
5.703270 usecs/op
5.506517 usecs/op
Performance counter stats for 'perf bench sched pipe' (5 runs):
5.5929 +- 0.0359 seconds time elapsed ( +- 0.64% )
Note how performance increased by ~25%, due to lack of migration, but
noise is still a bit high.
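
A side note on the taskset argument: it is a hexadecimal bitmask, so CPU N corresponds to bit N. A small sketch of the conversion (cpu_to_mask is a hypothetical helper name; 'taskset -c 2' with a CPU list should be an equivalent way to select CPU 2):

```shell
# cpu_to_mask N: print the hex affinity mask that selects only CPU N.
# (Hypothetical helper for illustration.)
cpu_to_mask() {
    printf '0x%x\n' $((1 << $1))
}

cpu_to_mask 2    # bit 2 set, the mask used for CPU2 above
```

For CPU 2 this prints 0x4.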
A good way to reduce noise is to measure instructions only:
starship:~> taskset 0x4 perf stat -e instructions --repeat 5 perf bench sched pipe
6.962279 usecs/op
6.917374 usecs/op
6.928672 usecs/op
6.939555 usecs/op
6.942980 usecs/op
Performance counter stats for 'perf bench sched pipe' (5 runs):
32,561,773,780 instructions ( +- 0.27% )
6.93977 +- 0.00735 seconds time elapsed ( +- 0.11% )
'Number of instructions executed' is an imperfect proxy for overhead.
(Not every instruction has the same overhead - but for compiler code
generation it's a useful proxy in most cases.)
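
One way to make the instruction count easier to compare between kernels is to normalize it per operation. A rough sketch, assuming the count covers everything perf stat attributed to the command (including benchmark startup), so the result is only approximate; insns_per_op is a made-up helper name:

```shell
# insns_per_op COUNT OPS: divide a perf-style instruction count
# (thousands separators allowed) by the number of benchmark operations.
# (Hypothetical helper for illustration.)
insns_per_op() {
    insns=$(printf '%s' "$1" | tr -d ',')
    awk -v i="$insns" -v n="$2" 'BEGIN { printf "%.0f\n", i / n }'
}

# instruction count from the run above, 1000000 pipe operations
insns_per_op 32,561,773,780 1000000
```

That comes out to about 32562 instructions per pipe operation for the run above.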
But the best approach is to avoid the PMU switching overhead via the
'-a' option and to limit the measurement to the saturated CPU#2:
starship:~> taskset 0x4 perf stat -a -C 2 -e instructions --repeat 5 perf bench sched pipe
5.808068 usecs/op
5.843716 usecs/op
5.826543 usecs/op
5.801616 usecs/op
5.793129 usecs/op
Performance counter stats for 'system wide' (5 runs):
32,244,691,275 instructions ( +- 0.21% )
5.81624 +- 0.00912 seconds time elapsed ( +- 0.16% )
Note how this measurement provides the highest performance for the
workload, almost as good as --null.
- Beware the difference in CPU mask parameters between taskset and perf
stat: taskset takes a hexadecimal bitmask (0x4 selects CPU2), while
'perf stat -C' takes a CPU list (2). (I tried to convince the perf
tooling people to integrate CPU-pinning into perf stat, but I digress.)
- Beware of cpufreq considerations: changing CPU frequencies will skew
your workload's performance by a lot more than the 0.1% kind of
noise we are trying to gun for. Setting your CPU governor to
'performance' will eliminate some (but not all) cpufreq artifacts.
- On modern systems there's also boot-to-boot variance of key data
structure alignment and cache access patterns, which can sometimes
rise above the noise of the measurement. These can send you on a
wild goose chase ...
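
Since the taskset mask and the perf stat CPU list are easy to get out of sync, a small wrapper can derive both from one CPU number. This is only a sketch: pinned_perf_cmd is a hypothetical helper, and it prints the command line instead of running it so that it can be inspected first:

```shell
# pinned_perf_cmd CPU CMD...: print (not run) a pinned, system-wide
# instruction-count measurement of CMD restricted to one CPU.
# (Hypothetical helper; the flags are the ones used earlier in this mail.)
pinned_perf_cmd() {
    cpu=$1
    shift
    mask=$(printf '0x%x' $((1 << cpu)))   # taskset wants a hex bitmask
    echo "taskset $mask perf stat -a -C $cpu -e instructions --repeat 5 $*"
}

pinned_perf_cmd 2 perf bench sched pipe
```

For CPU 2 this reproduces the 'taskset 0x4 perf stat -a -C 2 ...' invocation used above.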
Finally, you can use something like 'nice -n -10' to increase the
priority of your benchmark and reduce the impact of other workloads
running on your system.
Anyway, I think noise levels of around 0.1%-0.2% are about the best you
can expect in context-switch heavy workloads. (A bit better in
CPU-bound workloads with low context switching.)
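
To check a noise figure by hand, the per-run usecs/op values can be fed through a few lines of awk. This sketch assumes the '+-' figure is the standard error of the mean (sample standard deviation divided by sqrt(n)); noise_pct is a made-up helper name:

```shell
# noise_pct: read one number per line on stdin; print the mean and the
# standard error of the mean as a percentage of the mean.
# (Hypothetical helper for illustration.)
noise_pct() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
             if (var < 0) var = 0                        # guard against rounding
             printf "%.4f +- %.2f%%\n", mean, sqrt(var / n) / mean * 100
         }'
}

# usecs/op of the five '-a -C 2' runs above
printf '%s\n' 5.808068 5.843716 5.826543 5.801616 5.793129 | noise_pct
```

For those five runs this gives a mean of 5.8146 usecs/op with 0.16% noise, in line with the elapsed-time figure perf stat reported.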
Thanks,
Ingo