Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns

From: Ingo Molnar <mingo@kernel.org>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linus Torvalds <torvalds@linuxfoundation.org>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns
Date: Wed, 5 Mar 2025 21:20:05 +0100
Message-ID: <Z8ix9YQEIdyAopCw@gmail.com>
In-Reply-To: <Z8isNxBxC9pcG4KL@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > On Sat, Mar 1, 2025 at 1:38 PM Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > On Sat, Mar 01, 2025 at 10:05:56AM +0100, Uros Bizjak wrote:
> > > > OTOH, -Os, where different code size/performance heuristics are used, now
> > > > performs better w.r.t code size.
> > >
> > > Did anything change since:
> > >
> > > 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE")
> > > 3a55fb0d9fe8 ("Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE")
> > >
> > > wrt -Os?
> > >
> > > Because if not, we still don't love -Os and you can drop the -Os argument.
> > 
> > The -Os argument was to show the effect of the patch when the compiler
> > is instructed to take care of the overall size. Giving the compiler
> > -O2 and then looking at the overall size of the produced binary is
> > just wrong.
> > 
> > > And without any perf data showing any improvement, this patch does nothing but
> > > enlarge -O2 size...
> > 
> > Even to my surprise, the patch has some noticeable effects on the
> > performance, please see the attachment in [1] for LMBench data or [2]
> > for some excerpts from the data. So, I think the patch has potential
> > to improve the performance.
> > 
> > [1] https://lore.kernel.org/lkml/CAFULd4YBcG45bigHBox2pu+To+Y5BzbRxG+pUr42AVOWSnfKsg@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAFULd4ZsSKwJ4Dz3cCAgaVsa4ypbb0e2savO-3_Ltbs=1wzgKQ@mail.gmail.com/
> 
> If you are measuring micro-costs, please make sure you pin the 
> workload to a single CPU (via 'taskset' for example) and run 'perf 
> stat --null --repeat 5' or so to measure the run-over-run noise of 
> the benchmark.

And if the benchmark is context-switching heavy, you'll want to use the 
'perf stat -a' option to avoid PMU context-switching costs, and the 
-C option to measure only on the pinned CPU.


For example, to measure pipe handling overhead, the naive measurement is:

 starship:~> perf bench sched pipe
 # Running 'sched/pipe' benchmark:
 # Executed 1000000 pipe operations between two processes

     Total time: 6.939 [sec]

       6.939128 usecs/op
         144110 ops/sec
 starship:~> perf bench sched pipe
 # Running 'sched/pipe' benchmark:
 # Executed 1000000 pipe operations between two processes

     Total time: 6.879 [sec]

       6.879282 usecs/op
         145364 ops/sec

See how the run-to-run noise is 0.9%?
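
The percentage can be reproduced with a one-line awk helper. This is 
only a sketch: 'pct_change' is a made-up name, not part of perf, and 
it simply takes the two usecs/op values printed above.

```shell
# Hypothetical helper (not part of perf): percentage change from a
# baseline value to a new value, rounded to one decimal place.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", 100 * (new - base) / base }'
}

# The two plain 'perf bench sched pipe' runs above, in usecs/op:
pct_change 6.879282 6.939128    # prints 0.9
```

The same helper works for any other pair of usecs/op figures in this 
mail.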

If we measure it naively with perf stat, we get:

 starship:~> perf stat perf bench sched pipe
 # Running 'sched/pipe' benchmark:
 # Executed 1000000 pipe operations between two processes

     Total time: 11.870 [sec]

      11.870403 usecs/op
          84243 ops/sec

 Performance counter stats for 'perf bench sched pipe':

         10,722.04 msec task-clock                       #    0.903 CPUs utilized             
         2,000,093      context-switches                 #  186.540 K/sec                     
               499      cpu-migrations                   #   46.540 /sec                      
             1,482      page-faults                      #  138.220 /sec                      
    27,853,380,218      cycles                           #    2.598 GHz                       
    18,434,409,889      stalled-cycles-frontend          #   66.18% frontend cycles idle      
    24,277,227,239      instructions                     #    0.87  insn per cycle            
                                                  #    0.76  stalled cycles per insn   
     5,001,727,980      branches                         #  466.490 M/sec                     
       572,756,283      branch-misses                    #   11.45% of all branches           

      11.875458968 seconds time elapsed

       0.271152000 seconds user
      11.272766000 seconds sys

See how the usecs/op increased by +70% due to PMU switching overhead?

With --null we can reduce the PMU switching overhead by only measuring 
elapsed time:

 starship:~> perf stat --null perf bench sched pipe
 # Running 'sched/pipe' benchmark:
 # Executed 1000000 pipe operations between two processes

     Total time: 6.916 [sec]

       6.916700 usecs/op
         144577 ops/sec

  Performance counter stats for 'perf bench sched pipe':

       6.921547909 seconds time elapsed

       0.341734000 seconds user
       6.215287000 seconds sys

But noise is still high:

 starship:~> perf stat --null --repeat 5 perf bench sched pipe
       6.854731 usecs/op
       7.082047 usecs/op
       7.087193 usecs/op
       6.934439 usecs/op
       7.056695 usecs/op
 ...
 Performance counter stats for 'perf bench sched pipe' (5 runs):

            7.0093 +- 0.0463 seconds time elapsed  ( +-  0.66% )

Likely due to the tasks migrating semi-randomly among cores.

We can pin them down to a single CPU (CPU2 in this case) via taskset:

 starship:~> taskset 4 perf stat --null --repeat 5 perf bench sched pipe
       5.575906 usecs/op
       5.637112 usecs/op
       5.532060 usecs/op
       5.703270 usecs/op
       5.506517 usecs/op
 
 Performance counter stats for 'perf bench sched pipe' (5 runs):

            5.5929 +- 0.0359 seconds time elapsed  ( +-  0.64% )

Note how performance increased by ~25%, due to lack of migration, but 
noise is still a bit high.
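
The '+-' figure can be cross-checked from the per-run lines. The 
sketch below assumes that figure is the standard error of the mean, 
which is what the numbers above are consistent with. 
'summarize_usecs' is a hypothetical helper name, and it works on the 
usecs/op lines rather than on elapsed time, so small differences from 
perf's own summary are expected.

```shell
# Hypothetical helper: read 'perf bench sched pipe' output on stdin and
# report the mean usecs/op plus the relative standard error of the mean.
summarize_usecs() {
    awk '/usecs\/op/ { v[n++] = $1; sum += $1 }
         END {
             if (n < 2) { print "need at least 2 samples"; exit 1 }
             mean = sum / n
             # Sum of squared deviations from the mean.
             for (i = 0; i < n; i++) ss += (v[i] - mean) ^ 2
             # Sample standard deviation divided by sqrt(n).
             sem = sqrt(ss / (n - 1)) / sqrt(n)
             printf "mean %.6f usecs/op +- %.2f%%\n", mean, 100 * sem / mean
         }'
}

# The five pinned runs from above:
summarize_usecs <<'EOF'
       5.575906 usecs/op
       5.637112 usecs/op
       5.532060 usecs/op
       5.703270 usecs/op
       5.506517 usecs/op
EOF
```

In real use the benchmark output would be piped into the function 
instead of the here-document.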

A good way to reduce noise is to measure instructions only:

 starship:~> taskset 0x4 perf stat -e instructions --repeat 5 perf bench sched pipe
       6.962279 usecs/op
       6.917374 usecs/op
       6.928672 usecs/op
       6.939555 usecs/op
       6.942980 usecs/op

 Performance counter stats for 'perf bench sched pipe' (5 runs):

    32,561,773,780      instructions                                                            ( +-  0.27% )

           6.93977 +- 0.00735 seconds time elapsed  ( +-  0.11% )

'Number of instructions executed' is an imperfect proxy for overhead. 
(Not every instruction has the same overhead - but for compiler code 
generation it's a useful proxy in most cases.)

But the best measurement is to avoid the PMU switching overhead via the 
'-a' option, and to limit the measurement to the saturated CPU#2:

 starship:~> taskset 0x4 perf stat -a -C 2 -e instructions --repeat 5 perf bench sched pipe
       5.808068 usecs/op
       5.843716 usecs/op
       5.826543 usecs/op
       5.801616 usecs/op
       5.793129 usecs/op

 Performance counter stats for 'system wide' (5 runs):

    32,244,691,275      instructions                                                            ( +-  0.21% )

           5.81624 +- 0.00912 seconds time elapsed  ( +-  0.16% )

Note how this measurement provides the highest performance of the 
counter-based measurements (5.82 vs. 6.94 seconds elapsed), almost as 
good as the pinned --null run (5.59 seconds).
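
In that command line the same CPU is named twice in two notations: 
taskset takes a hexadecimal bitmask (0x4 has bit 2 set), while 
'perf stat -C' takes a CPU list (2). A small sketch that derives both 
from one CPU number follows. 'pinned_stat_cmd' is a hypothetical 
helper name, and it only prints the command rather than running it.

```shell
# Hypothetical helper: print (not run) the pinned measurement command
# for a single CPU. Usage: pinned_stat_cmd CPU command [args...]
pinned_stat_cmd() {
    cpu=$1
    shift
    # taskset wants a hexadecimal bitmask: bit N set selects CPU N.
    mask=$(printf '0x%x' $((1 << cpu)))
    # perf stat -C wants a CPU list, so the plain CPU number goes there.
    echo "taskset $mask perf stat -a -C $cpu -e instructions --repeat 5 $*"
}

pinned_stat_cmd 2 perf bench sched pipe
# prints: taskset 0x4 perf stat -a -C 2 -e instructions --repeat 5 perf bench sched pipe
```

Printing instead of executing keeps the sketch side-effect free; the 
output can be pasted into a shell once it looks right.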

 - Beware the difference in CPU mask parameters between taskset and perf 
   stat: taskset takes a hexadecimal bitmask (0x4), while 'perf stat -C' 
   takes a CPU list (2). (I tried to convince the perf tooling people to 
   integrate CPU-pinning into perf stat, but I digress.)

 - Beware of cpufreq considerations: changing CPU frequencies will skew 
   your workload's performance by a lot more than the 0.1% kind of 
   noise we are trying to gun for. Setting your CPU governor to 
   'performance' will eliminate some (but not all) cpufreq artifacts.

 - On modern systems there's also boot-to-boot variance in key data 
   structure alignment and cache access patterns, which can sometimes 
   rise beyond the noise of the measurement. These can send you on a 
   wild goose chase ...
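
For the cpufreq point above, the active governor can be checked before 
a run. The sketch below assumes the usual cpufreq sysfs layout 
(/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor). 
'check_governor' is a hypothetical helper name, and its optional 
second argument (an alternative sysfs root) exists only to make it 
easy to try out.

```shell
# Hypothetical helper: report whether a CPU uses the 'performance'
# cpufreq governor. Usage: check_governor CPU [SYSFS_CPU_ROOT]
check_governor() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    file="$root/cpu$cpu/cpufreq/scaling_governor"
    if [ ! -r "$file" ]; then
        # cpufreq may be disabled or not exposed (e.g. inside a VM).
        echo "cpu$cpu: no cpufreq governor file found"
        return 2
    fi
    gov=$(cat "$file")
    if [ "$gov" = "performance" ]; then
        echo "cpu$cpu: governor is performance"
    else
        echo "cpu$cpu: governor is '$gov', expect extra cpufreq noise"
        return 1
    fi
}
```

Calling 'check_governor 2' before the pinned measurement on CPU#2 
gives a quick sanity check. It says nothing about the other cpufreq 
artifacts mentioned above.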

Finally, you can use something like 'nice -n -10' to increase the 
priority of your benchmark and reduce the impact of other workloads 
running on your system.

Anyway, I think noise levels of around 0.1%-0.2% are about the best you 
can expect in context-switch heavy workloads. (A bit better in 
CPU-bound workloads with low context switching.)

Thanks,

	Ingo
