From: Ingo Molnar <mingo@kernel.org>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: David Laight <david.laight.linux@gmail.com>,
Linus Torvalds <torvalds@linuxfoundation.org>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@intel.com>,
x86@kernel.org, linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns
Date: Thu, 6 Mar 2025 23:19:13 +0100 [thread overview]
Message-ID: <Z8ofYTR9nou2650h@gmail.com> (raw)
In-Reply-To: <CAFULd4Yuhb-BbV9LAJ+edMRGEi2kTYfcq70=TTMaSXP3oxwfQQ@mail.gmail.com>
* Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Mar 6, 2025 at 11:45 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 9:14 PM David Laight
> > <david.laight.linux@gmail.com> wrote:
> > >
> > > On Wed, 5 Mar 2025 07:04:08 -1000
> > > Linus Torvalds <torvalds@linuxfoundation.org> wrote:
> > >
> > > > On Tue, 4 Mar 2025 at 22:54, Uros Bizjak <ubizjak@gmail.com> wrote:
> > > > >
> > > > > Even to my surprise, the patch has some noticeable effects on the
> > > > > performance, please see the attachment in [1] for LMBench data or [2]
> > > > > for some excerpts from the data. So, I think the patch has potential
> > > > > to improve the performance.
> > > >
> > > > I suspect some of the performance difference - which looks
> > > > unexpectedly large - is due to having run them on a CPU with the
> > > > horrendous indirect return costs, and then inlining can make a huge
> > > > difference.
> > > ...
> > >
> > > Another possibility is that the processes are getting bounced around
> > > cpu in a slightly different way.
> > > An idle cpu might be running at 800MHz, run something that spins on it
> > > and the clock speed will soon jump to 4GHz.
> > > But if your 'spinning' process is migrated to a different cpu it starts
> > > again at 800MHz.
> > >
> > > (I had something where a fpga compile when from 12 mins to over 20 because
> > > the kernel RSB stuffing caused the scheduler to behave differently even
> > > though nothing was doing a lot of system calls.)
> > >
> > > All sorts of things can affect that - possibly even making some code faster!
> > >
> > > The (IIRC) 30k increase in code size will be a few functions being inlined.
> > > The bloat-o-meter might show which, and forcing a few inlines the same way
> > > should reduce that difference.
> >
> > bloat-o-meter is an excellent idea, I'll analyse binaries some more
> > and report my findings.
>
> Please find attached bloat.txt where:
>
> a) some functions now include once-called functions. These are:
>
> copy_process 6465 10191 +3726
> balance_dirty_pages_ratelimited_flags 237 2949 +2712
> icl_plane_update_noarm 5800 7969 +2169
> samsung_input_mapping 3375 5170 +1795
> ext4_do_update_inode.isra - 1526 +1526
>
> that now include:
>
> ext4_mark_iloc_dirty 1735 106 -1629
> samsung_gamepad_input_mapping.isra 2046 - -2046
> icl_program_input_csc 2203 - -2203
> copy_mm 2242 - -2242
> balance_dirty_pages 2657 - -2657
>
> b) ISRA [interprocedural scalar replacement of aggregates,
> interprocedural pass that removes unused function return values
> (turning functions returning a value which is never used into void
> functions) and removes unused function parameters. It can also
> replace an aggregate parameter by a set of other parameters
> representing part of the original, turning those passed by reference
> into new ones which pass the value directly.]
>
> ext4_do_update_inode.isra - 1526 +1526
> nfs4_begin_drain_session.isra - 249 +249
> nfs4_end_drain_session.isra - 168 +168
> __guc_action_register_multi_lrc_v70.isra 335 500 +165
> __i915_gem_free_objects.isra - 144 +144
> ...
> membarrier_register_private_expedited.isra 108 - -108
> syncobj_eventfd_entry_func.isra 445 314 -131
> __ext4_sb_bread_gfp.isra 140 - -140
> class_preempt_notrace_destructor.isra 145 - -145
> p9_fid_put.isra 151 - -151
> __mm_cid_try_get.isra 238 - -238
> membarrier_global_expedited.isra 294 - -294
> mm_cid_get.isra 295 - -295
> samsung_gamepad_input_mapping.isra.cold 604 - -604
> samsung_gamepad_input_mapping.isra 2046 - -2046
>
> c) different split points of hot/cold split that just move code around:
>
> samsung_input_mapping.cold 900 1500 +600
> __i915_request_reset.cold 311 389 +78
> nfs_update_inode.cold 77 153 +76
> __do_sys_swapon.cold 404 455 +51
> copy_process.cold - 45 +45
> tg3_get_invariants.cold 73 115 +42
> ...
> hibernate.cold 671 643 -28
> copy_mm.cold 31 - -31
> software_resume.cold 249 207 -42
> io_poll_wake.cold 106 54 -52
> samsung_gamepad_input_mapping.isra.cold 604 - -604
>
> c) full inline of small functions with locking insn (~150 cases).
> These bring in most of the performance increase because there is no
> call setup. E.g.:
>
> 0000000000a50e10 <release_devnum>:
> a50e10: 48 63 07 movslq (%rdi),%rax
> a50e13: 85 c0 test %eax,%eax
> a50e15: 7e 10 jle a50e27 <release_devnum+0x17>
> a50e17: 48 8b 4f 50 mov 0x50(%rdi),%rcx
> a50e1b: f0 48 0f b3 41 50 lock btr %rax,0x50(%rcx)
> a50e21: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
> a50e27: e9 00 00 00 00 jmp a50e2c <release_devnum+0x1c>
> a50e28: R_X86_64_PLT32 __x86_return_thunk-0x4
> a50e2c: 0f 1f 40 00 nopl 0x0(%rax)
>
> IMO, for 0.14% code increase, these changes are desirable.
I concur, and it's extra desirable IMHO due to the per function
overhead of CPU bug mitigations like retpolines.
The number of function calls executed in a workload can be measured via
perf on most modern x86 CPUs as well. For example on Zen5 CPUs the
number of RET instructions can be counted:
{
EventName: ex_ret_near_ret,
EventCode: 0xc8,
BriefDescription: Retired near returns (RET or RET Iw).
},
Which ought to be a good proxy for function calls (modulo
tail-optimized jumps).
Thanks,
Ingo
next prev parent reply other threads:[~2025-03-06 22:19 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-28 12:35 [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns Uros Bizjak
2025-02-28 13:13 ` Uros Bizjak
2025-02-28 16:48 ` Dave Hansen
2025-02-28 22:31 ` Uros Bizjak
2025-02-28 22:58 ` Dave Hansen
2025-03-01 9:05 ` Uros Bizjak
2025-03-01 12:38 ` Borislav Petkov
2025-03-05 8:54 ` Uros Bizjak
2025-03-05 17:04 ` Linus Torvalds
2025-03-05 19:40 ` Peter Zijlstra
2025-03-05 19:47 ` Uros Bizjak
2025-03-05 22:18 ` David Laight
2025-03-05 20:14 ` David Laight
2025-03-06 10:45 ` Uros Bizjak
2025-03-06 13:07 ` Uros Bizjak
2025-03-06 22:19 ` Ingo Molnar [this message]
2025-03-08 7:22 ` Uros Bizjak
2025-03-08 19:15 ` H. Peter Anvin
2025-03-05 19:55 ` Ingo Molnar
2025-03-05 20:13 ` Uros Bizjak
2025-03-05 20:21 ` Ingo Molnar
2025-03-06 9:38 ` Uros Bizjak
2025-03-05 20:20 ` Ingo Molnar
2025-03-06 10:52 ` Dirk Gouders
2025-03-06 10:59 ` Ingo Molnar
2025-03-05 20:36 ` Borislav Petkov
2025-03-05 21:26 ` Peter Zijlstra
2025-03-06 9:01 ` Uros Bizjak
2025-03-06 9:43 ` kernel: Current status of CONFIG_CC_OPTIMIZE_FOR_SIZE=y (was: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns) Ingo Molnar
2025-03-06 10:37 ` Arnd Bergmann
2025-03-06 20:37 ` David Laight
2025-03-03 13:12 ` [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns David Laight
2025-03-02 20:56 ` Uros Bizjak
2025-03-03 12:23 ` Uros Bizjak
2025-03-08 19:08 ` H. Peter Anvin
2025-03-09 7:50 ` Uros Bizjak
2025-03-09 9:46 ` David Laight
2025-03-09 9:57 ` Uros Bizjak
2025-03-06 9:57 ` Ingo Molnar
2025-03-06 10:26 ` Uros Bizjak
2025-03-06 10:38 ` Ingo Molnar
2025-03-06 10:50 ` Ingo Molnar
2025-03-06 13:56 ` Uros Bizjak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z8ofYTR9nou2650h@gmail.com \
--to=mingo@kernel.org \
--cc=bp@alien8.de \
--cc=dave.hansen@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=david.laight.linux@gmail.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=torvalds@linuxfoundation.org \
--cc=ubizjak@gmail.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.