All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: David Laight <david.laight.linux@gmail.com>,
	Linus Torvalds <torvalds@linuxfoundation.org>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns
Date: Thu, 6 Mar 2025 23:19:13 +0100	[thread overview]
Message-ID: <Z8ofYTR9nou2650h@gmail.com> (raw)
In-Reply-To: <CAFULd4Yuhb-BbV9LAJ+edMRGEi2kTYfcq70=TTMaSXP3oxwfQQ@mail.gmail.com>


* Uros Bizjak <ubizjak@gmail.com> wrote:

> On Thu, Mar 6, 2025 at 11:45 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 9:14 PM David Laight
> > <david.laight.linux@gmail.com> wrote:
> > >
> > > On Wed, 5 Mar 2025 07:04:08 -1000
> > > Linus Torvalds <torvalds@linuxfoundation.org> wrote:
> > >
> > > > On Tue, 4 Mar 2025 at 22:54, Uros Bizjak <ubizjak@gmail.com> wrote:
> > > > >
> > > > > Even to my surprise, the patch has some noticeable effects on the
> > > > > performance, please see the attachment in [1] for LMBench data or [2]
> > > > > for some excerpts from the data. So, I think the patch has potential
> > > > > to improve the performance.
> > > >
> > > > I suspect some of the performance difference - which looks
> > > > unexpectedly large - is due to having run them on a CPU with the
> > > > horrendous indirect return costs, and then inlining can make a huge
> > > > difference.
> > > ...
> > >
> > > Another possibility is that the processes are getting bounced around
> > > cpu in a slightly different way.
> > > An idle cpu might be running at 800MHz, run something that spins on it
> > > and the clock speed will soon jump to 4GHz.
> > > But if your 'spinning' process is migrated to a different cpu it starts
> > > again at 800MHz.
> > >
> > > (I had something where a fpga compile when from 12 mins to over 20 because
> > > the kernel RSB stuffing caused the scheduler to behave differently even
> > > though nothing was doing a lot of system calls.)
> > >
> > > All sorts of things can affect that - possibly even making some code faster!
> > >
> > > The (IIRC) 30k increase in code size will be a few functions being inlined.
> > > The bloat-o-meter might show which, and forcing a few inlines the same way
> > > should reduce that difference.
> >
> > bloat-o-meter is an excellent idea, I'll analyse binaries some more
> > and report my findings.
> 
> Please find attached bloat.txt where:
> 
> a) some functions now include once-called functions. These are:
> 
> copy_process                                6465   10191   +3726
> balance_dirty_pages_ratelimited_flags        237    2949   +2712
> icl_plane_update_noarm                      5800    7969   +2169
> samsung_input_mapping                       3375    5170   +1795
> ext4_do_update_inode.isra                      -    1526   +1526
> 
> that now include:
> 
> ext4_mark_iloc_dirty                        1735     106   -1629
> samsung_gamepad_input_mapping.isra          2046       -   -2046
> icl_program_input_csc                       2203       -   -2203
> copy_mm                                     2242       -   -2242
> balance_dirty_pages                         2657       -   -2657
> 
> b) ISRA [interprocedural scalar replacement of aggregates,
> interprocedural pass that removes unused function return values
> (turning functions returning a value which is never used into void
> functions) and removes unused function parameters.  It can also
> replace an aggregate parameter by a set of other parameters
> representing part of the original, turning those passed by reference
> into new ones which pass the value directly.]
> 
> ext4_do_update_inode.isra                      -    1526   +1526
> nfs4_begin_drain_session.isra                  -     249    +249
> nfs4_end_drain_session.isra                    -     168    +168
> __guc_action_register_multi_lrc_v70.isra     335     500    +165
> __i915_gem_free_objects.isra                   -     144    +144
> ...
> membarrier_register_private_expedited.isra     108       -    -108
> syncobj_eventfd_entry_func.isra              445     314    -131
> __ext4_sb_bread_gfp.isra                     140       -    -140
> class_preempt_notrace_destructor.isra        145       -    -145
> p9_fid_put.isra                              151       -    -151
> __mm_cid_try_get.isra                        238       -    -238
> membarrier_global_expedited.isra             294       -    -294
> mm_cid_get.isra                              295       -    -295
> samsung_gamepad_input_mapping.isra.cold      604       -    -604
> samsung_gamepad_input_mapping.isra          2046       -   -2046
> 
> c) different split points of hot/cold split that just move code around:
> 
> samsung_input_mapping.cold                   900    1500    +600
> __i915_request_reset.cold                    311     389     +78
> nfs_update_inode.cold                         77     153     +76
> __do_sys_swapon.cold                         404     455     +51
> copy_process.cold                              -      45     +45
> tg3_get_invariants.cold                       73     115     +42
> ...
> hibernate.cold                               671     643     -28
> copy_mm.cold                                  31       -     -31
> software_resume.cold                         249     207     -42
> io_poll_wake.cold                            106      54     -52
> samsung_gamepad_input_mapping.isra.cold      604       -    -604
> 
> c) full inline of small functions with locking insn (~150 cases).
> These bring in most of the performance increase because there is no
> call setup. E.g.:
> 
> 0000000000a50e10 <release_devnum>:
>   a50e10:    48 63 07                 movslq (%rdi),%rax
>   a50e13:    85 c0                    test   %eax,%eax
>   a50e15:    7e 10                    jle    a50e27 <release_devnum+0x17>
>   a50e17:    48 8b 4f 50              mov    0x50(%rdi),%rcx
>   a50e1b:    f0 48 0f b3 41 50        lock btr %rax,0x50(%rcx)
>   a50e21:    c7 07 ff ff ff ff        movl   $0xffffffff,(%rdi)
>   a50e27:    e9 00 00 00 00           jmp    a50e2c <release_devnum+0x1c>
>             a50e28: R_X86_64_PLT32    __x86_return_thunk-0x4
>   a50e2c:    0f 1f 40 00              nopl   0x0(%rax)
> 
> IMO, for 0.14% code increase, these changes are desirable.

I concur, and it's extra desirable IMHO due to the per function 
overhead of CPU bug mitigations like retpolines.

The number of function calls executed in a workload can be measured via 
perf on most modern x86 CPUs as well. For example on Zen5 CPUs the 
number of RET instructions can be counted:

  {
    EventName: ex_ret_near_ret,
    EventCode: 0xc8,
    BriefDescription: Retired near returns (RET or RET Iw).
  },

Which ought to be a good proxy for function calls (modulo 
tail-optimized jumps).

Thanks,

	Ingo

  reply	other threads:[~2025-03-06 22:19 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-28 12:35 [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns Uros Bizjak
2025-02-28 13:13 ` Uros Bizjak
2025-02-28 16:48 ` Dave Hansen
2025-02-28 22:31   ` Uros Bizjak
2025-02-28 22:58     ` Dave Hansen
2025-03-01  9:05       ` Uros Bizjak
2025-03-01 12:38         ` Borislav Petkov
2025-03-05  8:54           ` Uros Bizjak
2025-03-05 17:04             ` Linus Torvalds
2025-03-05 19:40               ` Peter Zijlstra
2025-03-05 19:47               ` Uros Bizjak
2025-03-05 22:18                 ` David Laight
2025-03-05 20:14               ` David Laight
2025-03-06 10:45                 ` Uros Bizjak
2025-03-06 13:07                   ` Uros Bizjak
2025-03-06 22:19                     ` Ingo Molnar [this message]
2025-03-08  7:22                       ` Uros Bizjak
2025-03-08 19:15               ` H. Peter Anvin
2025-03-05 19:55             ` Ingo Molnar
2025-03-05 20:13               ` Uros Bizjak
2025-03-05 20:21                 ` Ingo Molnar
2025-03-06  9:38                   ` Uros Bizjak
2025-03-05 20:20               ` Ingo Molnar
2025-03-06 10:52                 ` Dirk Gouders
2025-03-06 10:59                   ` Ingo Molnar
2025-03-05 20:36             ` Borislav Petkov
2025-03-05 21:26               ` Peter Zijlstra
2025-03-06  9:01                 ` Uros Bizjak
2025-03-06  9:43                   ` kernel: Current status of CONFIG_CC_OPTIMIZE_FOR_SIZE=y (was: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns) Ingo Molnar
2025-03-06 10:37                     ` Arnd Bergmann
2025-03-06 20:37                     ` David Laight
2025-03-03 13:12       ` [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns David Laight
2025-03-02 20:56   ` Uros Bizjak
2025-03-03 12:23     ` Uros Bizjak
2025-03-08 19:08   ` H. Peter Anvin
2025-03-09  7:50     ` Uros Bizjak
2025-03-09  9:46       ` David Laight
2025-03-09  9:57         ` Uros Bizjak
2025-03-06  9:57 ` Ingo Molnar
2025-03-06 10:26   ` Uros Bizjak
2025-03-06 10:38     ` Ingo Molnar
2025-03-06 10:50       ` Ingo Molnar
2025-03-06 13:56   ` Uros Bizjak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8ofYTR9nou2650h@gmail.com \
    --to=mingo@kernel.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david.laight.linux@gmail.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=torvalds@linuxfoundation.org \
    --cc=ubizjak@gmail.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.