All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dmitry Ilvokhin <d@ilvokhin.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
	Boqun Feng <boqun@kernel.org>, Waiman Long <longman@redhat.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Juergen Gross <jgross@suse.com>,
	Ajay Kaher <ajay.kaher@broadcom.com>,
	Alexey Makhalov <alexey.makhalov@broadcom.com>,
	Broadcom internal kernel review list
	<bcm-kernel-feedback-list@broadcom.com>,
	Thomas Gleixner <tglx@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Arnd Bergmann <arnd@arndb.de>, Dennis Zhou <dennis@kernel.org>,
	Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	virtualization@lists.linux.dev, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	kernel-team@meta.com, "Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
Date: Thu, 14 May 2026 12:34:55 +0000	[thread overview]
Message-ID: <agXBb0ga_6HJrrnm@shell.ilvokhin.com> (raw)
In-Reply-To: <20260513193342.GB2545104@noisy.programming.kicks-ass.net>

On Wed, May 13, 2026 at 09:33:42PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 05:09:34PM +0000, Dmitry Ilvokhin wrote:
> > Use the arch-overridable queued_spin_release(), introduced in the
> > previous commit, to ensure the tracepoint works correctly across all
> > architectures, including those with custom unlock implementations (e.g.
> > x86 paravirt).
> > 
> > When the tracepoint is disabled, the only addition to the hot path is a
> > single NOP instruction (the static branch). When enabled, the contention
> > check, trace call, and unlock are combined in an out-of-line function to
> > minimize hot path impact, avoiding the compiler needing to preserve the
> > lock pointer in a callee-saved register across the trace call.
> > 
> > Binary size impact (x86_64, defconfig):
> >   uninlined unlock (common case): +680 bytes  (+0.00%)
> >   inlined unlock (worst case):    +83659 bytes (+0.21%)
> > 
> > The inlined unlock case could not be achieved through Kconfig options on
> > x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
> > x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
> > inline the unlock path and estimate the worst case binary size increase.
> > 
> > In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
> > opted against binary size optimization, so the inlined worst case is
> > unlikely to be a concern.
> 
> This is not quite accurate. You add the (5byte) NOP for the static
> branch, but then you also add another 5 bytes for the CALL and at least
> another 2 bytes (possibly 5) for a JMP back into the previous stream.
> That is 12-15 bytes added to what was a single MOV instruction.
> 
> That is quite ludicrous.

Thanks for the feedback, Peter. This is exactly the kind of feedback I
was looking for.

I understand your concerns and initially I had exactly the same
thoughts, and after I looked into the generated code more carefully the
impact on the executed path is smaller than the total size increase
suggests.

Generated code of _raw_spin_unlock() for baseline (before the patch) is
31 bytes in total (x86_64, defconfig, GCC 11).

    3e0:  endbr64                          ; 4 bytes
    3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock)
    3e7:  decl %gs:__preempt_count         ; 7 bytes
    3ee:  je   3f5                         ; 2 bytes
    3f0:  jmp  __x86_return_thunk          ; 5 bytes
    3f5:  call __SCT__preempt_schedule     ; 5 bytes
    3fa:  jmp  __x86_return_thunk          ; 5 bytes

Generated code of _raw_spin_unlock() with tracepoint (after the patch
applied) is 40 bytes in total.

    bc0:  endbr64                          ; 4 bytes
    bc4:  xchg %ax,%ax                     ; 2 bytes (NOP, static branch)
    bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock)
    bc9:  decl %gs:__preempt_count         ; 7 bytes
    bd0:  je   bde                         ; 2 bytes
    bd2:  jmp  __x86_return_thunk          ; 5 bytes
    bd7:  call queued_spin_release_traced  ; 5 bytes
    bdc:  jmp  bc9                         ; 2 bytes
    bde:  call __SCT__preempt_schedule     ; 5 bytes
    be3:  jmp  __x86_return_thunk          ; 5 bytes

It is 40 bytes (+9 bytes compared to baseline, 2 bytes for NOP and 7
bytes for CALL and JMP).

But if we look at the executed path the picture is a bit different.

Baseline, in best case scenario of least number of executed
instructions.

    3e0:  endbr64                          ; 4 bytes (always executed)
    3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
                                           ; always executed)
    3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
    3ee:  je   3f5                         ; 2 bytes (always executed)
    3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
                                           ; je is not taken)
                                           ; rest is not executed
    3f5:  call __SCT__preempt_schedule     ; 5 bytes
    3fa:  jmp  __x86_return_thunk          ; 5 bytes

Tracepoint (again same case of least number of executed instructions).

    bc0:  endbr64                          ; 4 bytes (always executed)
    bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is an
                                           ; only addition on the execution path).
    bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
    bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
    bd0:  je   bde                         ; 2 bytes (always executed)
    bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
                                           ; je is not taken)
                                           ; rest is not executed
    bd7:  call queued_spin_release_traced  ; 5 bytes
    bdc:  jmp  bc9                         ; 2 bytes
    bde:  call __SCT__preempt_schedule     ; 5 bytes
    be3:  jmp  __x86_return_thunk          ; 5 bytes

On the execution path we are getting 21 byte worth of instructions on
baseline against 23 bytes. The only addition on any executed path is the
2-byte NOP, that has a special treatment in CPU, cheap, but not entirely
free.

From a total size perspective it's 9 bytes, but on the executed path it's
a single 2-byte NOP.

Does this change the picture for you, or is the NOP still a concern for
this path?

> 
> I disagree that UNINLINE_SPIN_UNLOCK=n opts against binary size. For x86
> the unlock is smaller than a function call.
> 

Fair point on the UNINLINE_SPIN_UNLOCK characterization, but
UNINLINE_SPIN_UNLOCKis always "y" on x86_64. The inlined case only
applies to s390 (unconditionally), csky and loongarch (when
!PREEMPTION). I'll remove this, thanks.

> 
> I really don't see how this is worth it.


  reply	other threads:[~2026-05-14 12:35 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:09 [PATCH v6 0/7] locking: contended_release tracepoint instrumentation Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 1/7] tracing/lock: Remove unnecessary linux/sched.h include Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 2/7] locking/percpu-rwsem: Extract __percpu_up_read() Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 3/7] locking: Add contended_release tracepoint to sleepable locks Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 4/7] locking: Factor out queued_spin_release() Dmitry Ilvokhin
2026-05-13 15:37   ` Steven Rostedt
2026-05-05 17:09 ` [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock Dmitry Ilvokhin
2026-05-13 15:41   ` Steven Rostedt
2026-05-14 14:13     ` Dmitry Ilvokhin
2026-05-14 16:03       ` Steven Rostedt
2026-05-15 14:40         ` Dmitry Ilvokhin
2026-05-13 19:33   ` Peter Zijlstra
2026-05-14 12:34     ` Dmitry Ilvokhin [this message]
2026-05-05 17:09 ` [PATCH v6 6/7] locking: Factor out __queued_read_unlock()/__queued_write_unlock() Dmitry Ilvokhin
2026-05-13 15:41   ` Steven Rostedt
2026-05-05 17:09 ` [PATCH v6 7/7] locking: Add contended_release tracepoint to qrwlock Dmitry Ilvokhin
2026-05-13 15:43   ` Steven Rostedt
2026-05-13 19:26 ` [PATCH v6 0/7] locking: contended_release tracepoint instrumentation Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agXBb0ga_6HJrrnm@shell.ilvokhin.com \
    --to=d@ilvokhin.com \
    --cc=ajay.kaher@broadcom.com \
    --cc=alexey.makhalov@broadcom.com \
    --cc=arnd@arndb.de \
    --cc=bcm-kernel-feedback-list@broadcom.com \
    --cc=boqun@kernel.org \
    --cc=bp@alien8.de \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=dennis@kernel.org \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=kernel-team@meta.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@kernel.org \
    --cc=tj@kernel.org \
    --cc=tsbogend@alpha.franken.de \
    --cc=virtualization@lists.linux.dev \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.