Generic Linux architectural discussions
 help / color / mirror / Atom feed
From: Dmitry Ilvokhin <d@ilvokhin.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
	Boqun Feng <boqun@kernel.org>, Waiman Long <longman@redhat.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Juergen Gross <jgross@suse.com>,
	Ajay Kaher <ajay.kaher@broadcom.com>,
	Alexey Makhalov <alexey.makhalov@broadcom.com>,
	Broadcom internal kernel review list
	<bcm-kernel-feedback-list@broadcom.com>,
	Thomas Gleixner <tglx@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Arnd Bergmann <arnd@arndb.de>, Dennis Zhou <dennis@kernel.org>,
	Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	virtualization@lists.linux.dev, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	kernel-team@meta.com, "Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
Date: Thu, 11 Jun 2026 07:17:41 +0000	[thread overview]
Message-ID: <aiphFXe_TPNPxZ_n@shell.ilvokhin.com> (raw)
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
x86 and measured the effect on a few real workloads.

Short version: inlining of _raw_spin_unlock() adds measurable kernel
i-cache pressure on every workload I tried, and on a
kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
throughput. I did not find a workload where it helps.

HOW BENCHMARKS WERE CHOSEN

The cost of inlining unlock is text footprint increase. Every unlock
site grows, and the extra bytes compete for the shared L1i. The bill is
paid by unrelated code, in both kernel and userspace.

Locktorture and similar microbenchmarks can't see this, because they
usually hammer a tiny loop that stays L1i-resident, so they measure
fast-path cycles, where inlining (fewer instructions per unlock) looks
neutral-to-good.

To make the cost visible, the workload has to have real instruction
cache pressure. To achieve that, it has to touch a lot of code.

A good way to screen benchmarks: look for high tma_frontend_bound
fraction from 'perf stat -M TopdownL1' and simultaneously require it to
spend non-trivial time in the kernel (be syscall-heavy).

SETUP

Hardware: 2x Intel Xeon Gold 6138 (Skylake-SP), 20 cores/socket, 40C/80T
with kernel built from locking/core branch. Baseline _raw_spin_unlock()
is out-of-line via UNINLINE_SPIN_UNLOCK=y. Experiment adds the four
selects above (exact patch is at the end of this message). Cache
geometry (lscpu -C):

NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL  SETS PHY-LINE COHERENCY-SIZE
L1d       32K     1.3M    8 Data            1    64        1             64
L1i       32K     1.3M    8 Instruction     1    64        1             64
L2         1M      40M   16 Unified         2  1024        1             64
L3      27.5M      55M   11 Unified         3 40960        1             64

Per run I collected cycles, instructions and L1i-misses. To stay within
the available PMU counters, each run used only 3 events: cycles,
instructions and one L1i filter (:u or :k). The NMI watchdog was off and
every run reported 100% counter enablement (no multiplexing). Userspace
and kernel misses therefore come from separate runs. Each benchmark was
run 20x per side: 10 with the :u counter, 10 with :k.  Cycles,
instructions and throughput are pooled across all 20, each L1i split
comes from its 10.

KERNEL IMAGE SIZE

To give a sense of the code-footprint increase, scripts/bloat-o-meter on
vmlinux, GCC 11, x86_64, defconfig + CONFIG_PARAVIRT_SPINLOCKS=y:

    Total: Before=23838694, After=23977159, chg +0.58%

ROCKSDB (DELETESEQ)

    db_bench -benchmarks=deleteseq

Metric                       Baseline      Experiment     Delta   Sig
----------------------------------------------------------------------
Instructions (total)    9,574,476,543   9,573,602,441    -0.01%   flat
L1i-miss :k (kernel)      198,588,165     216,672,536    +9.11%   **
L1i-miss :u (userspace)   593,276,235     616,433,813    +3.90%   **
Throughput ops/s            431,398         432,897      +0.35%   ns
Cycles (total)          4,681,002,302   4,665,106,876    -0.34%   ns
IPC                          2.045           2.052       +0.33%   ns
Time elapsed (s)            2.4012          2.3865       -0.62%   ns
----------------------------------------------------------------------
L1i-miss: higher = worse. Throughput: higher = better.
** = beyond per-run noise (+-0.1..0.36%), ns = within noise.

At constant instructions, inlining raises L1i misses +9.11% (kernel) and
+3.90% (userspace), both well beyond noise. Throughput, cycles, IPC and
wall-time all stay within run-to-run noise. So the i-cache cost is real,
but at IPC ~2 db_bench isn't fetch-bound at the app level, so it doesn't
surface.

No benefit from _raw_spin_unlock() inlining.

KERNEL BUILD

Building locking/core (defconfig), GCC 11.

    make -j80

Metric              Baseline      Experiment     Delta   Sig
-------------------------------------------------------------
L1i-miss :k          36.72G        37.51G       +2.16%   **
L1i-miss :u         246.99G       246.06G       -0.38%   **
Sys (s)             478.250       482.420       +0.87%   **
Time elapsed (s)    105.221       105.373       +0.14%   ns
User (s)           4022.046      4024.012       +0.05%   flat
Cycles            8,894.10G     8,902.12G       +0.09%   flat
Instructions      8,424.28G     8,426.48G       +0.03%   flat
IPC                   0.947         0.947       -0.06%   flat
-------------------------------------------------------------
L1i-miss/Sys: higher = worse.
** = beyond per-run noise, ns = within noise.

Kernel i-cache misses (+2.16%) and sys time (+0.87%) both rise and are
significant. Wall-time and userspace L1i are flat. Kernel build is
GCC/userspace-bound (User 4022s vs Sys 478s), so the added kernel fetch
cost is real but appears to sit off the critical path.

No benefit from _raw_spin_unlock() inlining.

NGINX

I ran nginx with taskset -c 2.

    perf stat -C 2 ... -- ab -n 100000 -c 80 http://127.0.0.1:8080/

Config for nginx was the following.

  worker_processes 1;
  error_log /tmp/ngx/error.log;
  pid       /tmp/ngx/nginx.pid;
  events { worker_connections 16384; }
  http {
      access_log off;
      server { listen 8080 reuseport; location / { return 200 "ok\n"; } }
  }


I used nginx version 1.20.1 (prebuilt, from CentOS repo).

Metric              Baseline      Experiment     Delta   Sig
------------------------------------------------------------
req/s (ab)           25,113        24,795       -1.27%   **
L1i MPKI :k          70.06         72.10        +2.92%   **
L1i MPKI :u          20.16         20.66        +2.50%   **
instructions          5.86G         5.83G       -0.50%   **
L1i-miss :k           0.41G         0.42G       +2.44%   **
L1i-miss :u           0.12G         0.12G       +1.95%   **
cycles                4.82G         4.81G       -0.28%   ns
IPC                   1.215         1.213       -0.22%   ns
perf time (s)         4.077         4.129       +1.26%   **
failed reqs              0             0          -      valid
------------------------------------------------------------
req/s: higher=better. MPKI: higher=worse.
** = beyond per-run noise, ns = within noise.

nginx connection-churn is the one workload that is genuinely
kernel-fetch-bound: MPKI:k ~70 and IPC ~1.2 (vs db_bench's 2.05). Here
the cost surfaces: req/s −1.27%. Misses rise in both domains (+2.9%
MPKI:k, +2.5% MPKI:u). Unlike kernel build, userspace is hit too,
because nginx runs user and kernel hot on the same core and the kernel
bloat pollutes the shared L1i.

And the kicker: instructions fell 0.5% (inlining removed the call/ret)
yet throughput dropped.

Caveat: ab is single-threaded, so it seems the worker core is
under-saturated: cycles is flat (−0.28%, ns) while wall-time rose
(+1.26%).

Measurable throughput regression from _raw_spin_unlock() inlining.

CONCLUSION

Inlining _raw_spin_unlock() raises kernel L1i misses on every workload.
It's an unconditional cost. Whether it costs the application throughput
depends on how kernel-fetch-bound the workload is.
  
The cost is real everywhere. It only surfaces as throughput regression
where the kernel is on the fetch critical path. And inlining did not
help in any workload I measured. The one micro-effect inlining produced
(-0.5% instructions on nginx) was erased by the added i-cache pressure.


From 99502328caed3c195e20cf194a1e8aa1563f3896 Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin <d@ilvokhin.com>
Date: Thu, 4 Jun 2026 07:43:00 -0700
Subject: [PATCH] x86/locking: Inline the spin_unlock()

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fdaef60b46d6..c9a0638225fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -113,6 +113,10 @@ config X86
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
+	select ARCH_INLINE_SPIN_UNLOCK
+	select ARCH_INLINE_SPIN_UNLOCK_BH
+	select ARCH_INLINE_SPIN_UNLOCK_IRQ
+	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE
 	select ARCH_MEMORY_ORDER_TSO
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
-- 
2.53.0-Meta


  parent reply	other threads:[~2026-06-11  7:17 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:09 [PATCH v6 0/7] locking: contended_release tracepoint instrumentation Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 1/7] tracing/lock: Remove unnecessary linux/sched.h include Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 2/7] locking/percpu-rwsem: Extract __percpu_up_read() Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 3/7] locking: Add contended_release tracepoint to sleepable locks Dmitry Ilvokhin
2026-05-05 17:09 ` [PATCH v6 4/7] locking: Factor out queued_spin_release() Dmitry Ilvokhin
2026-05-13 15:37   ` Steven Rostedt
2026-05-05 17:09 ` [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock Dmitry Ilvokhin
2026-05-13 15:41   ` Steven Rostedt
2026-05-14 14:13     ` Dmitry Ilvokhin
2026-05-14 16:03       ` Steven Rostedt
2026-05-15 14:40         ` Dmitry Ilvokhin
2026-05-13 19:33   ` Peter Zijlstra
2026-05-14 12:34     ` Dmitry Ilvokhin
2026-05-27 13:30       ` Dmitry Ilvokhin
2026-06-03 12:08       ` Peter Zijlstra
2026-06-03 14:17         ` Dmitry Ilvokhin
2026-06-03 14:26           ` Peter Zijlstra
2026-06-11  7:17         ` Dmitry Ilvokhin [this message]
2026-06-11 13:44           ` Peter Zijlstra
2026-05-05 17:09 ` [PATCH v6 6/7] locking: Factor out __queued_read_unlock()/__queued_write_unlock() Dmitry Ilvokhin
2026-05-13 15:41   ` Steven Rostedt
2026-05-05 17:09 ` [PATCH v6 7/7] locking: Add contended_release tracepoint to qrwlock Dmitry Ilvokhin
2026-05-13 15:43   ` Steven Rostedt
2026-05-13 19:26 ` [PATCH v6 0/7] locking: contended_release tracepoint instrumentation Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiphFXe_TPNPxZ_n@shell.ilvokhin.com \
    --to=d@ilvokhin.com \
    --cc=ajay.kaher@broadcom.com \
    --cc=alexey.makhalov@broadcom.com \
    --cc=arnd@arndb.de \
    --cc=bcm-kernel-feedback-list@broadcom.com \
    --cc=boqun@kernel.org \
    --cc=bp@alien8.de \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=dennis@kernel.org \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=kernel-team@meta.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@kernel.org \
    --cc=tj@kernel.org \
    --cc=tsbogend@alpha.franken.de \
    --cc=virtualization@lists.linux.dev \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox