From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail.ilvokhin.com (mail.ilvokhin.com [178.62.254.231])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 94A2C33BBB1;
	Wed,  3 Jun 2026 14:17:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=178.62.254.231
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780496260; cv=none; b=gGUUszG56KgkeOlMnWVM6ZLFja56axpBtqbBmZ/ag8Q+u9lONmyzj4ki3h1jbz1qYW5SJzQ1cAceYQB7j6VGKRh1fnu8ztqxtYGWh3dlFXD4J0o3KKz87Zpz+6U1ztlv5bDBE8Psym3sHhk3wB8366yRdWoFIiKF6BV9kP494Gc=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780496260; c=relaxed/simple;
	bh=Z4Pf4sIO0yokS/Sf6opVPQp6UT9LjyvVbSLpqW/VJXM=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=ULxjst5WTXLIxCPIygR3t46E/Je+9b5TpcMmdH6tasYs+HzuRwO56rt/j1zPI4Jw2jN7w+nEGgCSKfyN+GNvLCwzc4RjerMG4U13LFPt5eaPBNUh3Dr1kbb+6exwycJjKQFNPOe0FgrW0kyjq+d4045f4G9uYHgF27/+liW4CKI=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=ilvokhin.com; spf=pass smtp.mailfrom=ilvokhin.com; dkim=pass (1024-bit key) header.d=ilvokhin.com header.i=@ilvokhin.com header.b=gLhilNgA; arc=none smtp.client-ip=178.62.254.231
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=ilvokhin.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=ilvokhin.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=ilvokhin.com header.i=@ilvokhin.com header.b="gLhilNgA"
Received: from shell.ilvokhin.com (shell.ilvokhin.com [138.68.190.75])
	(Authenticated sender: d@ilvokhin.com)
	by mail.ilvokhin.com (Postfix) with ESMTPSA id A76B4D10A6;
	Wed, 03 Jun 2026 14:17:30 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ilvokhin.com;
	s=mail; t=1780496251;
	bh=SfhK82Xil3vBLBzbtyq+NFAA8bzLZV/4SRFeV2fyIBU=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To;
	b=gLhilNgA/246de7OKuDBzuIzyl+ldRCWUnvUz13vpuaTrLa5lQCdfs0BRiS9CyABu
	 A885RQ4sztnBP3xCfblx2wxbXYzH+/yU+KJrXASNY9Wy8+6TjWfuu8hmB6q0torO+N
	 UusytLcEBa+RdN7aUdMerhdk06+/T4dloG+6ZunE=
Date: Wed, 3 Jun 2026 14:17:27 +0000
From: Dmitry Ilvokhin <d@ilvokhin.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
	Boqun Feng <boqun@kernel.org>, Waiman Long <longman@redhat.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Juergen Gross <jgross@suse.com>,
	Ajay Kaher <ajay.kaher@broadcom.com>,
	Alexey Makhalov <alexey.makhalov@broadcom.com>,
	Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>,
	Thomas Gleixner <tglx@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org,
	"H. Peter Anvin" <hpa@zytor.com>, Arnd Bergmann <arnd@arndb.de>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@gentwo.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	virtualization@lists.linux.dev, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	kernel-team@meta.com, "Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to
 qspinlock
Message-ID: <aiA3dzjKVq2_cpSY@shell.ilvokhin.com>
References: <cover.1777999826.git.d@ilvokhin.com>
 <5d7ea75ffe74a785e6b234ada9f23c6373d4b4c1.1777999826.git.d@ilvokhin.com>
 <20260513193342.GB2545104@noisy.programming.kicks-ass.net>
 <agXBb0ga_6HJrrnm@shell.ilvokhin.com>
 <20260603120811.GW3493090@noisy.programming.kicks-ass.net>
Precedence: bulk
X-Mailing-List: linux-trace-kernel@vger.kernel.org
List-Id: <linux-trace-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-trace-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-trace-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote:
> 
> > Baseline, in best case scenario of least number of executed
> > instructions.
> > 
> >     3e0:  endbr64                          ; 4 bytes (always executed)
> >     3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
> >                                            ; always executed)
> >     3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     3ee:  je   3f5                         ; 2 bytes (always executed)
> >     3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     3f5:  call __SCT__preempt_schedule     ; 5 bytes
> >     3fa:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> > Tracepoint (again same case of least number of executed instructions).
> > 
> >     bc0:  endbr64                          ; 4 bytes (always executed)
> >     bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is an
> >                                            ; only addition on the execution path).
> >     bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
> >     bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     bd0:  je   bde                         ; 2 bytes (always executed)
> >     bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     bd7:  call queued_spin_release_traced  ; 5 bytes
> >     bdc:  jmp  bc9                         ; 2 bytes
> >     bde:  call __SCT__preempt_schedule     ; 5 bytes
> >     be3:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> 
> So I've been playing with this a bit, and it is all really sad.
> 
> Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the
> 'best' solution would be changing that paravirt call with a
> static_call(), that actually shrinks the code by 1 byte.
> 
> And then this tracepoint nonsense can simply use a different unlock
> function, just like paravirt.
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	ff 15 00 00 00 00    	call   *0x0(%rip)        # 1da <_raw_spin_unlock+0xa>	1d6: R_X86_64_PC32	pv_ops_lock+0x4
> 000a  1da:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e1 <_raw_spin_unlock+0x11>	1dd: R_X86_64_PC32	__preempt_count-0x4
> 0011  1e1:	74 06                	je     1e9 <_raw_spin_unlock+0x19>
> 0013  1e3:	2e e9 00 00 00 00    	cs jmp 1e9 <_raw_spin_unlock+0x19>	1e5: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0019  1e9:	e8 00 00 00 00       	call   1ee <_raw_spin_unlock+0x1e>	1ea: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001e  1ee:	2e e9 00 00 00 00    	cs jmp 1f4 <_raw_spin_unlock+0x24>	1f0: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	e8 00 00 00 00       	call   1d9 <_raw_spin_unlock+0x9>	1d5: R_X86_64_PLT32	__SCT__queued_spin_unlock-0x4
> 0009  1d9:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e0 <_raw_spin_unlock+0x10>	1dc: R_X86_64_PC32	__preempt_count-0x4
> 0010  1e0:	74 06                	je     1e8 <_raw_spin_unlock+0x18>
> 0012  1e2:	2e e9 00 00 00 00    	cs jmp 1e8 <_raw_spin_unlock+0x18>	1e4: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0018  1e8:	e8 00 00 00 00       	call   1ed <_raw_spin_unlock+0x1d>	1e9: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001d  1ed:	2e e9 00 00 00 00    	cs jmp 1f3 <_raw_spin_unlock+0x23>	1ef: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> Something a little like so, which is completely untested, except to
> build kernel/locking/spinlock.o (with clang-23).

Thanks a lot for taking a look, Peter.

I like the static_call idea. It's truly zero cost on x86 (and, as you
note, even a byte smaller). The one caveat is that it relies on
HAVE_STATIC_CALL_INLINE to stay free. 

So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is
available (x86), and a static branch fallback elsewhere, gated behind a
default-off config so it imposes nothing on arches/kernels that don't
opt in. I'm mostly interested in x86, but would like arm64 to work too,
which would use the fallback.

Concretely:

1. Split the sleepable-lock patches out and send them separately.
   They're independent of the static call work and look far less
   controversial.

2. Convert the paravirt spinlock unlock to a static_call, as the
   foundation for the unlock tracepoint. I'm happy to take a stab at it.
   Let me know if you'd rather do it yourself.

3. Build the unlock tracepoint on top: static_call where it's cheap,
   config-gated static_branch fallback where it isn't.

Does this plan sound reasonable to you?

> 
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

That's a good point, I'll run benchmarks and report back with the
results.