[PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Uros Bizjak <ubizjak@gmail.com>
To: x86@kernel.org, linux-kernel@vger.kernel.org
Cc: Uros Bizjak <ubizjak@gmail.com>,
	Sean Christopherson <seanjc@google.com>,
	Nadav Amit <namit@vmware.com>, Ingo Molnar <mingo@kernel.org>,
	Andy Lutomirski <luto@kernel.org>,
	Brian Gerst <brgerst@gmail.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
Date: Sun, 15 Oct 2023 22:24:41 +0200	[thread overview]
Message-ID: <20231015202523.189168-3-ubizjak@gmail.com> (raw)
In-Reply-To: <20231015202523.189168-1-ubizjak@gmail.com>

*EXPERIMENTAL PATCH, NOT FOR MERGE*

The patch introduces 'rdgsbase' alternative for CPUs with
X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
only decoding in the first decoder etc. But we're talking single-cycle
kind of effects, and the rdgsbase case should be much better from
a cache perspective and might use fewer memory pipeline resources to
offset the fact that it uses an unusual front end decoder resource...

The drawback of the patch is also larger binary size:

   text    data     bss     dec     hex filename
25549518        4388100  808452 30746070        1d525d6 vmlinux-new.o
25523192        4388212  808452 30719856        1d4bf70 vmlinux-old.o

that increases by 25.7k (0.103%), due to 1578 rdgsbase altinstructions
that are placed in the text section. The increase in text-size is not
"real" - the 'rdgsbase' instruction should be smaller than a 'mov %gs';
binary size increases because we obviously have two instructions, but
the actual *executable* part likely stays the same, and it's just that
we grow the altinstruction metadata.

Sean says:

"A significant percentage of data accesses in Intel's TDX-Module[*] use
this pattern, e.g. even global data is relative to GS.base in the module
due its rather odd and restricted environment.  Back in the early days
of TDX, the module used RD{FS,GS}BASE instead of prefixes to get
pointers to per-CPU and global data structures in the TDX-Module.  It's
been a few years so I forget the exact numbers, but at the time a single
transition between guest and host would have something like ~100 reads
of FS.base or GS.base.  Switching from RD{FS,GS}BASE to prefixed accesses
reduced the latency for a guest<->host transition through the TDX-Module
by several thousand cycles, as every RD{FS,GS}BASE had a latency of
~18 cycles (again, going off 3+ year old memories).

The TDX-Module code is pretty much a pathological worth case scenario,
but I suspect its usage is very similar to most usage of raw_cpu_ptr(),
e.g. get a pointer to some data structure and then do multiple
reads/writes from/to that data structure.

The other wrinkle with RD{FS,FS}GSBASE is that they are trivially easy
to emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's
not supported by hardware, e.g. to migrate VMs to older hardware, then
every RDGSBASE will end up taking a few thousand cycles
(#UD -> VM-Exit -> emulate).  I would be surprised if any hypervisor
actually does this as it would be easier/smarter to simply not advertise
FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
doesn't support emulating RD{FS,GS}BASE.  But at the same time, the whole
reason I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was
because I had hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM
TDX development on older hardware.  I.e. it's not impossible that this
code could run on hardware where RDGSBASE is emulated in software.

{RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR,
not to accelerate actual accesses to per-CPU data, TLS, etc.  E.g.
loading a 64-bit base via a MOV to FS/GS is impossible.  And presumably
saving a userspace controlled by actually accessing FS/GS is dangerous
for one reason or another.

The instructions are guarded by a CR4 bit, the ucode cost just to check
CR4.FSGSBASE is probably non-trivial."

Cc: Sean Christopherson <seanjc@google.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 54746903b8c3..e047a0bc5554 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,36 +49,31 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)

-#ifdef CONFIG_USE_X86_SEG_SUPPORT
-/*
- * Efficient implementation for cases in which the compiler supports
- * named address spaces.  Allows the compiler to perform additional
- * optimizations that can save more instructions.
- */
+#ifdef CONFIG_X86_64
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	tcp_ptr__ = __raw_cpu_read(, this_cpu_off);		\
+	asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0",	\
+			 "rdgsbase %0",				\
+			 X86_FEATURE_FSGSBASE)			\
+	     : "=r" (tcp_ptr__)					\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
 								\
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
-#else /* CONFIG_USE_X86_SEG_SUPPORT */
-/*
- * Compared to the generic __my_cpu_offset version, the following
- * saves one instruction and avoids clobbering a temp register.
- */
+#else /* CONFIG_X86_64 */
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	asm ("mov " __percpu_arg(1) ", %0"			\
+	asm ("movl " __percpu_arg(1) ", %0"			\
 	     : "=r" (tcp_ptr__)					\
 	     : "m" (__my_cpu_var(this_cpu_off)));		\
 								\
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
-#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+#endif /* CONFIG_X86_64 */

 #else /* CONFIG_SMP */
 #define __percpu_seg_override
-- 
2.41.0

next prev parent reply	other threads:[~2023-10-15 20:25 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
2023-10-16 11:09   ` [tip: x86/percpu] x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation tip-bot2 for Uros Bizjak
2023-10-15 20:24 ` Uros Bizjak [this message]
2023-10-16 11:14   ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Ingo Molnar
2023-10-16 19:29     ` Sean Christopherson
2023-10-16 19:54       ` Linus Torvalds
2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize tip-bot2 for Uros Bizjak

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:54746903b8c dfblob:e047a0bc555 )
 OR (
bs:"[PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231015202523.189168-3-ubizjak@gmail.com \
    --to=ubizjak@gmail.com \
    --cc=brgerst@gmail.com \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jpoimboe@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@kernel.org \
    --cc=namit@vmware.com \
    --cc=peterz@infradead.org \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox