[PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr()
@ 2023-10-15 20:24 Uros Bizjak
  2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Uros Bizjak @ 2023-10-15 20:24 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf

Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
add the ptr value to the base. This way, the compiler can propagate
addend to the following instruction and simplify address calculation.

E.g.: address calcuation in amd_pmu_enable_virt() improves from:

    48 c7 c0 00 00 00 00 	mov    $0x0,%rax
	87b7: R_X86_64_32S	cpu_hw_events

    65 48 03 05 00 00 00 	add    %gs:0x0(%rip),%rax
    00
	87bf: R_X86_64_PC32	this_cpu_off-0x4

    48 c7 80 28 13 00 00 	movq   $0x0,0x1328(%rax)
    00 00 00 00

to:

    65 48 8b 05 00 00 00 	mov    %gs:0x0(%rip),%rax
    00
	8798: R_X86_64_PC32	this_cpu_off-0x4
    48 c7 80 00 00 00 00 	movq   $0x0,0x0(%rax)
    00 00 00 00
	87a6: R_X86_64_32S	cpu_hw_events+0x1328

The compiler also eliminates additional redundant loads from
this_cpu_off, reducing the number of percpu offset reads
from 1668 to 1646.

Cc: Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..915675f3ad60 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -56,9 +56,11 @@
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	asm ("add " __percpu_arg(1) ", %0"			\
+	asm ("mov " __percpu_arg(1) ", %0"			\
 	     : "=r" (tcp_ptr__)					\
-	     : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));	\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
 #else /* CONFIG_SMP */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
@ 2023-10-15 20:24 ` Uros Bizjak
  2023-10-16 11:09   ` [tip: x86/percpu] x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation tip-bot2 for Uros Bizjak
  2023-10-15 20:24 ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Uros Bizjak
  2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize tip-bot2 for Uros Bizjak
  2 siblings, 1 reply; 8+ messages in thread
From: Uros Bizjak @ 2023-10-15 20:24 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H . Peter Anvin, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Josh Poimboeuf, Nadav Amit

Implement arch_raw_cpu_ptr() in C to allow the compiler to perform
better optimizations, such as setting an appropriate base to compute
the address. The compiler is free to choose either MOV or ADD from
this_cpu_off address to construct the optimal final address.

There are some other issues when memory access to the percpu area is
implemented with an asm. Compilers can not eliminate asm common
subexpressions over basic block boundaries, but are extremely good
at optimizing memory access. By implementing arch_raw_cpu_ptr() in C,
the compiler can eliminate additional redundant loads from this_cpu_off,
further reducing the number of percpu offset reads from 1646 to 1631.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Co-developed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
v2: Use CONFIG_USE_X86_SEG_SUPPORT to handle
    cases where KASAN is enabled.
v3: Split the patch to the part that introduces MOV with
    tcp_ptr__ += (unsigned long)(ptr) and the part that
    introduces named address spaces.
---
 arch/x86/include/asm/percpu.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 915675f3ad60..54746903b8c3 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,6 +49,21 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+/*
+ * Efficient implementation for cases in which the compiler supports
+ * named address spaces.  Allows the compiler to perform additional
+ * optimizations that can save more instructions.
+ */
+#define arch_raw_cpu_ptr(ptr)					\
+({								\
+	unsigned long tcp_ptr__;				\
+	tcp_ptr__ = __raw_cpu_read(, this_cpu_off);		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
+})
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
@@ -63,6 +78,8 @@
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [tip: x86/percpu] x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation
  2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
@ 2023-10-16 11:09   ` tip-bot2 for Uros Bizjak
  0 siblings, 0 replies; 8+ messages in thread
From: tip-bot2 for Uros Bizjak @ 2023-10-16 11:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Nadav Amit, Uros Bizjak, Ingo Molnar, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Josh Poimboeuf, Sean Christopherson, x86, linux-kernel

The following commit has been merged into the x86/percpu branch of tip:

Commit-ID:     1d10f3aec2bb734b4b594afe8c1bd0aa656a7e4d
Gitweb:        https://git.kernel.org/tip/1d10f3aec2bb734b4b594afe8c1bd0aa656a7e4d
Author:        Uros Bizjak <ubizjak@gmail.com>
AuthorDate:    Sun, 15 Oct 2023 22:24:40 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 16 Oct 2023 12:52:02 +02:00

x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation

Implement arch_raw_cpu_ptr() in C to allow the compiler to perform
better optimizations, such as setting an appropriate base to compute
the address. The compiler is free to choose either MOV or ADD from
this_cpu_off address to construct the optimal final address.

There are some other issues when memory access to the percpu area is
implemented with an asm. Compilers can not eliminate asm common
subexpressions over basic block boundaries, but are extremely good
at optimizing memory access. By implementing arch_raw_cpu_ptr() in C,
the compiler can eliminate additional redundant loads from this_cpu_off,
further reducing the number of percpu offset reads from 1646 to 1631
on a test build, a -0.9% reduction.

Co-developed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20231015202523.189168-2-ubizjak@gmail.com
---
 arch/x86/include/asm/percpu.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 915675f..5474690 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,6 +49,21 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+/*
+ * Efficient implementation for cases in which the compiler supports
+ * named address spaces.  Allows the compiler to perform additional
+ * optimizations that can save more instructions.
+ */
+#define arch_raw_cpu_ptr(ptr)					\
+({								\
+	unsigned long tcp_ptr__;				\
+	tcp_ptr__ = __raw_cpu_read(, this_cpu_off);		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
+})
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
@@ -63,6 +78,8 @@
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
  2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
  2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
@ 2023-10-15 20:24 ` Uros Bizjak
  2023-10-16 11:14   ` Ingo Molnar
  2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize tip-bot2 for Uros Bizjak
  2 siblings, 1 reply; 8+ messages in thread
From: Uros Bizjak @ 2023-10-15 20:24 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Sean Christopherson, Nadav Amit, Ingo Molnar,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf

*EXPERIMENTAL PATCH, NOT FOR MERGE*

The patch introduces 'rdgsbase' alternative for CPUs with
X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
only decoding in the first decoder etc. But we're talking single-cycle
kind of effects, and the rdgsbase case should be much better from
a cache perspective and might use fewer memory pipeline resources to
offset the fact that it uses an unusual front end decoder resource...

The drawback of the patch is also larger binary size:

   text    data     bss     dec     hex filename
25549518        4388100  808452 30746070        1d525d6 vmlinux-new.o
25523192        4388212  808452 30719856        1d4bf70 vmlinux-old.o

that increases by 25.7k (0.103%), due to 1578 rdgsbase altinstructions
that are placed in the text section. The increase in text-size is not
"real" - the 'rdgsbase' instruction should be smaller than a 'mov %gs';
binary size increases because we obviously have two instructions, but
the actual *executable* part likely stays the same, and it's just that
we grow the altinstruction metadata.

Sean says:

"A significant percentage of data accesses in Intel's TDX-Module[*] use
this pattern, e.g. even global data is relative to GS.base in the module
due its rather odd and restricted environment.  Back in the early days
of TDX, the module used RD{FS,GS}BASE instead of prefixes to get
pointers to per-CPU and global data structures in the TDX-Module.  It's
been a few years so I forget the exact numbers, but at the time a single
transition between guest and host would have something like ~100 reads
of FS.base or GS.base.  Switching from RD{FS,GS}BASE to prefixed accesses
reduced the latency for a guest<->host transition through the TDX-Module
by several thousand cycles, as every RD{FS,GS}BASE had a latency of
~18 cycles (again, going off 3+ year old memories).

The TDX-Module code is pretty much a pathological worth case scenario,
but I suspect its usage is very similar to most usage of raw_cpu_ptr(),
e.g. get a pointer to some data structure and then do multiple
reads/writes from/to that data structure.

The other wrinkle with RD{FS,FS}GSBASE is that they are trivially easy
to emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's
not supported by hardware, e.g. to migrate VMs to older hardware, then
every RDGSBASE will end up taking a few thousand cycles
(#UD -> VM-Exit -> emulate).  I would be surprised if any hypervisor
actually does this as it would be easier/smarter to simply not advertise
FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
doesn't support emulating RD{FS,GS}BASE.  But at the same time, the whole
reason I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was
because I had hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM
TDX development on older hardware.  I.e. it's not impossible that this
code could run on hardware where RDGSBASE is emulated in software.

{RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR,
not to accelerate actual accesses to per-CPU data, TLS, etc.  E.g.
loading a 64-bit base via a MOV to FS/GS is impossible.  And presumably
saving a userspace controlled by actually accessing FS/GS is dangerous
for one reason or another.

The instructions are guarded by a CR4 bit, the ucode cost just to check
CR4.FSGSBASE is probably non-trivial."

Cc: Sean Christopherson <seanjc@google.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 54746903b8c3..e047a0bc5554 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,36 +49,31 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)

-#ifdef CONFIG_USE_X86_SEG_SUPPORT
-/*
- * Efficient implementation for cases in which the compiler supports
- * named address spaces.  Allows the compiler to perform additional
- * optimizations that can save more instructions.
- */
+#ifdef CONFIG_X86_64
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	tcp_ptr__ = __raw_cpu_read(, this_cpu_off);		\
+	asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0",	\
+			 "rdgsbase %0",				\
+			 X86_FEATURE_FSGSBASE)			\
+	     : "=r" (tcp_ptr__)					\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
 								\
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
-#else /* CONFIG_USE_X86_SEG_SUPPORT */
-/*
- * Compared to the generic __my_cpu_offset version, the following
- * saves one instruction and avoids clobbering a temp register.
- */
+#else /* CONFIG_X86_64 */
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	asm ("mov " __percpu_arg(1) ", %0"			\
+	asm ("movl " __percpu_arg(1) ", %0"			\
 	     : "=r" (tcp_ptr__)					\
 	     : "m" (__my_cpu_var(this_cpu_off)));		\
 								\
 	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
-#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+#endif /* CONFIG_X86_64 */

 #else /* CONFIG_SMP */
 #define __percpu_seg_override
-- 
2.41.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
  2023-10-15 20:24 ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Uros Bizjak
@ 2023-10-16 11:14   ` Ingo Molnar
  2023-10-16 19:29     ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2023-10-16 11:14 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Sean Christopherson, Nadav Amit,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf,
	Borislav Petkov


* Uros Bizjak <ubizjak@gmail.com> wrote:

> Sean says:
> 
> "A significant percentage of data accesses in Intel's TDX-Module[*] use
> this pattern, e.g. even global data is relative to GS.base in the module
> due its rather odd and restricted environment.  Back in the early days
> of TDX, the module used RD{FS,GS}BASE instead of prefixes to get
> pointers to per-CPU and global data structures in the TDX-Module.  It's
> been a few years so I forget the exact numbers, but at the time a single
> transition between guest and host would have something like ~100 reads
> of FS.base or GS.base.  Switching from RD{FS,GS}BASE to prefixed accesses
> reduced the latency for a guest<->host transition through the TDX-Module
> by several thousand cycles, as every RD{FS,GS}BASE had a latency of
> ~18 cycles (again, going off 3+ year old memories).
> 
> The TDX-Module code is pretty much a pathological worth case scenario,
> but I suspect its usage is very similar to most usage of raw_cpu_ptr(),
> e.g. get a pointer to some data structure and then do multiple
> reads/writes from/to that data structure.
> 
> The other wrinkle with RD{FS,FS}GSBASE is that they are trivially easy

[ Obsessive-compulsive nitpicking:

     s/RD{FS,FS}GSBASE
      /RD{FS,GS}BASE
]

> to emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's
> not supported by hardware, e.g. to migrate VMs to older hardware, then
> every RDGSBASE will end up taking a few thousand cycles
> (#UD -> VM-Exit -> emulate).  I would be surprised if any hypervisor
> actually does this as it would be easier/smarter to simply not advertise
> FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
> doesn't support emulating RD{FS,GS}BASE.  But at the same time, the whole
> reason I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was
> because I had hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM
> TDX development on older hardware.  I.e. it's not impossible that this
> code could run on hardware where RDGSBASE is emulated in software.
> 
> {RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR,
> not to accelerate actual accesses to per-CPU data, TLS, etc.  E.g.
> loading a 64-bit base via a MOV to FS/GS is impossible.  And presumably
> saving a userspace controlled by actually accessing FS/GS is dangerous
> for one reason or another.
> 
> The instructions are guarded by a CR4 bit, the ucode cost just to check
> CR4.FSGSBASE is probably non-trivial."

BTW., a side note regarding the very last paragraph and the CR4 bit ucode 
cost, given that SMAP is CR4 controlled too:

 #define X86_CR4_FSGSBASE_BIT    16 /* enable RDWRFSGS support */
 #define X86_CR4_FSGSBASE        _BITUL(X86_CR4_FSGSBASE_BIT)
 ...
 #define X86_CR4_SMAP_BIT        21 /* enable SMAP support */
 #define X86_CR4_SMAP            _BITUL(X86_CR4_SMAP_BIT)

And this modifies the behavior of STAC/CLAC, of which we have ~300 
instances in a defconfig kernel image:

  kepler:~/tip> objdump -wdr vmlinux | grep -w 'stac' x | wc  -l
  119

  kepler:~/tip> objdump -wdr vmlinux | grep -w 'clac' x | wc  -l
  188

Are we certain that ucode on modern x86 CPUs check CR4 for every affected 
instruction?

Could they perhaps use something faster, such as internal 
microcode-patching (is that a thing?), to turn support for certain 
instructions on/off when the relevant CR4 bit is modified, without
having to genuinely access CR4 for every instruction executed?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
  2023-10-16 11:14   ` Ingo Molnar
@ 2023-10-16 19:29     ` Sean Christopherson
  2023-10-16 19:54       ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2023-10-16 19:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Borislav Petkov

On Mon, Oct 16, 2023, Ingo Molnar wrote:
> 
> * Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > Sean says:
> > The instructions are guarded by a CR4 bit, the ucode cost just to check
> > CR4.FSGSBASE is probably non-trivial."
> 
> BTW., a side note regarding the very last paragraph and the CR4 bit ucode 
> cost, given that SMAP is CR4 controlled too:
> 
>  #define X86_CR4_FSGSBASE_BIT    16 /* enable RDWRFSGS support */
>  #define X86_CR4_FSGSBASE        _BITUL(X86_CR4_FSGSBASE_BIT)
>  ...
>  #define X86_CR4_SMAP_BIT        21 /* enable SMAP support */
>  #define X86_CR4_SMAP            _BITUL(X86_CR4_SMAP_BIT)
> 
> And this modifies the behavior of STAC/CLAC, of which we have ~300 
> instances in a defconfig kernel image:
> 
>   kepler:~/tip> objdump -wdr vmlinux | grep -w 'stac' x | wc  -l
>   119
> 
>   kepler:~/tip> objdump -wdr vmlinux | grep -w 'clac' x | wc  -l
>   188
> 
> Are we certain that ucode on modern x86 CPUs check CR4 for every affected 
> instruction?

Not certain at all.  I agree the CR4.FSGSBASE thing could be a complete non-issue
and was just me speculating.

> Could they perhaps use something faster, such as internal microcode-patching
> (is that a thing?), to turn support for certain instructions on/off when the
> relevant CR4 bit is modified, without having to genuinely access CR4 for
> every instruction executed?

I don't know the exact details, but Intel's VMRESUME ucode flow uses some form of
magic to skip consistency checks that aren't relevant for the current (or target)
mode, *without* using conditional branches.  So it's definitely possible/probable
that similar magic is used to expedite things like CPL and CR4 checks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
  2023-10-16 19:29     ` Sean Christopherson
@ 2023-10-16 19:54       ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2023-10-16 19:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ingo Molnar, Uros Bizjak, x86, linux-kernel, Nadav Amit,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin,
	Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 2473 bytes --]

On Mon, 16 Oct 2023 at 12:29, Sean Christopherson <seanjc@google.com> wrote:
>
> > Are we certain that ucode on modern x86 CPUs check CR4 for every affected
> > instruction?
>
> Not certain at all.  I agree the CR4.FSGSBASE thing could be a complete non-issue
> and was just me speculating.

Note that my timings on two fairly different arches do put the cost of
'rdgsbase' at 2 cycles, so it's not microcoded in the sense of jumping
off to some microcode sequence that has a noticeable overhead.

So it's almost certainly what Intel calls a "complex decoder" case
that generates up to 4 uops inline and only decodes in the first
decode slot.

One of the uops could easily be a cr4 check, that's not an uncommon
thing for those kinds of instructions.

If somebody wants to try my truly atrocious test program on other
machines, go right ahead. It's attached. I'm not proud of it. It's a
hack.

Do something like this:

    $ gcc -O2 t.c
    $ ./a.out
      "nop"=0l: 0.380925
      "nop"=0l: 0.380640
      "nop"=0l: 0.380373
      "mov %1,%0":"=r"(base):"m"(zero)=0l: 0.787984
      "rdgsbase %0":"=r"(base)=0l: 2.626625

and you'll see that a no-op takes about a third of a cycle on my Zen 2
core (according to this truly stupid benchmark). With some small
overhead.

And a "mov memory to register" shows up as ~3/4 cycle, but it's really
probably that the core can do two of them per cycle, and then the
chain of adds (see how that benchmark makes sure the result is "used")
adds some more overhead etc.

And the 'rdgsbase' is about two cycles, and presumably is fully
serialized, so all the loop overhead and adding results then shows up
as that extra .6 of a cycle on average.

But doing cycle estimations on OoO machines is "guess rough patterns",
so take all the above with a big pinch of salt. And feel free to test
it on other cores than the ones I did (Intel Skylake and and AMD Zen
2). You migth want to put your machine into "performance" mode or
other things to actually make it run at the highest frequency to get
more repeatable numbers.

The Skylake core does better on the nops (I think Intel gets rid of
them earlier in the decode stages and they basically disappear in the
uop cache), and can do three loads per cycle. So rdgsbase looks
relatively slower on my Skylake at about 3 cycles per op, but when you
look at an individual instruction, that's a fairly artificial thing.
You don't run these things in the uop cache in reality.

              Linus

[-- Attachment #2: t.c --]
[-- Type: text/x-c-code, Size: 1142 bytes --]

#include <stdio.h>

#define NR 100000000

#define LOOP(x) for(int i = 0; i < NR/16; i++) do { \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
	asm volatile(x); sum += base; \
} while (0)

static inline unsigned int rdtsc(void)
{
	unsigned int a,d;
	asm volatile("rdtsc":"=a"(a),"=d"(d)::"memory");
	return a;
}

#define TEST(x) do { \
	unsigned int s = rdtsc(); \
	LOOP(x); \
	s = rdtsc()-s; \
	fprintf(stderr, "  " #x "=%ul: %f\n", sum, s / (double)NR); \
} while (0)

int main(int argc, char **argv)
{
	unsigned long base = 0, sum = 0;
	unsigned long zero = 0;

	TEST("nop");
	TEST("nop");
	TEST("nop");
	TEST("mov %1,%0":"=r"(base):"m"(zero));
	TEST("rdgsbase %0":"=r"(base));
	return 0;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize
  2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
  2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
  2023-10-15 20:24 ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Uros Bizjak
@ 2023-10-16 11:09 ` tip-bot2 for Uros Bizjak
  2 siblings, 0 replies; 8+ messages in thread
From: tip-bot2 for Uros Bizjak @ 2023-10-16 11:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Uros Bizjak, Ingo Molnar, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Josh Poimboeuf,
	Sean Christopherson, x86, linux-kernel

The following commit has been merged into the x86/percpu branch of tip:

Commit-ID:     a048d3abae7c33f0a3f4575fab15ac5504d443f7
Gitweb:        https://git.kernel.org/tip/a048d3abae7c33f0a3f4575fab15ac5504d443f7
Author:        Uros Bizjak <ubizjak@gmail.com>
AuthorDate:    Sun, 15 Oct 2023 22:24:39 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 16 Oct 2023 12:51:58 +02:00

x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize

Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
add the ptr value to the base. This way, the compiler can propagate
addend to the following instruction and simplify address calculation.

E.g.: address calcuation in amd_pmu_enable_virt() improves from:

    48 c7 c0 00 00 00 00	mov    $0x0,%rax
	87b7: R_X86_64_32S	cpu_hw_events

    65 48 03 05 00 00 00	add    %gs:0x0(%rip),%rax
    00
	87bf: R_X86_64_PC32	this_cpu_off-0x4

    48 c7 80 28 13 00 00	movq   $0x0,0x1328(%rax)
    00 00 00 00

to:

    65 48 8b 05 00 00 00	mov    %gs:0x0(%rip),%rax
    00
	8798: R_X86_64_PC32	this_cpu_off-0x4
    48 c7 80 00 00 00 00	movq   $0x0,0x0(%rax)
    00 00 00 00
	87a6: R_X86_64_32S	cpu_hw_events+0x1328

The compiler also eliminates additional redundant loads from
this_cpu_off, reducing the number of percpu offset reads
from 1668 to 1646 on a test build, a -1.3% reduction.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20231015202523.189168-1-ubizjak@gmail.com
---
 arch/x86/include/asm/percpu.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea775..915675f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -56,9 +56,11 @@
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	asm ("add " __percpu_arg(1) ", %0"			\
+	asm ("mov " __percpu_arg(1) ", %0"			\
 	     : "=r" (tcp_ptr__)					\
-	     : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));	\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
 #else /* CONFIG_SMP */

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-10-16 19:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
2023-10-16 11:09   ` [tip: x86/percpu] x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation tip-bot2 for Uros Bizjak
2023-10-15 20:24 ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Uros Bizjak
2023-10-16 11:14   ` Ingo Molnar
2023-10-16 19:29     ` Sean Christopherson
2023-10-16 19:54       ` Linus Torvalds
2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize tip-bot2 for Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox