* [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace
@ 2024-08-07 5:47 Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
This patch set moves the write of MSR_IA32_FRED_RSP0 from context
switch to return to userspace.
In the discussion of saving/restoring host/guest FRED RSP0 for KVM,
Sean proposed having the kernel write MSR_IA32_FRED_RSP0 on return to
userspace, i.e., in arch_exit_to_user_mode_prepare(), instead of on
context switch. [1]
hpa suggested testing ti_work for zero before processing its individual
bits in arch_exit_to_user_mode_prepare(). A quick measurement shows
that in most cases the ti_work value passed to
arch_exit_to_user_mode_prepare() is zero, e.g., 99% of the time in
kernel build tests. The zero-test change was then run through Intel
0day testing, and no performance regression was reported.
Besides, per the discussion of how to write MSR_IA32_FRED_RSP0 with the
introduction of WRMSRNS [2], use the alternatives mechanism to choose
WRMSRNS when it's available, and otherwise fall back to WRMSR.
[1] https://lore.kernel.org/lkml/ZpkfkSMPiXrS9r2K@google.com/
[2] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
Andrew Cooper (1):
x86/msr: Switch between WRMSRNS and WRMSR with the alternatives
mechanism
Xin Li (Intel) (2):
x86/entry: Test ti_work for zero before processing individual bits
x86/entry: Set FRED RSP0 on return to userspace instead of context
switch
arch/x86/include/asm/entry-common.h | 21 ++++++++++++++-------
arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
arch/x86/include/asm/switch_to.h | 3 +--
arch/x86/include/asm/thread_info.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 -
5 files changed, 31 insertions(+), 24 deletions(-)
base-commit: 9ebdc7589cbb5c976e6c8807cbe13f263d70d32c
--
2.45.2
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel)
@ 2024-08-07 5:47 ` Xin Li (Intel)
2024-08-07 16:21 ` Brian Gerst
2024-08-07 18:08 ` Thomas Gleixner
2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
2 siblings, 2 replies; 37+ messages in thread
From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
are zeros, e.g., 99% in kernel build tests. So an obvious optimization
is to test ti_work for zero before processing individual bits in it.
In addition, Intel 0day tests find no perf regression with this change.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
arch/x86/include/asm/entry-common.h | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index fb2809b20b0a..4c78b99060b5 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
unsigned long ti_work)
{
- if (ti_work & _TIF_USER_RETURN_NOTIFY)
- fire_user_return_notifiers();
+ if (unlikely(ti_work)) {
+ if (ti_work & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
- if (unlikely(ti_work & _TIF_IO_BITMAP))
- tss_update_io_bitmap();
+ if (unlikely(ti_work & _TIF_IO_BITMAP))
+ tss_update_io_bitmap();
- fpregs_assert_state_consistent();
- if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
- switch_fpu_return();
+ fpregs_assert_state_consistent();
+ if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+ switch_fpu_return();
+ }
#ifdef CONFIG_COMPAT
/*
--
2.45.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
@ 2024-08-07 5:47 ` Xin Li (Intel)
2024-08-07 18:11 ` Thomas Gleixner
2024-08-09 23:07 ` Andrew Cooper
2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
2 siblings, 2 replies; 37+ messages in thread
From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
From: Andrew Cooper <andrew.cooper3@citrix.com>
Per the discussion about FRED MSR writes with the WRMSRNS instruction [1],
use the alternatives mechanism to choose WRMSRNS when it's available,
otherwise fall back to WRMSR.
[1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index d642037f9ed5..3e402d717815 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
: : "c" (msr), "a"(low), "d" (high) : "memory");
}
-/*
- * WRMSRNS behaves exactly like WRMSR with the only difference being
- * that it is not a serializing instruction by default.
- */
-static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
-{
- /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
- asm volatile("1: .byte 0x0f,0x01,0xc6\n"
- "2:\n"
- _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
- : : "c" (msr), "a"(low), "d" (high));
-}
-
#define native_rdmsr(msr, val1, val2) \
do { \
u64 __val = __rdmsr((msr)); \
@@ -312,9 +299,22 @@ do { \
#endif /* !CONFIG_PARAVIRT_XXL */
+/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
+#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
+
+/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
static __always_inline void wrmsrns(u32 msr, u64 val)
{
- __wrmsrns(msr, val, val >> 32);
+ /*
+ * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
+ * DS prefix to avoid a trailing NOP.
+ */
+ asm volatile("1: "
+ ALTERNATIVE("ds wrmsr",
+ WRMSRNS, X86_FEATURE_WRMSRNS)
+ "2:\n"
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+ : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)));
}
/*
--
2.45.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel)
@ 2024-08-07 5:47 ` Xin Li (Intel)
2024-08-07 18:39 ` Thomas Gleixner
2024-08-09 10:45 ` Nikolay Borisov
2 siblings, 2 replies; 37+ messages in thread
From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
FRED RSP0 is a per-task constant pointing to the top of the task's
kernel stack for user level event delivery, and needs to be updated
when a task is scheduled in.

Introduce a new TI flag, TIF_LOAD_USER_STATES, to track whether FRED
RSP0 needs to be loaded, and do the actual load of FRED RSP0 in
arch_exit_to_user_mode_prepare() if the TI flag is set, thus avoiding
a fair number of WRMSRs in both KVM and the kernel.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
arch/x86/include/asm/entry-common.h | 5 +++++
arch/x86/include/asm/switch_to.h | 3 +--
arch/x86/include/asm/thread_info.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 -
4 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 4c78b99060b5..ae365579efb3 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
if (ti_work & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
+ if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+ (ti_work & _TIF_LOAD_USER_STATES))
+ wrmsrns(MSR_IA32_FRED_RSP0,
+ (unsigned long)task_stack_page(current) + THREAD_SIZE);
+
if (unlikely(ti_work & _TIF_IO_BITMAP))
tss_update_io_bitmap();
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index c3bd0c0758c9..a31ea544cc0e 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task)
this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
#else
if (cpu_feature_enabled(X86_FEATURE_FRED)) {
- /* WRMSRNS is a baseline feature for FRED. */
- wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
+ set_thread_flag(TIF_LOAD_USER_STATES);
} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
/* Xen PV enters the kernel on the thread stack. */
load_sp0(task_top_of_stack(task));
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef1..fb51904651c0 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -106,6 +106,7 @@ struct thread_info {
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
+#define TIF_LOAD_USER_STATES 30 /* Load user level states */
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
@@ -128,6 +129,7 @@ struct thread_info {
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
+#define _TIF_LOAD_USER_STATES (1 << TIF_LOAD_USER_STATES)
/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW_BASE \
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index b7d9f530ae16..8bd84114c2d9 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -83,7 +83,6 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
{ X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{ X86_FEATURE_FRED, X86_FEATURE_LKGS },
- { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS },
{}
};
--
2.45.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
@ 2024-08-07 16:21 ` Brian Gerst
2024-08-07 23:03 ` Xin Li
2024-08-07 18:08 ` Thomas Gleixner
1 sibling, 1 reply; 37+ messages in thread
From: Brian Gerst @ 2024-08-07 16:21 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, peterz,
andrew.cooper3, seanjc
On Wed, Aug 7, 2024 at 1:51 AM Xin Li (Intel) <xin@zytor.com> wrote:
>
> In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
> are zeros, e.g., 99% in kernel build tests. So an obvious optimization
> is to test ti_work for zero before processing individual bits in it.
>
> In addition, Intel 0day tests find no perf regression with this change.
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
> arch/x86/include/asm/entry-common.h | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
> index fb2809b20b0a..4c78b99060b5 100644
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
> static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> unsigned long ti_work)
> {
> - if (ti_work & _TIF_USER_RETURN_NOTIFY)
> - fire_user_return_notifiers();
> + if (unlikely(ti_work)) {
> + if (ti_work & _TIF_USER_RETURN_NOTIFY)
> + fire_user_return_notifiers();
>
> - if (unlikely(ti_work & _TIF_IO_BITMAP))
> - tss_update_io_bitmap();
> + if (unlikely(ti_work & _TIF_IO_BITMAP))
> + tss_update_io_bitmap();
>
> - fpregs_assert_state_consistent();
> - if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
> - switch_fpu_return();
> + fpregs_assert_state_consistent();
This call was originally unconditional, and does nothing if
TIF_NEED_FPU_LOAD is set.
> + if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
> + switch_fpu_return();
> + }
>
> #ifdef CONFIG_COMPAT
> /*
> --
> 2.45.2
>
>
Brian Gerst
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
2024-08-07 16:21 ` Brian Gerst
@ 2024-08-07 18:08 ` Thomas Gleixner
2024-08-07 18:21 ` Thomas Gleixner
2024-08-07 23:01 ` Xin Li
1 sibling, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 18:08 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc
On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
> In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
> are zeros, e.g., 99% in kernel build tests. So an obvious optimization
> is to test ti_work for zero before processing individual bits in it.
>
> In addition, Intel 0day tests find no perf regression with this change.
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
> arch/x86/include/asm/entry-common.h | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
> index fb2809b20b0a..4c78b99060b5 100644
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
> static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> unsigned long ti_work)
> {
> - if (ti_work & _TIF_USER_RETURN_NOTIFY)
> - fire_user_return_notifiers();
> + if (unlikely(ti_work)) {
> + if (ti_work & _TIF_USER_RETURN_NOTIFY)
> + fire_user_return_notifiers();
>
> - if (unlikely(ti_work & _TIF_IO_BITMAP))
> - tss_update_io_bitmap();
> + if (unlikely(ti_work & _TIF_IO_BITMAP))
> + tss_update_io_bitmap();
>
> - fpregs_assert_state_consistent();
Please keep this unconditional and independent of ti_work. It's a debug
feature and you kill coverage by making it conditional on ti_work.
Thanks,
tglx
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel)
@ 2024-08-07 18:11 ` Thomas Gleixner
2024-08-09 23:07 ` Andrew Cooper
1 sibling, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 18:11 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc
On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
>
> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
> +
> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
> static __always_inline void wrmsrns(u32 msr, u64 val)
> {
> - __wrmsrns(msr, val, val >> 32);
> + /*
> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
> + * DS prefix to avoid a trailing NOP.
> + */
> + asm volatile("1: "
> + ALTERNATIVE("ds wrmsr",
> + WRMSRNS, X86_FEATURE_WRMSRNS)
Please get rid of this horrible line break. You have 100 characters line width.
Thanks,
tglx
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 18:08 ` Thomas Gleixner
@ 2024-08-07 18:21 ` Thomas Gleixner
2024-08-07 23:01 ` Xin Li
1 sibling, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 18:21 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc
On Wed, Aug 07 2024 at 20:08, Thomas Gleixner wrote:
> On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
>> In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
>> are zeros, e.g., 99% in kernel build tests. So an obvious optimization
>> is to test ti_work for zero before processing individual bits in it.
>>
>> In addition, Intel 0day tests find no perf regression with this change.
>>
>> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>> ---
>> arch/x86/include/asm/entry-common.h | 16 +++++++++-------
>> 1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
>> index fb2809b20b0a..4c78b99060b5 100644
>> --- a/arch/x86/include/asm/entry-common.h
>> +++ b/arch/x86/include/asm/entry-common.h
>> @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
>> static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
>> unsigned long ti_work)
>> {
>> - if (ti_work & _TIF_USER_RETURN_NOTIFY)
>> - fire_user_return_notifiers();
>> + if (unlikely(ti_work)) {
>> + if (ti_work & _TIF_USER_RETURN_NOTIFY)
>> + fire_user_return_notifiers();
>>
>> - if (unlikely(ti_work & _TIF_IO_BITMAP))
>> - tss_update_io_bitmap();
>> + if (unlikely(ti_work & _TIF_IO_BITMAP))
>> + tss_update_io_bitmap();
>>
>> - fpregs_assert_state_consistent();
>
> Please keep this unconditional and independent of ti_work. It's a debug
> feature and you kill coverage by making it conditional on ti_work.
Also spare the extra indentation level and do:
static inline void arch_exit_work(unsigned long ti_work)
{
	if (ti_work & _TIF_USER_RETURN_NOTIFY)
		fire_user_return_notifiers();

	if (unlikely(ti_work & _TIF_IO_BITMAP))
		tss_update_io_bitmap();

	fpregs_assert_state_consistent();
	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
		switch_fpu_return();
}

static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
						  unsigned long ti_work)
{
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work))
		arch_exit_work(ti_work);

	...
Thanks,
tglx
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
@ 2024-08-07 18:39 ` Thomas Gleixner
2024-08-07 21:55 ` Sean Christopherson
2024-08-09 10:45 ` Nikolay Borisov
1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 18:39 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc
On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> if (ti_work & _TIF_USER_RETURN_NOTIFY)
> fire_user_return_notifiers();
>
> + if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> + (ti_work & _TIF_LOAD_USER_STATES))
> + wrmsrns(MSR_IA32_FRED_RSP0,
> + (unsigned long)task_stack_page(current) + THREAD_SIZE);
Eyes are bleeding. If you do the inline in patch 1 then this becomes
if (cpu_feature_enabled(X86_FEATURE_FRED) && (ti_work & _TIF_LOAD_USER_STATES))
wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(current) + THREAD_SIZE);
which is within the 100 character limit and readable.
Though I wonder if this should not have a per CPU cache for that.
if (cpu_feature_enabled(X86_FEATURE_FRED)) {
	unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;

	if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
		wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
		this_cpu_write(cpu_fred_rsp0, rsp0);
	}
}
Hmm?
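(For the snippet above to compile, a per-CPU cache variable along these
lines would be needed; cpu_fred_rsp0 is a name coined in this mail, not
an existing kernel symbol:)

/* Per-CPU cache of the last value written to MSR_IA32_FRED_RSP0. */
DECLARE_PER_CPU(unsigned long, cpu_fred_rsp0);	/* in a header */
DEFINE_PER_CPU(unsigned long, cpu_fred_rsp0);	/* in one .c file */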
> if (unlikely(ti_work & _TIF_IO_BITMAP))
> tss_update_io_bitmap();
>
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index c3bd0c0758c9..a31ea544cc0e 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task)
> this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
> #else
> if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> - /* WRMSRNS is a baseline feature for FRED. */
This comment is wrong when using the alternative WRMSR fallback and
should be removed by the alternatives patch.
> - wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
> + set_thread_flag(TIF_LOAD_USER_STATES);
Thanks,
tglx
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-07 18:39 ` Thomas Gleixner
@ 2024-08-07 21:55 ` Sean Christopherson
2024-08-07 23:04 ` Thomas Gleixner
0 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2024-08-07 21:55 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Xin Li (Intel), linux-kernel, mingo, bp, dave.hansen, x86, hpa,
peterz, andrew.cooper3
On Wed, Aug 07, 2024, Thomas Gleixner wrote:
> Though I wonder if this should not have a per CPU cache for that.
>
> if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> 	unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;
>
> 	if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
> 		wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
> 		this_cpu_write(cpu_fred_rsp0, rsp0);
> 	}
> }
>
> Hmm?
A per-CPU cache would work for KVM too; KVM would just need to stuff the cache
with the guest's value instead of setting _TIF_LOAD_USER_STATES.
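(A minimal sketch of the stuffing described here, assuming the
hypothetical cpu_fred_rsp0 cache above; the vcpu field name is
illustrative, not an existing KVM member:)

/* On VM-exit, the guest's FRED RSP0 value is still live in the MSR. */
this_cpu_write(cpu_fred_rsp0, vcpu->arch.fred_guest_rsp0);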
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 18:08 ` Thomas Gleixner
2024-08-07 18:21 ` Thomas Gleixner
@ 2024-08-07 23:01 ` Xin Li
1 sibling, 0 replies; 37+ messages in thread
From: Xin Li @ 2024-08-07 23:01 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel
Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc
On 8/7/2024 11:08 AM, Thomas Gleixner wrote:
>> - fpregs_assert_state_consistent();
> Please keep this unconditional and independent of ti_work. It's a debug
> feature and you kill coverage by making it conditional on ti_work.
Sigh, I'm an idiot.
Thanks!
Xin
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits
2024-08-07 16:21 ` Brian Gerst
@ 2024-08-07 23:03 ` Xin Li
0 siblings, 0 replies; 37+ messages in thread
From: Xin Li @ 2024-08-07 23:03 UTC (permalink / raw)
To: Brian Gerst
Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, peterz,
andrew.cooper3, seanjc
On 8/7/2024 9:21 AM, Brian Gerst wrote:
>> + fpregs_assert_state_consistent();
> This call was originally unconditional, and does nothing if
> TIF_NEED_FPU_LOAD is set.
lost my mind! Thanks!
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-07 21:55 ` Sean Christopherson
@ 2024-08-07 23:04 ` Thomas Gleixner
0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 23:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xin Li (Intel), linux-kernel, mingo, bp, dave.hansen, x86, hpa,
peterz, andrew.cooper3
On Wed, Aug 07 2024 at 14:55, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Thomas Gleixner wrote:
>> Though I wonder if this should not have a per CPU cache for that.
>>
>> if (cpu_feature_enabled(X86_FEATURE_FRED)) {
>> 	unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;
>>
>> 	if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
>> 		wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
>> 		this_cpu_write(cpu_fred_rsp0, rsp0);
>> 	}
>> }
>>
>> Hmm?
>
> A per-CPU cache would work for KVM too, KVM would just need to stuff the cache
> with the guest's value instead of setting _TIF_LOAD_USER_STATES.
There are two options vs. the cache:
1) Use a cache only

static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
						  unsigned long ti_work)
{
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work))
		arch_exit_work(ti_work);

	fred_rsp0_magic();

2) Use both the TIF bit and the cache

static inline void arch_exit_work(unsigned long ti_work)
{
	if (....)
		...

	fred_rsp0_magic();
}
#1 has the charm that it avoids arch_exit_work() completely when ti_work
is 0, but has the unconditional check on the per CPU cache variable
and the extra conditional on every return
#2 has the charm that it avoids touching the per CPU cache variable on
return when the TIF bit is not set, but comes with the downside of
going through the ti_work chain to update RSP0
I'm inclined to claim that #1 wins. Why?
The probability for a RSP0 update is higher than the probability for
ti_work != 0, and for syscall heavy workloads the extra per CPU read and
the conditional is not really relevant when we stick this into struct
pcpu_hot. That cacheline is hot anyway in that code path due to
'current' and other members there.
But that's a problem for micro benchmark experts to figure out. I'm not
one of them :)
In any case the cache should be a win over an unconditional MSR write,
independent of WRMSRNS.
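(For concreteness, a minimal sketch of the fred_rsp0_magic() helper
implied above, assuming the cache is a member of struct pcpu_hot; the
fred_rsp0 field is hypothetical:)

static __always_inline void fred_rsp0_magic(void)
{
	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
		unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

		/* Skip the MSR write when the cached value already matches. */
		if (__this_cpu_read(pcpu_hot.fred_rsp0) != rsp0) {
			wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
			this_cpu_write(pcpu_hot.fred_rsp0, rsp0);
		}
	}
}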
Thanks,
tglx
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
2024-08-07 18:39 ` Thomas Gleixner
@ 2024-08-09 10:45 ` Nikolay Borisov
2024-08-09 17:37 ` Xin Li
1 sibling, 1 reply; 37+ messages in thread
From: Nikolay Borisov @ 2024-08-09 10:45 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
On 7.08.24 г. 8:47 ч., Xin Li (Intel) wrote:
> FRED RSP0 is a per-task constant pointing to the top of the task's
> kernel stack for user level event delivery, and needs to be updated
> when a task is scheduled in.
>
> Introduce a new TI flag, TIF_LOAD_USER_STATES, to track whether FRED
> RSP0 needs to be loaded, and do the actual load of FRED RSP0 in
> arch_exit_to_user_mode_prepare() if the TI flag is set, thus avoiding
> a fair number of WRMSRs in both KVM and the kernel.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
> arch/x86/include/asm/entry-common.h | 5 +++++
> arch/x86/include/asm/switch_to.h | 3 +--
> arch/x86/include/asm/thread_info.h | 2 ++
> arch/x86/kernel/cpu/cpuid-deps.c | 1 -
> 4 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
> index 4c78b99060b5..ae365579efb3 100644
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> if (ti_work & _TIF_USER_RETURN_NOTIFY)
> fire_user_return_notifiers();
>
> + if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> + (ti_work & _TIF_LOAD_USER_STATES))
> + wrmsrns(MSR_IA32_FRED_RSP0,
> + (unsigned long)task_stack_page(current) + THREAD_SIZE);
> +
> if (unlikely(ti_work & _TIF_IO_BITMAP))
> tss_update_io_bitmap();
>
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index c3bd0c0758c9..a31ea544cc0e 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task)
> this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
> #else
> if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> - /* WRMSRNS is a baseline feature for FRED. */
> - wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
> + set_thread_flag(TIF_LOAD_USER_STATES);
> } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
> /* Xen PV enters the kernel on the thread stack. */
> load_sp0(task_top_of_stack(task));
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 12da7dfd5ef1..fb51904651c0 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -106,6 +106,7 @@ struct thread_info {
> #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
> #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
> #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
> +#define TIF_LOAD_USER_STATES 30 /* Load user level states */
Wouldn't something along the lines of TIF_LOAD_FRED_RSP be more
descriptive, or is it expected that this flag can cover more state in the
future?
>
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> @@ -128,6 +129,7 @@ struct thread_info {
> #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
> #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
> #define _TIF_ADDR32 (1 << TIF_ADDR32)
> +#define _TIF_LOAD_USER_STATES (1 << TIF_LOAD_USER_STATES)
>
> /* flags to check in __switch_to() */
> #define _TIF_WORK_CTXSW_BASE \
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index b7d9f530ae16..8bd84114c2d9 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -83,7 +83,6 @@ static const struct cpuid_dep cpuid_deps[] = {
> { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
> { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
> { X86_FEATURE_FRED, X86_FEATURE_LKGS },
> - { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS },
> {}
> };
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-09 10:45 ` Nikolay Borisov
@ 2024-08-09 17:37 ` Xin Li
2024-08-09 22:43 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Xin Li @ 2024-08-09 17:37 UTC (permalink / raw)
To: Nikolay Borisov, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
On 8/9/2024 3:45 AM, Nikolay Borisov wrote:
>> +#define TIF_LOAD_USER_STATES 30 /* Load user level states */
>
> Wouldn't something along the lines of TIF_LOAD_FRED_RSP be more
> descriptive, or is it expected that this flag can cover more state in the
> future?
Sean mentioned TIF_LOAD_FRED_RSP, however we also have FRED SSP0, which
needs to be handled similarly.
And we also want to use it to cover future states.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
2024-08-09 17:37 ` Xin Li
@ 2024-08-09 22:43 ` H. Peter Anvin
0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-09 22:43 UTC (permalink / raw)
To: Xin Li, Nikolay Borisov, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, andrew.cooper3, seanjc
On August 9, 2024 10:37:08 AM PDT, Xin Li <xin@zytor.com> wrote:
>On 8/9/2024 3:45 AM, Nikolay Borisov wrote:
>>> +#define TIF_LOAD_USER_STATES 30 /* Load user level states */
>>
>> Wouldn't something along the lines of TIF_LOAD_FRED_RSP be more descriptive, or is it expected that this flag can cover more state in the future?
>
>Sean mentioned TIF_LOAD_FRED_RSP, however we also have FRED SSP0, which
>needs to be handled similarly.
>
>And we also want to use it to cover future states.
It is very likely that there will be additional uses of this beyond FRED.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel)
2024-08-07 18:11 ` Thomas Gleixner
@ 2024-08-09 23:07 ` Andrew Cooper
2024-08-10 0:01 ` H. Peter Anvin
2024-08-16 17:52 ` Xin Li
1 sibling, 2 replies; 37+ messages in thread
From: Andrew Cooper @ 2024-08-09 23:07 UTC (permalink / raw)
To: Xin Li (Intel), linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc
On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
> From: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Per the discussion about FRED MSR writes with the WRMSRNS instruction [1],
> use the alternatives mechanism to choose WRMSRNS when it's available,
> otherwise fall back to WRMSR.
>
> [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
> arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index d642037f9ed5..3e402d717815 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
> : : "c" (msr), "a"(low), "d" (high) : "memory");
> }
>
> -/*
> - * WRMSRNS behaves exactly like WRMSR with the only difference being
> - * that it is not a serializing instruction by default.
> - */
> -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> -{
> - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> - asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> - "2:\n"
> - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> - : : "c" (msr), "a"(low), "d" (high));
> -}
> -
> #define native_rdmsr(msr, val1, val2) \
> do { \
> u64 __val = __rdmsr((msr)); \
> @@ -312,9 +299,22 @@ do { \
>
> #endif /* !CONFIG_PARAVIRT_XXL */
>
> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
> +
> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
> static __always_inline void wrmsrns(u32 msr, u64 val)
> {
> - __wrmsrns(msr, val, val >> 32);
> + /*
> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
> + * DS prefix to avoid a trailing NOP.
> + */
> + asm volatile("1: "
> + ALTERNATIVE("ds wrmsr",
This isn't the version I presented, and there's no discussion of the
alteration.
The choice of CS over DS was deliberate, and came from Intel:
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
So unless Intel want to retract that whitepaper, and all the binutils
work with it, I'd suggest keeping it as CS like we use elsewhere, and as
explicitly instructed by Intel.
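(For reference, the encodings being debated; WRMSR is 0f 30 and WRMSRNS
is 0f 01 c6, so either prefix pads WRMSR to the same 3 bytes:)

2e 0f 30	cs wrmsr
3e 0f 30	ds wrmsr
0f 01 c6	wrmsrns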
~Andrew
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-09 23:07 ` Andrew Cooper
@ 2024-08-10 0:01 ` H. Peter Anvin
2024-08-10 0:25 ` Andrew Cooper
2024-08-16 17:52 ` Xin Li
1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-10 0:01 UTC (permalink / raw)
To: Andrew Cooper, Xin Li (Intel), linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> Per the discussion about FRED MSR writes with the WRMSRNS instruction [1],
>> use the alternatives mechanism to choose WRMSRNS when it's available,
>> otherwise fall back to WRMSR.
>>
>> [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>> ---
>> arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
>> 1 file changed, 14 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>> index d642037f9ed5..3e402d717815 100644
>> --- a/arch/x86/include/asm/msr.h
>> +++ b/arch/x86/include/asm/msr.h
>> @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>> : : "c" (msr), "a"(low), "d" (high) : "memory");
>> }
>>
>> -/*
>> - * WRMSRNS behaves exactly like WRMSR with the only difference being
>> - * that it is not a serializing instruction by default.
>> - */
>> -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
>> -{
>> - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
>> - asm volatile("1: .byte 0x0f,0x01,0xc6\n"
>> - "2:\n"
>> - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>> - : : "c" (msr), "a"(low), "d" (high));
>> -}
>> -
>> #define native_rdmsr(msr, val1, val2) \
>> do { \
>> u64 __val = __rdmsr((msr)); \
>> @@ -312,9 +299,22 @@ do { \
>>
>> #endif /* !CONFIG_PARAVIRT_XXL */
>>
>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>> +
>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
>> static __always_inline void wrmsrns(u32 msr, u64 val)
>> {
>> - __wrmsrns(msr, val, val >> 32);
>> + /*
>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
>> + * DS prefix to avoid a trailing NOP.
>> + */
>> + asm volatile("1: "
>> + ALTERNATIVE("ds wrmsr",
>
>This isn't the version I presented, and there's no discussion of the
>alteration.
>
>The choice of CS over DS was deliberate, and came from Intel:
>
>https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
>So unless Intel want to retract that whitepaper, and all the binutils
>work with it, I'd suggest keeping it as CS like we use elsewhere, and as
>explicitly instructed by Intel.
>
>~Andrew
I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded.
We can't use cs in all cases, since you can't do a store to the code segment (always readonly) so we use 0x3e (DS) to patch out LOCK.
In the paper you describe, it only mentions 0x2e as a "benign prefix" in a specific example, not as any kind of specific recommendation. It is particularly irrelevant when it comes to padding two instructions to the same length, as the paper deals with alignment.
If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.)
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-10 0:01 ` H. Peter Anvin
@ 2024-08-10 0:25 ` Andrew Cooper
0 siblings, 0 replies; 37+ messages in thread
From: Andrew Cooper @ 2024-08-10 0:25 UTC (permalink / raw)
To: H. Peter Anvin, Xin Li (Intel), linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
On 10/08/2024 1:01 am, H. Peter Anvin wrote:
> On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>>
>>> Per the discussion about FRED MSR writes with the WRMSRNS instruction [1],
>>> use the alternatives mechanism to choose WRMSRNS when it's available,
>>> otherwise fall back to WRMSR.
>>>
>>> [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>>> ---
>>> arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
>>> 1 file changed, 14 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>>> index d642037f9ed5..3e402d717815 100644
>>> --- a/arch/x86/include/asm/msr.h
>>> +++ b/arch/x86/include/asm/msr.h
>>> @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>>> : : "c" (msr), "a"(low), "d" (high) : "memory");
>>> }
>>>
>>> -/*
>>> - * WRMSRNS behaves exactly like WRMSR with the only difference being
>>> - * that it is not a serializing instruction by default.
>>> - */
>>> -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
>>> -{
>>> - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
>>> - asm volatile("1: .byte 0x0f,0x01,0xc6\n"
>>> - "2:\n"
>>> - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>>> - : : "c" (msr), "a"(low), "d" (high));
>>> -}
>>> -
>>> #define native_rdmsr(msr, val1, val2) \
>>> do { \
>>> u64 __val = __rdmsr((msr)); \
>>> @@ -312,9 +299,22 @@ do { \
>>>
>>> #endif /* !CONFIG_PARAVIRT_XXL */
>>>
>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>> +
>>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>> {
>>> - __wrmsrns(msr, val, val >> 32);
>>> + /*
>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
>>> + * DS prefix to avoid a trailing NOP.
>>> + */
>>> + asm volatile("1: "
>>> + ALTERNATIVE("ds wrmsr",
>> This isn't the version I presented, and there's no discussion of the
>> alteration.
>>
>> The choice of CS over DS was deliberate, and came from Intel:
>>
>> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>>
>> So unless Intel want to retract that whitepaper, and all the binutils
>> work with it, I'd suggest keeping it as CS like we use elsewhere, and as
>> explicitly instructed by Intel.
>>
>> ~Andrew
> I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded.
>
> We can't use cs in all cases, since you can't do a store to the code segment (always readonly) so we use 0x3e (DS) to patch out LOCK.
>
> In the paper you describe, it only mentions 0x2e as a "benign prefix" in a specific example, not as any kind of specific recommendation. It is particularly irrelevant when it comes to padding two instructions to the same length, as the paper deals with alignment.
>
> If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.)
It would be lovely if there could be a single coherent statement.
In addition to store semantics, off the top of my head:
* CS is the P4 hint-not-taken (presumably Jcc only), ignored now.
* DS is both the P4 hint-taken (presumably Jcc only), newly reintroduced
in Redwood Cove with tweaked semantics (definitely Jcc only), and the
CET notrack prefix (JMP/CALL *IND only).
Plus whatever else I've missed[1].
~Andrew
[1] I'm ignoring XuCode mode memory operand semantics as not relevant to
CPL0 software (where AIUI unprefixed is physical and DS prefixed is
virtual) but I'm noting it here to highlight that there's definitely
extra complexity beyond what is in SDM Vol2.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-09 23:07 ` Andrew Cooper
2024-08-10 0:01 ` H. Peter Anvin
@ 2024-08-16 17:52 ` Xin Li
2024-08-16 18:40 ` Andrew Cooper
2024-08-17 14:23 ` Borislav Petkov
1 sibling, 2 replies; 37+ messages in thread
From: Xin Li @ 2024-08-16 17:52 UTC (permalink / raw)
To: Andrew Cooper, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc
On 8/9/2024 4:07 PM, Andrew Cooper wrote:
> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>> +
>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
>> static __always_inline void wrmsrns(u32 msr, u64 val)
>> {
>> - __wrmsrns(msr, val, val >> 32);
>> + /*
>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
>> + * DS prefix to avoid a trailing NOP.
>> + */
>> + asm volatile("1: "
>> + ALTERNATIVE("ds wrmsr",
>
> This isn't the version I presented, and there's no discussion of the
> alteration.
I'm trying to implement wrmsr() as
static __always_inline void wrmsr(u32 msr, u64 val)
{
	asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
					 "call asm_xen_write_msr", X86_FEATURE_XENPV)
		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
			 "D" (msr), "S" (val));
}
As the CALL instruction is 5 bytes long, and we need NOP padding for both
WRMSR and WRMSRNS, what about not using a segment prefix at all?
> The choice of CS over DS was deliberate, and came from Intel:
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> So unless Intel want to retract that whitepaper, and all the binutils
> work with it, I'd suggest keeping it as CS like we use elsewhere, and as
> explicitly instructed by Intel.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 17:52 ` Xin Li
@ 2024-08-16 18:40 ` Andrew Cooper
2024-08-16 19:18 ` H. Peter Anvin
2024-08-16 21:26 ` H. Peter Anvin
2024-08-17 14:23 ` Borislav Petkov
1 sibling, 2 replies; 37+ messages in thread
From: Andrew Cooper @ 2024-08-16 18:40 UTC (permalink / raw)
To: Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc
On 16/08/2024 6:52 pm, Xin Li wrote:
> On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>> +
>>> +/* Non-serializing WRMSR, when available. Falls back to a
>>> serializing WRMSR. */
>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>> {
>>> - __wrmsrns(msr, val, val >> 32);
>>> + /*
>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a
>>> redundant
>>> + * DS prefix to avoid a trailing NOP.
>>> + */
>>> + asm volatile("1: "
>>> + ALTERNATIVE("ds wrmsr",
>>
>> This isn't the version I presented, and there's no discussion of the
>> alteration.
>
> I'm trying to implement wrmsr() as
>
> static __always_inline void wrmsr(u32 msr, u64 val)
> {
> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS,
> X86_FEATURE_WRMSRNS,
> "call asm_xen_write_msr", X86_FEATURE_XENPV)
> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
> "D" (msr), "S" (val));
> }
>
>
> As the CALL instruction is 5 bytes long, and we need NOP padding for both
> WRMSR and WRMSRNS, what about not using a segment prefix at all?
The prefix was a minor optimisation to avoid having a trailing nop at
the end.
When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes
on WRMSRNS to make all options be 5 bytes long.
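(Concretely, keeping the DS padding used elsewhere in the thread, the
three alternatives would each be 5 bytes:)

3e 3e 3e 0f 30		wrmsr, three prefixes
3e 3e 0f 01 c6		wrmsrns, two prefixes
e8 xx xx xx xx		call asm_xen_write_msr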
That said, there's already a paravirt hook for this, and if you're
looking to work around the code gen mess for that, then doing it like
this by doubling up into rdi and rsi isn't great either.
My suggestion, not that I've had time to experiment, was to change
paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
edx:eax into rsi. That way the top level wrmsr() retains sensible
codegen for native even when paravirt is active.
~Andrew
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 18:40 ` Andrew Cooper
@ 2024-08-16 19:18 ` H. Peter Anvin
2024-08-16 21:45 ` Andrew Cooper
2024-08-16 21:26 ` H. Peter Anvin
1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-16 19:18 UTC (permalink / raw)
To: Andrew Cooper, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 16/08/2024 6:52 pm, Xin Li wrote:
>> On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>>> +
>>>> +/* Non-serializing WRMSR, when available. Falls back to a
>>>> serializing WRMSR. */
>>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>>> {
>>>> - __wrmsrns(msr, val, val >> 32);
>>>> + /*
>>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a
>>>> redundant
>>>> + * DS prefix to avoid a trailing NOP.
>>>> + */
>>>> + asm volatile("1: "
>>>> + ALTERNATIVE("ds wrmsr",
>>>
>>> This isn't the version I presented, and there's no discussion of the
>>> alteration.
>>
>> I'm trying to implement wrmsr() as
>>
>> static __always_inline void wrmsr(u32 msr, u64 val)
>> {
>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS,
>> X86_FEATURE_WRMSRNS,
>> "call asm_xen_write_msr", X86_FEATURE_XENPV)
>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>> "D" (msr), "S" (val));
>> }
>>
>>
>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>
>The prefix was a minor optimisation to avoid having a trailing nop at
>the end.
>
>When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes
>on WRMSRNS to make all options be 5 bytes long.
>
>That said, there's already a paravirt hook for this, and if you're
>looking to work around the code gen mess for that, then doing it like
>this by doubling up into rdi and rsi isn't great either.
>
>My suggestion, not that I've had time to experiment, was to change
>paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>edx:eax into rsi. That way the top level wrmsr() retains sensible
>codegen for native even when paravirt is active.
>
>~Andrew
Heh, that was my suggestion, except that I suggested defining it so rax passes the full value; the generation of edx is still necessary but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern, as the CALL instruction is longer.)
My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 18:40 ` Andrew Cooper
2024-08-16 19:18 ` H. Peter Anvin
@ 2024-08-16 21:26 ` H. Peter Anvin
2024-08-16 22:27 ` Andrew Cooper
2024-08-16 22:59 ` H. Peter Anvin
1 sibling, 2 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-16 21:26 UTC (permalink / raw)
To: Andrew Cooper, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
[-- Attachment #1: Type: text/plain, Size: 713 bytes --]
On 8/16/24 11:40, Andrew Cooper wrote:
>>
>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>
You can use up to 4 prefixes of any kind (which includes opcode prefixes
before 0F) before most decoders start hurting, so we can pad it out to 5
bytes by doing 3e 3e .. .. ..
>
> My suggestion, not that I've had time to experiment, was to change
> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
> edx:eax into rsi. That way the top level wrmsr() retains sensible
> codegen for native even when paravirt is active.
>
I have attached what should be an "obvious" example... famous last words.
-hpa
[-- Attachment #2: xen.S --]
[-- Type: text/plain, Size: 1692 bytes --]
/*
* Input in %rax, MSR number in %ecx
* %rdx is clobbered, as the native stub is assumed to do
*
* Change xen_do_write_msr to return its error code instead
* of passing a pointer - this gives the extra benefit that
* we can get the *actual invocation address* in the error
* messages.
*
* ex_handler_msr() should be changed to get the MSR data
* from %rax regardless of Xen vs native; alternatively the
* Xen handler can set up %edx as well.
*
* Let the native pattern look like:
*
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 3e 0f 30 ds wrmsr <--- trap point
*
* ... which can be replaced with ...
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 0f 01 c6 wrmsrns <--- trap point
*
* ... or ...
* e8 xx xx xx xx call asm_xen_write_msr
* 74 03 jz 1f
* 3e 0f 0b ds ud2 <--- trap point
* 1:
* ds ud2 can of course be replaced with any other 3-byte trapping
* instruction.
*
* This also removes the need for Xen to maintain different safe and
* unsafe MSR routines, as the difference is handled by the same
* trap handler as is used natively.
*/
SYM_FUNC_START(asm_xen_write_msr)
	FRAME_BEGIN
	push %rax		/* Save in case of error */
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %rax,%rdi
	mov %ecx,%esi
	call xen_do_write_msr
	test %eax,%eax		/* ZF=0 on error */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
#ifdef OPTIONAL
	mov 4(%rsp),%edx
#endif
	pop %rax
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_write_msr)
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 19:18 ` H. Peter Anvin
@ 2024-08-16 21:45 ` Andrew Cooper
0 siblings, 0 replies; 37+ messages in thread
From: Andrew Cooper @ 2024-08-16 21:45 UTC (permalink / raw)
To: H. Peter Anvin, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
On 16/08/2024 8:18 pm, H. Peter Anvin wrote:
> On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 16/08/2024 6:52 pm, Xin Li wrote:
>>> On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>>>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>>>> +
>>>>> +/* Non-serializing WRMSR, when available. Falls back to a
>>>>> serializing WRMSR. */
>>>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>>>> {
>>>>> - __wrmsrns(msr, val, val >> 32);
>>>>> + /*
>>>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a
>>>>> redundant
>>>>> + * DS prefix to avoid a trailing NOP.
>>>>> + */
>>>>> + asm volatile("1: "
>>>>> + ALTERNATIVE("ds wrmsr",
>>>> This isn't the version I presented, and there's no discussion of the
>>>> alteration.
>>> I'm trying to implement wrmsr() as
>>>
>>> static __always_inline void wrmsr(u32 msr, u64 val)
>>> {
>>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS,
>>> X86_FEATURE_WRMSRNS,
>>> "call asm_xen_write_msr", X86_FEATURE_XENPV)
>>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>>> "D" (msr), "S" (val));
>>> }
>>>
>>>
>>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>> The prefix was a minor optimisation to avoid having a trailing nop at
>> the end.
>>
>> When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes
>> on WRMSRNS to make all options be 5 bytes long.
>>
>> That said, there's already a paravirt hook for this, and if you're
>> looking to work around the code gen mess for that, then doing it like
>> this by doubling up into rdi and rsi isn't great either.
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>> ~Andrew
> Heh, that was my suggestion, except that I suggested defining it so that %rax passes the full value; generating %edx is still necessary, but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern as the CALL instruction is longer.)
Hmm yeah, having %rax hold the full value is likely how the logic is
going to look before %edx gets generated.
> My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial.
For historical reasons it's a mess.
xen_do_write_msr() does several things.
* Turns FSBASE/GSBASE/GSKERN into their respective hypercalls (although
no error checking at all!)
* Discards modifications to the SYSCALL/SYSENTER MSRs
Otherwise, it uses the native accessor, taking the #GP path into Xen,
where the instruction is emulated (including a full decode).
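In rough pseudo-C, the shape is something like this (a sketch from
memory, not the actual arch/x86/xen code; the hypercall wrapper name
is made up):
static bool xen_write_msr_emulated(u32 msr, u64 val)
{
	switch (msr) {
	case MSR_FS_BASE:
	case MSR_GS_BASE:
	case MSR_KERNEL_GS_BASE:
		/* Segment base hypercall; no error checking at all */
		xen_set_segment_base(msr, val);	/* made-up wrapper */
		return true;	/* handled, don't touch the real MSR */
	case MSR_STAR:
	case MSR_LSTAR:
	case MSR_CSTAR:
	case MSR_SYSCALL_MASK:
	case MSR_IA32_SYSENTER_CS:
	case MSR_IA32_SYSENTER_ESP:
	case MSR_IA32_SYSENTER_EIP:
		return true;	/* write silently discarded */
	default:
		return false;	/* native accessor; #GP goes to Xen */
	}
}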
Way back in the day, XenPV's paravirt wrmsr would swallow #GP and
pretend success, and the kernel then started depending on this
behaviour in order to boot. But times have moved on, even the normal
accessors are really safe+warn these days, and I'm sure all of this
could be cleaned up.
For ages I've been wanting to make a single PV_PRIV_OP hypercall so we
can skip the x86 emulator, but I've never had time to do that either.
~Andrew
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 21:26 ` H. Peter Anvin
@ 2024-08-16 22:27 ` Andrew Cooper
2024-08-16 22:34 ` Andrew Cooper
2024-08-16 22:59 ` H. Peter Anvin
1 sibling, 1 reply; 37+ messages in thread
From: Andrew Cooper @ 2024-08-16 22:27 UTC (permalink / raw)
To: H. Peter Anvin, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross
On 16/08/2024 10:26 pm, H. Peter Anvin wrote:
> On 8/16/24 11:40, Andrew Cooper wrote:
>>>
>>> As the CALL instruction is 5 bytes long, and we need to pad both
>>> WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>>
>
> You can use up to 4 prefixes of any kind (which includes opcode
> prefixes before 0F) before most decoders start hurting, so we can pad
> it out to 5 bytes by doing 3f 3f .. .. ..
>
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>
> I have attached what should be an "obvious" example... famous last words.
Ah, now I see what you mean about Xen's #GP semantics.
That's a neat way of doing it. It means the faulting path will really
take 2 faults on Xen, but it's a faulting path anyway so speed is
already out of the window.
Do you mind teaching the #UD handler to deal with WRMSR like that?
I ask, because I can't think of anything nicer.
There are plenty of 3-byte instructions which #GP in PV guests (CPL3),
and LTR is my go-to for debugging purposes, as it's not emulated by Xen.
Anything here (and it can't be an actual WRMSR) will be slightly
confusing to read in an OOPS, especially #UD for what is logically a #GP.
But, a clear UD of some form in the disassembly is probably better than
a random other instruction unrelated to the operation.
~Andrew
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 22:27 ` Andrew Cooper
@ 2024-08-16 22:34 ` Andrew Cooper
2024-08-16 23:03 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Andrew Cooper @ 2024-08-16 22:34 UTC (permalink / raw)
To: H. Peter Anvin, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross
On 16/08/2024 11:27 pm, Andrew Cooper wrote:
> On 16/08/2024 10:26 pm, H. Peter Anvin wrote:
>> On 8/16/24 11:40, Andrew Cooper wrote:
>>>> As the CALL instruction is 5 bytes long, and we need to pad both
>>>> WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>> You can use up to 4 prefixes of any kind (which includes opcode
>> prefixes before 0F) before most decoders start hurting, so we can pad
>> it out to 5 bytes by doing 3f 3f .. .. ..
>>
>>> My suggestion, not that I've had time to experiment, was to change
>>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>>> codegen for native even when paravirt is active.
>>>
>> I have attached what should be an "obvious" example... famous last words.
> Ah, now I see what you mean about Xen's #GP semantics.
>
> That's a neat way of doing it. It means the faulting path will really
> take 2 faults on Xen, but it's a faulting path anyway so speed is
> already out of the window.
>
> Do you mind teaching the #UD handler to deal with WRMSR like that?
>
> I ask, because I can't think of anything nicer.
>
> There are plenty of 3-byte instructions which #GP in PV guests (CPL3),
> and LTR is my go-to for debugging purposes, as it's not emulated by Xen.
>
> Anything here (and it can't be an actual WRMSR) will be slightly
> confusing to read in an OOPS, especially #UD for what is logically a #GP.
>
> But, a clear UD of some form in the disassembly is probably better than
> a random other instruction unrelated to the operation.
>
> ~Andrew
Oh, P.S.
We can probably drop most of the register manipulation by making the new
xen_do_write_msr be no_caller_saved_registers. As we're intentionally
not a C ABI to start with, we might as well not spill registers we don't
use either.
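Something like this (a sketch; needs compiler support, see below):
/*
 * The callee then saves/restores every register it actually touches,
 * so the asm stub shrinks to the argument shuffle, a call and a ret.
 */
__attribute__((no_caller_saved_registers))
bool xen_do_write_msr(uint32_t msr, uint64_t val);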
~Andrew
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 21:26 ` H. Peter Anvin
2024-08-16 22:27 ` Andrew Cooper
@ 2024-08-16 22:59 ` H. Peter Anvin
2024-08-17 23:51 ` H. Peter Anvin
1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-16 22:59 UTC (permalink / raw)
To: Andrew Cooper, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
[-- Attachment #1: Type: text/plain, Size: 3193 bytes --]
On 8/16/24 14:26, H. Peter Anvin wrote:
> On 8/16/24 11:40, Andrew Cooper wrote:
>>>
>>> As the CALL instruction is 5 bytes long, and we need to pad both
>>> WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>>
>
> You can use up to 4 prefixes of any kind (which includes opcode prefixes
> before 0F) before most decoders start hurting, so we can pad it out to 5
> bytes by doing 3f 3f .. .. ..
>
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>
> I have attached what should be an "obvious" example... famous last words.
>
After looking at what xen_do_write_msr does, I realized we can do
*much* better than that, by using two alternative sequences: one
selecting native vs. Xen, the other selecting WRMSR vs. WRMSRNS.
The call/jz sequence is *exactly* the same length as the mov/shr pair,
which means that xen_do_write_msr can simply return a flag if it wants
the MSR access to be performed natively.
An important point is that in no code path can the error code be set to
-EIO except by native MSR invocation, so the only possibilities are
"done successfully" or "execute natively." [That being said, that would
not be *that* hard to fix if needed.]
The new C prototype for xen_do_write_msr() becomes:
bool xen_do_write_msr(uint32_t msr, uint64_t val)
... with a true return meaning "execute natively."
(I changed the order of the arguments from my last version since it is
more consistent with what is used everywhere else.)
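In wrmsr() terms the shape would be something like this (untested
sketch; the jz from the replacement sequence back into the main
instruction stream glosses over the branch fixup the patching code
would need):
static __always_inline void wrmsr(u32 msr, u64 val)
{
	asm volatile(ALTERNATIVE("mov %%rax, %%rdx\n\t"
				 "shr $32, %%rdx",	/* 3 + 4 bytes */
				 "call asm_xen_write_msr\n\t"
				 "jz 2f",		/* 5 + 2 bytes */
				 X86_FEATURE_XENPV)
		     "1: "
		     ALTERNATIVE("ds wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS)
		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (val) : "rdx", "memory");
}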
RDMSR is a bit trickier. I think the best option there is to set up a
new trap fixup handler type that amounts to "get the address from
the stack, apply a specific offset, and invoke the fixup handler for
that address":
case EX_TYPE_UPLEVEL: {
/* Let reg hold the unsigned number of machine
* words to pop off the stack before the return
* address, and imm the signed offset from the
* return address to the desired trap point.
*/
unsigned long *sp = (unsigned long *)regs->sp + reg;
regs->ip = *sp++ + (int16_t)imm;
regs->sp = (unsigned long)sp;
goto again; /* Loop back to the beginning */
}
Again, "obviously correct" code attached.
-hpa
NOTE:
I had to dig several levels down to uncover the actual situation, but
pmu_msr_write() is actually a completely pointless function: all it
does is shuffle some arguments, call pmu_msr_chk_emulated(), and, if
that returns true AND the emulated flag is clear, do *exactly the same
thing* that the calling code would have done if pmu_msr_write() itself
had returned true... in other words, the whole function could be
replaced by:
bool pmu_msr_write(uint32_t msr, uint64_t val)
{
bool emulated;
return pmu_msr_chk_emulated(msr, &val, false, &emulated) &&
emulated;
}
pmu_msr_read() does the equivalent stupidity; the matching fix(es) to
pmu_msr_chk_emulated() I hope are obvious.
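(Presumably the read side collapses the same way; the signature here
is from memory and may not match the actual code:)
bool pmu_msr_read(uint32_t msr, uint64_t *val)
{
	bool emulated;
	return pmu_msr_chk_emulated(msr, val, true, &emulated) &&
		emulated;
}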
[-- Attachment #2: xen.S --]
[-- Type: text/plain, Size: 3135 bytes --]
/*
* Input in %rax, MSR number in %ecx
*
* %edx is set up to match %rax >> 32 like the native stub
* is expected to do
*
* Change xen_do_write_msr to return a true value if
* the MSR access should be executed natively (or vice versa,
* if you prefer.)
*
* bool xen_do_write_msr(uint32_t msr, uint64_t value)
*
* Let the native pattern look like:
*
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 3e 0f 30 ds wrmsr <--- trap point
*
* ... which can be replaced with ...
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 0f 01 c6 wrmsrns <--- trap point
*
* FOR XEN, replace the FIRST SEVEN BYTES with:
* e8 xx xx xx xx call asm_xen_write_msr
* 74 03 jz .+5
*
* If ZF=0 then this will fall down to the actual native WRMSR[NS]
* instruction.
*
* This also removes the need for Xen to maintain different safe and
* unsafe MSR routines, as the difference is handled by the same
* trap handler as is used natively.
*/
SYM_FUNC_START(asm_xen_write_msr)
FRAME_BEGIN
push %rax /* Save in case of native fallback */
push %rcx
push %rsi
push %rdi
push %r8
push %r9
push %r10
push %r11
mov %ecx,%edi /* MSR number */
mov %rax,%rsi /* MSR data */
call xen_do_write_msr
test %al,%al /* ZF=1 means we are done */
pop %r11
pop %r10
pop %r9
pop %r8
pop %rdi
pop %rsi
pop %rcx
mov 4(%rsp),%edx /* Set up %edx for native execution */
pop %rax
FRAME_END
RET
SYM_FUNC_END(asm_xen_write_msr)
/*
* RDMSR code. It isn't quite as clean; it requires a new trap handler
* type:
*
* case EX_TYPE_UPLEVEL: {
* /* Let reg hold the unsigned number of machine
* * words to pop off the stack before the return
* * address, and imm the signed offset from the
* * return address to the desired trap point.
* * /
* unsigned long *sp = (unsigned long *)regs->sp + reg;
* regs->ip = *sp++ + (int16_t)imm;
* regs->sp = (unsigned long)sp;
* goto again; /* Loop back to the beginning * /
* }
*
* The prototype of the Xen C code:
* struct { uint64_t val; bool native; } xen_do_read_msr(uint32_t msr)
*
* Native code:
* 0f 32 rdmsr <--- trap point
* 48 c1 e2 20 shl $0x20,%rdx
* 48 09 d0 or %rdx,%rax
*
* Xen code (cs rex is just for padding)
* 2e 40 e8 xx xx xx xx cs rex call asm_xen_read_msr
*/
SYM_FUNC_START(asm_xen_read_msr)
FRAME_BEGIN
push %rcx
push %rsi
push %rdi
push %r8
push %r9
push %r10
push %r11
mov %ecx,%edi /* MSR number */
call xen_do_read_msr
test %dl,%dl /* ZF=1 means we are done */
pop %r11
pop %r10
pop %r9
pop %r8
pop %rdi
pop %rsi
pop %rcx
jz 2f
1:
rdmsr
/*
 * On a fault, the top of the stack points directly at the return
 * address (zero words to pop); back up by 7 bytes from it to land on
 * the patched call site, whose own fixup entry then handles the trap.
 */
_ASM_EXTABLE_TYPE(1b, 2f, \
EX_TYPE_UPLEVEL|EX_DATA_IMM(-7)|EX_DATA_REG(0))
shl $32,%rdx
or %rdx,%rax
2:
FRAME_END
RET
SYM_FUNC_END(asm_xen_read_msr)
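/*
 * ABI note (sketch): with the SysV small-struct return convention the
 * prototype above comes back with val in %rax and the native flag in
 * %dl, which is exactly what the test %dl,%dl checks:
 *
 *	struct xen_rdmsr_ret {		// illustrative name only
 *		uint64_t val;		// returned in %rax
 *		bool	 native;	// returned in %dl
 *	};
 */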
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 22:34 ` Andrew Cooper
@ 2024-08-16 23:03 ` H. Peter Anvin
0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-16 23:03 UTC (permalink / raw)
To: Andrew Cooper, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross
On August 16, 2024 3:34:53 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 16/08/2024 11:27 pm, Andrew Cooper wrote:
>> On 16/08/2024 10:26 pm, H. Peter Anvin wrote:
>>> On 8/16/24 11:40, Andrew Cooper wrote:
>>>>> As the CALL instruction is 5 bytes long, and we need to pad both
>>>>> WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>>> You can use up to 4 prefixes of any kind (which includes opcode
>>> prefixes before 0F) before most decoders start hurting, so we can pad
>>> it out to 5 bytes by doing 3f 3f .. .. ..
>>>
>>>> My suggestion, not that I've had time to experiment, was to change
>>>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>>>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>>>> codegen for native even when paravirt is active.
>>>>
>>> I have attached what should be an "obvious" example... famous last words.
>> Ah, now I see what you mean about Xen's #GP semantics.
>>
>> That's a neat way of doing it. It means the faulting path will really
>> take 2 faults on Xen, but it's a faulting path anyway so speed is
>> already out of the window.
>>
>> Do you mind teaching the #UD handler to deal with WRMSR like that?
>>
>> I ask, because I can't think of anything nicer.
>>
>> There are plenty of 3-byte instructions which #GP in PV guests (CPL3),
>> and LTR is my go-to for debugging purposes, as it's not emulated by Xen.
>>
>> Anything here (and it can't be an actual WRMSR) will be slightly
>> confusing to read in an OOPS, especially #UD for what is logically a #GP.
>>
>> But, a clear UD of some form in the disassembly is probably better than
>> a random other instruction unrelated to the operation.
>>
>> ~Andrew
>
>Oh, P.S.
>
>We can probably drop most of the register manipulation by making the new
>xen_do_write_msr be no_caller_saved_registers. As we're intentionally
>not a C ABI to start with, we might as well not spill registers we don't
>use either.
>
>~Andrew
If you are willing to require a new enough gcc, certainly.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 17:52 ` Xin Li
2024-08-16 18:40 ` Andrew Cooper
@ 2024-08-17 14:23 ` Borislav Petkov
2024-08-17 14:43 ` Borislav Petkov
2024-08-17 19:22 ` H. Peter Anvin
1 sibling, 2 replies; 37+ messages in thread
From: Borislav Petkov @ 2024-08-17 14:23 UTC (permalink / raw)
To: Xin Li, Andrew Cooper, linux-kernel
Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc
On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote:
>On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>> +
>>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>> {
>>> - __wrmsrns(msr, val, val >> 32);
>>> + /*
>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
>>> + * DS prefix to avoid a trailing NOP.
>>> + */
>>> + asm volatile("1: "
>>> + ALTERNATIVE("ds wrmsr",
>>
>> This isn't the version I presented, and there's no discussion of the
>> alteration.
>
>I'm trying to implement wrmsr() as
>
>static __always_inline void wrmsr(u32 msr, u64 val)
>{
> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
> "call asm_xen_write_msr", X86_FEATURE_XENPV)
> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
> "D" (msr), "S" (val));
>}
>
>
>As the CALL instruction is 5 bytes long, and we need to pad both
>WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
The alternatives macros can handle arbitrary insn sizes - no need for any padding.
--
Sent from a small device: formatting sucks and brevity is inevitable.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 14:23 ` Borislav Petkov
@ 2024-08-17 14:43 ` Borislav Petkov
2024-08-17 15:44 ` Xin Li
2024-08-17 19:22 ` H. Peter Anvin
2024-08-17 19:22 ` H. Peter Anvin
1 sibling, 2 replies; 37+ messages in thread
From: Borislav Petkov @ 2024-08-17 14:43 UTC (permalink / raw)
To: Xin Li, Andrew Cooper, linux-kernel
Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc
On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
> >static __always_inline void wrmsr(u32 msr, u64 val)
> >{
> > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
> > "call asm_xen_write_msr", X86_FEATURE_XENPV)
> > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
> > "D" (msr), "S" (val));
> >}
> >
> >
> >As the CALL instruction is 5 bytes long, and we need to pad both
> >WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>
> The alternatives macros can handle arbitrary insn sizes - no need for any padding.
What you cannot do is have a CALL insn in only one of the alternative
insn groups, because the called function is free to clobber the
call-clobbered regs, and those might hold live values where you place
the alternative; thus you will have a nice lovely corruption.
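IOW (a made-up example, not real code):
	u64 a = expensive_thing();	/* may live in %r11 across the asm */
	wrmsr(msr, val);		/* CALL variant clobbers %r11 ... */
	use(a);				/* ... and 'a' is garbage now */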
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 14:43 ` Borislav Petkov
@ 2024-08-17 15:44 ` Xin Li
2024-08-17 19:22 ` H. Peter Anvin
1 sibling, 0 replies; 37+ messages in thread
From: Xin Li @ 2024-08-17 15:44 UTC (permalink / raw)
To: Borislav Petkov, Andrew Cooper, linux-kernel
Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc
On 8/17/2024 7:43 AM, Borislav Petkov wrote:
> On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
>>> static __always_inline void wrmsr(u32 msr, u64 val)
>>> {
>>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
>>> "call asm_xen_write_msr", X86_FEATURE_XENPV)
>>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>>> "D" (msr), "S" (val));
>>> }
>>>
>>>
>>> As the CALL instruction is 5 bytes long, and we need to pad both
>>> WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>>
>> The alternatives macros can handle arbitrary insn sizes - no need for any padding.
>
> What you cannot do is have a CALL insn in only one of the alternative
> insn groups, because the called function is free to clobber the
> call-clobbered regs, and those might hold live values where you place
> the alternative; thus you will have a nice lovely corruption.
Yeah, asm_xen_write_msr() is a wrapper around xen_write_msr() which
saves/restores those regs.
My first draft patch called xen_write_msr() directly and worked fine,
but hpa said the same thing you did.
Thanks!
Xin
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 14:23 ` Borislav Petkov
2024-08-17 14:43 ` Borislav Petkov
@ 2024-08-17 19:22 ` H. Peter Anvin
2024-08-18 5:49 ` Borislav Petkov
1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-17 19:22 UTC (permalink / raw)
To: Borislav Petkov, Xin Li, Andrew Cooper, linux-kernel
Cc: tglx, mingo, dave.hansen, x86, peterz, seanjc
On August 17, 2024 7:23:07 AM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote:
>>On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>>> +
>>>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */
>>>> static __always_inline void wrmsrns(u32 msr, u64 val)
>>>> {
>>>> - __wrmsrns(msr, val, val >> 32);
>>>> + /*
>>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant
>>>> + * DS prefix to avoid a trailing NOP.
>>>> + */
>>>> + asm volatile("1: "
>>>> + ALTERNATIVE("ds wrmsr",
>>>
>>> This isn't the version I presented, and there's no discussion of the
>>> alteration.
>>
>>I'm trying to implement wrmsr() as
>>
>>static __always_inline void wrmsr(u32 msr, u64 val)
>>{
>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
>> "call asm_xen_write_msr", X86_FEATURE_XENPV)
>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>> "D" (msr), "S" (val));
>>}
>>
>>
>>As the CALL instruction is 5 bytes long, and we need to pad both
>>WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>
>The alternatives macros can handle arbitrary insn sizes - no need for any padding.
>
>
The padding avoids unnecessary NOPs.
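Concretely (byte patterns, not actual patcher output):
	ds wrmsr	-> 3e 0f 30	(3 bytes, nothing to pad)
	wrmsr		-> 0f 30 90	(2 bytes + a one-byte NOP)
Either way the site is 3 bytes, so WRMSRNS (0f 01 c6) drops straight
in; the prefix just means the unpatched native path never executes
a NOP.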
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 14:43 ` Borislav Petkov
2024-08-17 15:44 ` Xin Li
@ 2024-08-17 19:22 ` H. Peter Anvin
2024-08-18 5:59 ` Borislav Petkov
1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-17 19:22 UTC (permalink / raw)
To: Borislav Petkov, Xin Li, Andrew Cooper, linux-kernel
Cc: tglx, mingo, dave.hansen, x86, peterz, seanjc
On August 17, 2024 7:43:16 AM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
>> >static __always_inline void wrmsr(u32 msr, u64 val)
>> >{
>> > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
>> > "call asm_xen_write_msr", X86_FEATURE_XENPV)
>> > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>> > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>> > "D" (msr), "S" (val));
>> >}
>> >
>> >
>> >As the CALL instruction is 5 bytes long, and we need to pad both
>> >WRMSR and WRMSRNS with NOPs, what about not using a segment prefix at all?
>>
>> The alternatives macros can handle arbitrary insn sizes - no need for any padding.
>
>What you cannot do is have a CALL insn in only one of the alternative
>insn groups, because the called function is free to clobber the
>call-clobbered regs, and those might hold live values where you place
>the alternative; thus you will have a nice lovely corruption.
>
Only if the call goes to a C function.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-16 22:59 ` H. Peter Anvin
@ 2024-08-17 23:51 ` H. Peter Anvin
0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-17 23:51 UTC (permalink / raw)
To: Andrew Cooper, Xin Li, linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc
[-- Attachment #1: Type: text/plain, Size: 1188 bytes --]
On 8/16/24 15:59, H. Peter Anvin wrote:
>
> RDMSR is a bit trickier. I think the best option there is to set up a
> new trap fixup handler type that amounts to "get the address from
> the stack, apply a specific offset, and invoke the fixup handler for
> that address":
>
>
> case EX_TYPE_UPLEVEL: {
> /* Let reg hold the unsigned number of machine
> * words to pop off the stack before the return
> * address, and imm the signed offset from the
> * return address to the desired trap point.
> */
> unsigned long *sp = (unsigned long *)regs->sp + reg;
> regs->ip = *sp++ + (int16_t)imm;
> regs->sp = (unsigned long)sp;
> goto again; /* Loop back to the beginning */
> }
>
> Again, "obviously correct" code attached.
>
Here is an untested patch implementing the above functionality, tidied
up and wrapped in a macro as _ASM_EXTABLE_FUNC_REWIND().
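A call site then looks like the RDMSR stub from my previous mail
(sketch):
	1:	rdmsr
		_ASM_EXTABLE_FUNC_REWIND(1b, -7, 0)
... i.e. pop zero extra words, then land 7 bytes before the return
address, which is the start of the patched call at the original MSR
site.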
-hpa
[-- Attachment #2: 0001-x86-extable-implement-EX_TYPE_FUNC_REWIND.patch --]
[-- Type: text/x-patch, Size: 6901 bytes --]
From 35f99255f2bc0174aba90c44b3e70415f0658257 Mon Sep 17 00:00:00 2001
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Sat, 17 Aug 2024 16:45:24 -0700
Subject: [PATCH] x86/extable: implement EX_TYPE_FUNC_REWIND
Add a new exception type, which allows emulating an exception as if it
had happened at or near the call site of a function. This allows a
function call inside an alternative for instruction emulation to "kick
back" the exception into the alternatives pattern, possibly invoking a
different exception handling pattern there, or at least indicating the
"real" location of the fault.
Untested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
arch/x86/include/asm/asm.h | 6 +
arch/x86/include/asm/extable_fixup_types.h | 1 +
arch/x86/mm/extable.c | 136 +++++++++++++--------
3 files changed, 92 insertions(+), 51 deletions(-)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 2bec0c89a95c..7398261b0f4a 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -232,5 +232,11 @@ register unsigned long current_stack_pointer asm(_ASM_SP);
#define _ASM_EXTABLE_FAULT(from, to) \
_ASM_EXTABLE_TYPE(from, to, EX_TYPE_FAULT)
+#define _ASM_EXTABLE_FUNC_REWIND(from, ipdelta, spdelta) \
+ _ASM_EXTABLE_TYPE(from, from /* unused */, \
+ EX_TYPE_FUNC_REWIND | \
+ EX_DATA_REG(spdelta) | \
+ EX_DATA_IMM(ipdelta))
+
#endif /* __KERNEL__ */
#endif /* _ASM_X86_ASM_H */
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 906b0d5541e8..2eed7f39893f 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -67,5 +67,6 @@
#define EX_TYPE_ZEROPAD 20 /* longword load with zeropad on fault */
#define EX_TYPE_ERETU 21
+#define EX_TYPE_FUNC_REWIND 22
#endif
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 51986e8a9d35..076f1492b935 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -290,6 +290,28 @@ static bool ex_handler_eretu(const struct exception_table_entry *fixup,
}
#endif
+/*
+ * Emulate a fault taken at the call site of a function.
+ * The combined reg and flags fields are used as an unsigned
+ * number of machine words to pop off the stack before the
+ * return address, then the signed imm field is used as a delta
+ * from the return IP address.
+ */
+static bool ex_handler_func_rewind(struct pt_regs *regs, int data)
+{
+ const long ipdelta = FIELD_GET(EX_DATA_IMM_MASK, data);
+ const unsigned long pops =
+ FIELD_GET(EX_DATA_REG_MASK|EX_DATA_FLAG_MASK, data);
+ unsigned long *sp;
+
+ sp = (unsigned long *)regs->sp;
+ sp += pops;
+ regs->ip = *sp++ + ipdelta;
+ regs->sp = (unsigned long)sp;
+
+ return true;
+}
+
int ex_get_fixup_type(unsigned long ip)
{
const struct exception_table_entry *e = search_exception_tables(ip);
@@ -302,6 +324,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
{
const struct exception_table_entry *e;
int type, reg, imm;
+ bool again;
#ifdef CONFIG_PNPBIOS
if (unlikely(SEGMENT_IS_PNP_CODE(regs->cs))) {
@@ -317,60 +340,71 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
}
#endif
- e = search_exception_tables(regs->ip);
- if (!e)
- return 0;
-
- type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
- reg = FIELD_GET(EX_DATA_REG_MASK, e->data);
- imm = FIELD_GET(EX_DATA_IMM_MASK, e->data);
-
- switch (type) {
- case EX_TYPE_DEFAULT:
- case EX_TYPE_DEFAULT_MCE_SAFE:
- return ex_handler_default(e, regs);
- case EX_TYPE_FAULT:
- case EX_TYPE_FAULT_MCE_SAFE:
- return ex_handler_fault(e, regs, trapnr);
- case EX_TYPE_UACCESS:
- return ex_handler_uaccess(e, regs, trapnr, fault_addr);
- case EX_TYPE_CLEAR_FS:
- return ex_handler_clear_fs(e, regs);
- case EX_TYPE_FPU_RESTORE:
- return ex_handler_fprestore(e, regs);
- case EX_TYPE_BPF:
- return ex_handler_bpf(e, regs);
- case EX_TYPE_WRMSR:
- return ex_handler_msr(e, regs, true, false, reg);
- case EX_TYPE_RDMSR:
- return ex_handler_msr(e, regs, false, false, reg);
- case EX_TYPE_WRMSR_SAFE:
- return ex_handler_msr(e, regs, true, true, reg);
- case EX_TYPE_RDMSR_SAFE:
- return ex_handler_msr(e, regs, false, true, reg);
- case EX_TYPE_WRMSR_IN_MCE:
- ex_handler_msr_mce(regs, true);
- break;
- case EX_TYPE_RDMSR_IN_MCE:
- ex_handler_msr_mce(regs, false);
- break;
- case EX_TYPE_POP_REG:
- regs->sp += sizeof(long);
- fallthrough;
- case EX_TYPE_IMM_REG:
- return ex_handler_imm_reg(e, regs, reg, imm);
- case EX_TYPE_FAULT_SGX:
- return ex_handler_sgx(e, regs, trapnr);
- case EX_TYPE_UCOPY_LEN:
- return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm);
- case EX_TYPE_ZEROPAD:
- return ex_handler_zeropad(e, regs, fault_addr);
+ do {
+ e = search_exception_tables(regs->ip);
+ if (!e)
+ return 0;
+
+ again = false;
+
+ type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
+ reg = FIELD_GET(EX_DATA_REG_MASK, e->data);
+ imm = FIELD_GET(EX_DATA_IMM_MASK, e->data);
+
+ switch (type) {
+ case EX_TYPE_DEFAULT:
+ case EX_TYPE_DEFAULT_MCE_SAFE:
+ return ex_handler_default(e, regs);
+ case EX_TYPE_FAULT:
+ case EX_TYPE_FAULT_MCE_SAFE:
+ return ex_handler_fault(e, regs, trapnr);
+ case EX_TYPE_UACCESS:
+ return ex_handler_uaccess(e, regs, trapnr, fault_addr);
+ case EX_TYPE_CLEAR_FS:
+ return ex_handler_clear_fs(e, regs);
+ case EX_TYPE_FPU_RESTORE:
+ return ex_handler_fprestore(e, regs);
+ case EX_TYPE_BPF:
+ return ex_handler_bpf(e, regs);
+ case EX_TYPE_WRMSR:
+ return ex_handler_msr(e, regs, true, false, reg);
+ case EX_TYPE_RDMSR:
+ return ex_handler_msr(e, regs, false, false, reg);
+ case EX_TYPE_WRMSR_SAFE:
+ return ex_handler_msr(e, regs, true, true, reg);
+ case EX_TYPE_RDMSR_SAFE:
+ return ex_handler_msr(e, regs, false, true, reg);
+ case EX_TYPE_WRMSR_IN_MCE:
+ ex_handler_msr_mce(regs, true);
+ break;
+ case EX_TYPE_RDMSR_IN_MCE:
+ ex_handler_msr_mce(regs, false);
+ break;
+ case EX_TYPE_POP_REG:
+ regs->sp += sizeof(long);
+ fallthrough;
+ case EX_TYPE_IMM_REG:
+ return ex_handler_imm_reg(e, regs, reg, imm);
+ case EX_TYPE_FAULT_SGX:
+ return ex_handler_sgx(e, regs, trapnr);
+ case EX_TYPE_UCOPY_LEN:
+ return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm);
+ case EX_TYPE_ZEROPAD:
+ return ex_handler_zeropad(e, regs, fault_addr);
#ifdef CONFIG_X86_FRED
- case EX_TYPE_ERETU:
- return ex_handler_eretu(e, regs, error_code);
+ case EX_TYPE_ERETU:
+ return ex_handler_eretu(e, regs, error_code);
#endif
- }
+ case EX_TYPE_FUNC_REWIND:
+ again = ex_handler_func_rewind(regs, e->data);
+ break;
+ default:
+ break; /* Will BUG() */
+ }
+ } while (again);
+
BUG();
+ return 0;
}
extern unsigned int early_recursion_flag;
--
2.46.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 19:22 ` H. Peter Anvin
@ 2024-08-18 5:49 ` Borislav Petkov
2024-08-18 6:16 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2024-08-18 5:49 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
x86, peterz, seanjc
On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
> The padding avoids unnecessary NOPs.
And a NOP vs some other prefix to have one less insn matters practically
how exactly here?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-17 19:22 ` H. Peter Anvin
@ 2024-08-18 5:59 ` Borislav Petkov
0 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2024-08-18 5:59 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
x86, peterz, seanjc
On Sat, Aug 17, 2024 at 12:22:56PM -0700, H. Peter Anvin wrote:
> Only if the call goes to a C function.
Aha, you have a patch in the thread which calls a by-hand asm thing.
I haven't looked in detail but we have a THUNK macro for such stuff.
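Something like this (from memory, the real macro is in
arch/x86/entry/thunk_64.S):
	/*
	 * Emits a stub which saves the caller-clobbered GPRs around a
	 * call to the C function - what asm_xen_write_msr open-codes
	 * above. It won't do the %ecx/%rax -> %edi/%rsi shuffling
	 * though, so some hand-written asm would remain.
	 */
	THUNK asm_xen_write_msr_thunk, xen_do_write_msr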
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
2024-08-18 5:49 ` Borislav Petkov
@ 2024-08-18 6:16 ` H. Peter Anvin
0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-18 6:16 UTC (permalink / raw)
To: Borislav Petkov
Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
x86, peterz, seanjc
On 8/17/24 22:49, Borislav Petkov wrote:
> On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
>> The padding avoids unnecessary NOPs.
>
> And a NOP vs some other prefix to have one less insn matters practically
> how exactly here?
Pretty much nothing, but it is zero cost and possibly some insanely
small gain.
-hpa
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2024-08-18 6:17 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
2024-08-07 16:21 ` Brian Gerst
2024-08-07 23:03 ` Xin Li
2024-08-07 18:08 ` Thomas Gleixner
2024-08-07 18:21 ` Thomas Gleixner
2024-08-07 23:01 ` Xin Li
2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel)
2024-08-07 18:11 ` Thomas Gleixner
2024-08-09 23:07 ` Andrew Cooper
2024-08-10 0:01 ` H. Peter Anvin
2024-08-10 0:25 ` Andrew Cooper
2024-08-16 17:52 ` Xin Li
2024-08-16 18:40 ` Andrew Cooper
2024-08-16 19:18 ` H. Peter Anvin
2024-08-16 21:45 ` Andrew Cooper
2024-08-16 21:26 ` H. Peter Anvin
2024-08-16 22:27 ` Andrew Cooper
2024-08-16 22:34 ` Andrew Cooper
2024-08-16 23:03 ` H. Peter Anvin
2024-08-16 22:59 ` H. Peter Anvin
2024-08-17 23:51 ` H. Peter Anvin
2024-08-17 14:23 ` Borislav Petkov
2024-08-17 14:43 ` Borislav Petkov
2024-08-17 15:44 ` Xin Li
2024-08-17 19:22 ` H. Peter Anvin
2024-08-18 5:59 ` Borislav Petkov
2024-08-17 19:22 ` H. Peter Anvin
2024-08-18 5:49 ` Borislav Petkov
2024-08-18 6:16 ` H. Peter Anvin
2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
2024-08-07 18:39 ` Thomas Gleixner
2024-08-07 21:55 ` Sean Christopherson
2024-08-07 23:04 ` Thomas Gleixner
2024-08-09 10:45 ` Nikolay Borisov
2024-08-09 17:37 ` Xin Li
2024-08-09 22:43 ` H. Peter Anvin