* [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace
@ 2024-08-07 5:47 Xin Li (Intel)
2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel)
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
seanjc
This patch set moves the write of MSR_IA32_FRED_RSP0 from context switch
to the return-to-userspace path.
In the discussion about saving/restoring host/guest FRED RSP0 for KVM,
Sean proposed having the kernel write MSR_IA32_FRED_RSP0 on return to
userspace, i.e., in arch_exit_to_user_mode_prepare(), instead of on
context switch [1].
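Roughly, that amounts to programming FRED RSP0 once on the exit path; a
minimal sketch (the helper name is made up, patch 3 has the actual change,
and the stack-top calculation below just mirrors the existing FRED
context-switch code):

	/*
	 * Illustrative sketch only -- patch 3 contains the real change.
	 * Program FRED RSP0 for the current task on exit to user mode,
	 * rather than on every context switch.
	 */
	static __always_inline void fred_sync_rsp0(void)
	{
		if (cpu_feature_enabled(X86_FEATURE_FRED))
			wrmsrns(MSR_IA32_FRED_RSP0,
				(unsigned long)task_stack_page(current) + THREAD_SIZE);
	}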
hpa suggested testing ti_work for zero before processing its individual
bits in arch_exit_to_user_mode_prepare(). A quick measurement shows that
the ti_work value passed to arch_exit_to_user_mode_prepare() is zero in
most cases, e.g., 99% of the time in kernel build tests. This zero-test
change was then run through Intel 0day testing, and no performance
regression was reported.
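Condensed, the change is just an outer zero test around the existing
per-bit handling; see patch 1 below for the full diff:

	/* Sketch only -- patch 1 has the real diff. */
	static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
							  unsigned long ti_work)
	{
		if (unlikely(ti_work)) {
			/* ... process individual _TIF_* bits as before ... */
		}
	}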
In addition, per the discussion of how to write MSR_IA32_FRED_RSP0 now
that WRMSRNS has been introduced [2], use the alternatives mechanism to
select WRMSRNS when it is available and fall back to WRMSR otherwise.
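In code, that is a single ALTERNATIVE() site; a minimal sketch, assuming
the WRMSRNS opcode macro introduced in patch 2 (the real helper also pads
WRMSR and attaches the exception-table entry):

	static __always_inline void wrmsrns(u32 msr, u64 val)
	{
		asm volatile(ALTERNATIVE("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS)
			     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
			     : "memory");
	}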
[1] https://lore.kernel.org/lkml/ZpkfkSMPiXrS9r2K@google.com/
[2] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
Andrew Cooper (1):
x86/msr: Switch between WRMSRNS and WRMSR with the alternatives
mechanism
Xin Li (Intel) (2):
x86/entry: Test ti_work for zero before processing individual bits
x86/entry: Set FRED RSP0 on return to userspace instead of context
switch
arch/x86/include/asm/entry-common.h | 21 ++++++++++++++-------
arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
arch/x86/include/asm/switch_to.h | 3 +--
arch/x86/include/asm/thread_info.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 -
5 files changed, 31 insertions(+), 24 deletions(-)
base-commit: 9ebdc7589cbb5c976e6c8807cbe13f263d70d32c
--
2.45.2
^ permalink raw reply [flat|nested] 37+ messages in thread* [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel) @ 2024-08-07 5:47 ` Xin Li (Intel) 2024-08-07 16:21 ` Brian Gerst 2024-08-07 18:08 ` Thomas Gleixner 2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel) 2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel) 2 siblings, 2 replies; 37+ messages in thread From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() are zeros, e.g., 99% in kernel build tests. So an obvious optimization is to test ti_work for zero before processing individual bits in it. In addition, Intel 0day tests find no perf regression with this change. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> --- arch/x86/include/asm/entry-common.h | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index fb2809b20b0a..4c78b99060b5 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, unsigned long ti_work) { - if (ti_work & _TIF_USER_RETURN_NOTIFY) - fire_user_return_notifiers(); + if (unlikely(ti_work)) { + if (ti_work & _TIF_USER_RETURN_NOTIFY) + fire_user_return_notifiers(); - if (unlikely(ti_work & _TIF_IO_BITMAP)) - tss_update_io_bitmap(); + if (unlikely(ti_work & _TIF_IO_BITMAP)) + tss_update_io_bitmap(); - fpregs_assert_state_consistent(); - if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) - switch_fpu_return(); + fpregs_assert_state_consistent(); + if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) + switch_fpu_return(); + } #ifdef CONFIG_COMPAT /* -- 2.45.2 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel) @ 2024-08-07 16:21 ` Brian Gerst 2024-08-07 23:03 ` Xin Li 2024-08-07 18:08 ` Thomas Gleixner 1 sibling, 1 reply; 37+ messages in thread From: Brian Gerst @ 2024-08-07 16:21 UTC (permalink / raw) To: Xin Li (Intel) Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On Wed, Aug 7, 2024 at 1:51 AM Xin Li (Intel) <xin@zytor.com> wrote: > > In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() > are zeros, e.g., 99% in kernel build tests. So an obvious optimization > is to test ti_work for zero before processing individual bits in it. > > In addition, Intel 0day tests find no perf regression with this change. > > Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> > Signed-off-by: Xin Li (Intel) <xin@zytor.com> > --- > arch/x86/include/asm/entry-common.h | 16 +++++++++------- > 1 file changed, 9 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h > index fb2809b20b0a..4c78b99060b5 100644 > --- a/arch/x86/include/asm/entry-common.h > +++ b/arch/x86/include/asm/entry-common.h > @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) > static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, > unsigned long ti_work) > { > - if (ti_work & _TIF_USER_RETURN_NOTIFY) > - fire_user_return_notifiers(); > + if (unlikely(ti_work)) { > + if (ti_work & _TIF_USER_RETURN_NOTIFY) > + fire_user_return_notifiers(); > > - if (unlikely(ti_work & _TIF_IO_BITMAP)) > - tss_update_io_bitmap(); > + if (unlikely(ti_work & _TIF_IO_BITMAP)) > + tss_update_io_bitmap(); > > - fpregs_assert_state_consistent(); > - if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) > - switch_fpu_return(); > + fpregs_assert_state_consistent(); This call was originally unconditional, and does nothing if TIF_NEED_FPU_LOAD is set. > + if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) > + switch_fpu_return(); > + } > > #ifdef CONFIG_COMPAT > /* > -- > 2.45.2 > > Brian Gerst ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 16:21 ` Brian Gerst @ 2024-08-07 23:03 ` Xin Li 0 siblings, 0 replies; 37+ messages in thread From: Xin Li @ 2024-08-07 23:03 UTC (permalink / raw) To: Brian Gerst Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On 8/7/2024 9:21 AM, Brian Gerst wrote: >> + fpregs_assert_state_consistent(); > This call was originally unconditional, and does nothing if > TIF_NEED_FPU_LOAD is set. lost my mind! Thanks! ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel) 2024-08-07 16:21 ` Brian Gerst @ 2024-08-07 18:08 ` Thomas Gleixner 2024-08-07 18:21 ` Thomas Gleixner 2024-08-07 23:01 ` Xin Li 1 sibling, 2 replies; 37+ messages in thread From: Thomas Gleixner @ 2024-08-07 18:08 UTC (permalink / raw) To: Xin Li (Intel), linux-kernel Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On Tue, Aug 06 2024 at 22:47, Xin Li wrote: > In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() > are zeros, e.g., 99% in kernel build tests. So an obvious optimization > is to test ti_work for zero before processing individual bits in it. > > In addition, Intel 0day tests find no perf regression with this change. > > Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> > Signed-off-by: Xin Li (Intel) <xin@zytor.com> > --- > arch/x86/include/asm/entry-common.h | 16 +++++++++------- > 1 file changed, 9 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h > index fb2809b20b0a..4c78b99060b5 100644 > --- a/arch/x86/include/asm/entry-common.h > +++ b/arch/x86/include/asm/entry-common.h > @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) > static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, > unsigned long ti_work) > { > - if (ti_work & _TIF_USER_RETURN_NOTIFY) > - fire_user_return_notifiers(); > + if (unlikely(ti_work)) { > + if (ti_work & _TIF_USER_RETURN_NOTIFY) > + fire_user_return_notifiers(); > > - if (unlikely(ti_work & _TIF_IO_BITMAP)) > - tss_update_io_bitmap(); > + if (unlikely(ti_work & _TIF_IO_BITMAP)) > + tss_update_io_bitmap(); > > - fpregs_assert_state_consistent(); Please keep this unconditional and independent of ti_work. It's a debug feature and you kill coverage with making it conditional on ti_work. Thanks, tglx ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 18:08 ` Thomas Gleixner @ 2024-08-07 18:21 ` Thomas Gleixner 2024-08-07 23:01 ` Xin Li 1 sibling, 0 replies; 37+ messages in thread From: Thomas Gleixner @ 2024-08-07 18:21 UTC (permalink / raw) To: Xin Li (Intel), linux-kernel Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On Wed, Aug 07 2024 at 20:08, Thomas Gleixner wrote: > On Tue, Aug 06 2024 at 22:47, Xin Li wrote: >> In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() >> are zeros, e.g., 99% in kernel build tests. So an obvious optimization >> is to test ti_work for zero before processing individual bits in it. >> >> In addition, Intel 0day tests find no perf regression with this change. >> >> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> >> Signed-off-by: Xin Li (Intel) <xin@zytor.com> >> --- >> arch/x86/include/asm/entry-common.h | 16 +++++++++------- >> 1 file changed, 9 insertions(+), 7 deletions(-) >> >> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h >> index fb2809b20b0a..4c78b99060b5 100644 >> --- a/arch/x86/include/asm/entry-common.h >> +++ b/arch/x86/include/asm/entry-common.h >> @@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) >> static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, >> unsigned long ti_work) >> { >> - if (ti_work & _TIF_USER_RETURN_NOTIFY) >> - fire_user_return_notifiers(); >> + if (unlikely(ti_work)) { >> + if (ti_work & _TIF_USER_RETURN_NOTIFY) >> + fire_user_return_notifiers(); >> >> - if (unlikely(ti_work & _TIF_IO_BITMAP)) >> - tss_update_io_bitmap(); >> + if (unlikely(ti_work & _TIF_IO_BITMAP)) >> + tss_update_io_bitmap(); >> >> - fpregs_assert_state_consistent(); > > Please keep this unconditional and independent of ti_work. It's a debug > feature and you kill coverage with making it conditional on ti_work. Also spare the extra indentation level and do: static inline void arch_exit_work(unsigned long ti_work) { if (ti_work & _TIF_USER_RETURN_NOTIFY) fire_user_return_notifiers(); if (unlikely(ti_work & _TIF_IO_BITMAP)) tss_update_io_bitmap(); fpregs_assert_state_consistent(); if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) switch_fpu_return(); } static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, unsigned long ti_work) { if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work)) arch_exit_work(ti_work); ... Thanks, tglx ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits 2024-08-07 18:08 ` Thomas Gleixner 2024-08-07 18:21 ` Thomas Gleixner @ 2024-08-07 23:01 ` Xin Li 1 sibling, 0 replies; 37+ messages in thread From: Xin Li @ 2024-08-07 23:01 UTC (permalink / raw) To: Thomas Gleixner, linux-kernel Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On 8/7/2024 11:08 AM, Thomas Gleixner wrote: >> - fpregs_assert_state_consistent(); > Please keep this unconditional and independent of ti_work. It's a debug > feature and you kill coverage with making it conditional on ti_work. Sigh, I'm an idiot. Thanks! Xin ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel) 2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel) @ 2024-08-07 5:47 ` Xin Li (Intel) 2024-08-07 18:11 ` Thomas Gleixner 2024-08-09 23:07 ` Andrew Cooper 2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel) 2 siblings, 2 replies; 37+ messages in thread From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc From: Andrew Cooper <andrew.cooper3@citrix.com> Per the discussion about FRED MSR writes with WRMSRNS instruction [1], use the alternatives mechanism to choose WRMSRNS when it's available, otherwise fallback to WRMSR. [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/ Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> --- arch/x86/include/asm/msr.h | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index d642037f9ed5..3e402d717815 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high) : : "c" (msr), "a"(low), "d" (high) : "memory"); } -/* - * WRMSRNS behaves exactly like WRMSR with the only difference being - * that it is not a serializing instruction by default. - */ -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high) -{ - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */ - asm volatile("1: .byte 0x0f,0x01,0xc6\n" - "2:\n" - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) - : : "c" (msr), "a"(low), "d" (high)); -} - #define native_rdmsr(msr, val1, val2) \ do { \ u64 __val = __rdmsr((msr)); \ @@ -312,9 +299,22 @@ do { \ #endif /* !CONFIG_PARAVIRT_XXL */ +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) + +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ static __always_inline void wrmsrns(u32 msr, u64 val) { - __wrmsrns(msr, val, val >> 32); + /* + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant + * DS prefix to avoid a trailing NOP. + */ + asm volatile("1: " + ALTERNATIVE("ds wrmsr", + WRMSRNS, X86_FEATURE_WRMSRNS) + "2:\n" + _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) + : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))); } /* -- 2.45.2 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel) @ 2024-08-07 18:11 ` Thomas Gleixner 2024-08-09 23:07 ` Andrew Cooper 1 sibling, 0 replies; 37+ messages in thread From: Thomas Gleixner @ 2024-08-07 18:11 UTC (permalink / raw) To: Xin Li (Intel), linux-kernel Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On Tue, Aug 06 2024 at 22:47, Xin Li wrote: > > +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ > +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) > + > +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ > static __always_inline void wrmsrns(u32 msr, u64 val) > { > - __wrmsrns(msr, val, val >> 32); > + /* > + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant > + * DS prefix to avoid a trailing NOP. > + */ > + asm volatile("1: " > + ALTERNATIVE("ds wrmsr", > + WRMSRNS, X86_FEATURE_WRMSRNS) Please get rid of this horrible line break. You have 100 characters line width. Thanks, tglx ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel) 2024-08-07 18:11 ` Thomas Gleixner @ 2024-08-09 23:07 ` Andrew Cooper 2024-08-10 0:01 ` H. Peter Anvin 2024-08-16 17:52 ` Xin Li 1 sibling, 2 replies; 37+ messages in thread From: Andrew Cooper @ 2024-08-09 23:07 UTC (permalink / raw) To: Xin Li (Intel), linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc On 07/08/2024 6:47 am, Xin Li (Intel) wrote: > From: Andrew Cooper <andrew.cooper3@citrix.com> > > Per the discussion about FRED MSR writes with WRMSRNS instruction [1], > use the alternatives mechanism to choose WRMSRNS when it's available, > otherwise fallback to WRMSR. > > [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/ > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Xin Li (Intel) <xin@zytor.com> > --- > arch/x86/include/asm/msr.h | 28 ++++++++++++++-------------- > 1 file changed, 14 insertions(+), 14 deletions(-) > > diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h > index d642037f9ed5..3e402d717815 100644 > --- a/arch/x86/include/asm/msr.h > +++ b/arch/x86/include/asm/msr.h > @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high) > : : "c" (msr), "a"(low), "d" (high) : "memory"); > } > > -/* > - * WRMSRNS behaves exactly like WRMSR with the only difference being > - * that it is not a serializing instruction by default. > - */ > -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high) > -{ > - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */ > - asm volatile("1: .byte 0x0f,0x01,0xc6\n" > - "2:\n" > - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) > - : : "c" (msr), "a"(low), "d" (high)); > -} > - > #define native_rdmsr(msr, val1, val2) \ > do { \ > u64 __val = __rdmsr((msr)); \ > @@ -312,9 +299,22 @@ do { \ > > #endif /* !CONFIG_PARAVIRT_XXL */ > > +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ > +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) > + > +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ > static __always_inline void wrmsrns(u32 msr, u64 val) > { > - __wrmsrns(msr, val, val >> 32); > + /* > + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant > + * DS prefix to avoid a trailing NOP. > + */ > + asm volatile("1: " > + ALTERNATIVE("ds wrmsr", This isn't the version I presented, and there's no discussion of the alteration. The choice of CS over DS was deliberate, and came from Intel: https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf So unless Intel want to retract that whitepaper, and all the binutils work with it, I'd suggest keeping it as CS like we use elsewhere, and as explicitly instructed by Intel. ~Andrew ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-09 23:07 ` Andrew Cooper @ 2024-08-10 0:01 ` H. Peter Anvin 2024-08-10 0:25 ` Andrew Cooper 2024-08-16 17:52 ` Xin Li 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2024-08-10 0:01 UTC (permalink / raw) To: Andrew Cooper, Xin Li (Intel), linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >> From: Andrew Cooper <andrew.cooper3@citrix.com> >> >> Per the discussion about FRED MSR writes with WRMSRNS instruction [1], >> use the alternatives mechanism to choose WRMSRNS when it's available, >> otherwise fallback to WRMSR. >> >> [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/ >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >> Signed-off-by: Xin Li (Intel) <xin@zytor.com> >> --- >> arch/x86/include/asm/msr.h | 28 ++++++++++++++-------------- >> 1 file changed, 14 insertions(+), 14 deletions(-) >> >> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h >> index d642037f9ed5..3e402d717815 100644 >> --- a/arch/x86/include/asm/msr.h >> +++ b/arch/x86/include/asm/msr.h >> @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high) >> : : "c" (msr), "a"(low), "d" (high) : "memory"); >> } >> >> -/* >> - * WRMSRNS behaves exactly like WRMSR with the only difference being >> - * that it is not a serializing instruction by default. >> - */ >> -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high) >> -{ >> - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */ >> - asm volatile("1: .byte 0x0f,0x01,0xc6\n" >> - "2:\n" >> - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >> - : : "c" (msr), "a"(low), "d" (high)); >> -} >> - >> #define native_rdmsr(msr, val1, val2) \ >> do { \ >> u64 __val = __rdmsr((msr)); \ >> @@ -312,9 +299,22 @@ do { \ >> >> #endif /* !CONFIG_PARAVIRT_XXL */ >> >> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >> + >> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ >> static __always_inline void wrmsrns(u32 msr, u64 val) >> { >> - __wrmsrns(msr, val, val >> 32); >> + /* >> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant >> + * DS prefix to avoid a trailing NOP. >> + */ >> + asm volatile("1: " >> + ALTERNATIVE("ds wrmsr", > >This isn't the version I presented, and there's no discussion of the >alteration. > >The choice of CS over DS was deliberate, and came from Intel: > >https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf > >So unless Intel want to retract that whitepaper, and all the binutils >work with it, I'd suggest keeping it as CS like we use elsewhere, and as >explicitly instructed by Intel. > >~Andrew I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded. We can't use cs in all cases, since you can't do a store to the code segment (always readonly) so we use 0x3e (DS) to patch out LOCK. In the paper you describe, it only mentions 0x2e as a "benign prefix" in a specific example, not as any kind of specific recommendation. 
It is particularly irrelevant when it comes to padding two instructions to the same length, as the paper deals with alignment.

If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.)

^ permalink raw reply	[flat|nested] 37+ messages in thread
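For quick reference, the encodings being compared are all three bytes, so
either prefix pads WRMSR to the same length as WRMSRNS without a trailing
NOP:

	/*
	 * Standard opcode bytes:
	 *   2e 0f 30    cs wrmsr     (CS-prefix-padded WRMSR)
	 *   3e 0f 30    ds wrmsr     (DS-prefix-padded WRMSR)
	 *   0f 01 c6    wrmsrns
	 */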
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-10 0:01 ` H. Peter Anvin @ 2024-08-10 0:25 ` Andrew Cooper 0 siblings, 0 replies; 37+ messages in thread From: Andrew Cooper @ 2024-08-10 0:25 UTC (permalink / raw) To: H. Peter Anvin, Xin Li (Intel), linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc On 10/08/2024 1:01 am, H. Peter Anvin wrote: > On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>> >>> Per the discussion about FRED MSR writes with WRMSRNS instruction [1], >>> use the alternatives mechanism to choose WRMSRNS when it's available, >>> otherwise fallback to WRMSR. >>> >>> [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/ >>> >>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> Signed-off-by: Xin Li (Intel) <xin@zytor.com> >>> --- >>> arch/x86/include/asm/msr.h | 28 ++++++++++++++-------------- >>> 1 file changed, 14 insertions(+), 14 deletions(-) >>> >>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h >>> index d642037f9ed5..3e402d717815 100644 >>> --- a/arch/x86/include/asm/msr.h >>> +++ b/arch/x86/include/asm/msr.h >>> @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high) >>> : : "c" (msr), "a"(low), "d" (high) : "memory"); >>> } >>> >>> -/* >>> - * WRMSRNS behaves exactly like WRMSR with the only difference being >>> - * that it is not a serializing instruction by default. >>> - */ >>> -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high) >>> -{ >>> - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */ >>> - asm volatile("1: .byte 0x0f,0x01,0xc6\n" >>> - "2:\n" >>> - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >>> - : : "c" (msr), "a"(low), "d" (high)); >>> -} >>> - >>> #define native_rdmsr(msr, val1, val2) \ >>> do { \ >>> u64 __val = __rdmsr((msr)); \ >>> @@ -312,9 +299,22 @@ do { \ >>> >>> #endif /* !CONFIG_PARAVIRT_XXL */ >>> >>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>> + >>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ >>> static __always_inline void wrmsrns(u32 msr, u64 val) >>> { >>> - __wrmsrns(msr, val, val >> 32); >>> + /* >>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant >>> + * DS prefix to avoid a trailing NOP. >>> + */ >>> + asm volatile("1: " >>> + ALTERNATIVE("ds wrmsr", >> This isn't the version I presented, and there's no discussion of the >> alteration. >> >> The choice of CS over DS was deliberate, and came from Intel: >> >> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf >> >> So unless Intel want to retract that whitepaper, and all the binutils >> work with it, I'd suggest keeping it as CS like we use elsewhere, and as >> explicitly instructed by Intel. >> >> ~Andrew > I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded. > > We can't use cs in all cases, since you can't do a store to the code segment (always readonly) so we use 0x3e (DS) to patch out LOCK. > > In the paper you describe, it only mentions 0x2e as a "benign prefix" in a specific example, not as any kind of specific recommendation. 
It is particularly irrelevant when it comes to padding a two instructions to the same length as the paper deals with assignment. > > If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.) It would be lovely if there could be a single coherent statement. In addition to store semantics, off the top of my head: * CS is the P4 hint-not-taken (presumably Jcc only), ignored now. * DS is both the P4 hint-taken (presumably Jcc only), newly reintroduced in Redwood Cove with tweaked semantics (definitely Jcc only), and the CET notrack prefix (JMP/CALL *IND only). Plus whatever else I've missed[1]. ~Andrew [1] I'm ignoring XuCode mode memory operand semantics as not relevant to CPL0 software (where AIUI unprefixed is physical and DS prefixed is virtual) but I'm noting it here to highlight that there's definitely extra complexity beyond what is in SDM Vol2. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-09 23:07 ` Andrew Cooper 2024-08-10 0:01 ` H. Peter Anvin @ 2024-08-16 17:52 ` Xin Li 2024-08-16 18:40 ` Andrew Cooper 2024-08-17 14:23 ` Borislav Petkov 1 sibling, 2 replies; 37+ messages in thread From: Xin Li @ 2024-08-16 17:52 UTC (permalink / raw) To: Andrew Cooper, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc On 8/9/2024 4:07 PM, Andrew Cooper wrote: > On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >> From: Andrew Cooper <andrew.cooper3@citrix.com> >> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >> + >> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ >> static __always_inline void wrmsrns(u32 msr, u64 val) >> { >> - __wrmsrns(msr, val, val >> 32); >> + /* >> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant >> + * DS prefix to avoid a trailing NOP. >> + */ >> + asm volatile("1: " >> + ALTERNATIVE("ds wrmsr", > > This isn't the version I presented, and there's no discussion of the > alteration. I'm trying to implement wrmsr() as static __always_inline void wrmsr(u32 msr, u64 val) { asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, "call asm_xen_write_msr", X86_FEATURE_XENPV) "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), "D" (msr), "S" (val)); } As the CALL instruction is 5-byte long, and we need to pad nop for both WRMSR and WRMSRNS, what about not using segment prefix at all? > The choice of CS over DS was deliberate, and came from Intel: > > https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf > > So unless Intel want to retract that whitepaper, and all the binutils > work with it, I'd suggest keeping it as CS like we use elsewhere, and as > explicitly instructed by Intel. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 17:52 ` Xin Li @ 2024-08-16 18:40 ` Andrew Cooper 2024-08-16 19:18 ` H. Peter Anvin 2024-08-16 21:26 ` H. Peter Anvin 2024-08-17 14:23 ` Borislav Petkov 1 sibling, 2 replies; 37+ messages in thread From: Andrew Cooper @ 2024-08-16 18:40 UTC (permalink / raw) To: Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, seanjc On 16/08/2024 6:52 pm, Xin Li wrote: > On 8/9/2024 4:07 PM, Andrew Cooper wrote: >> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>> + >>> +/* Non-serializing WRMSR, when available. Falls back to a >>> serializing WRMSR. */ >>> static __always_inline void wrmsrns(u32 msr, u64 val) >>> { >>> - __wrmsrns(msr, val, val >> 32); >>> + /* >>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a >>> redundant >>> + * DS prefix to avoid a trailing NOP. >>> + */ >>> + asm volatile("1: " >>> + ALTERNATIVE("ds wrmsr", >> >> This isn't the version I presented, and there's no discussion of the >> alteration. > > I'm trying to implement wrmsr() as > > static __always_inline void wrmsr(u32 msr, u64 val) > { > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, > X86_FEATURE_WRMSRNS, > "call asm_xen_write_msr", X86_FEATURE_XENPV) > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), > "D" (msr), "S" (val)); > } > > > As the CALL instruction is 5-byte long, and we need to pad nop for both > WRMSR and WRMSRNS, what about not using segment prefix at all? The prefix was a minor optimisation to avoid having a trailing nop at the end. When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes on WRMSRNS to make all options be 5 bytes long. That said, there's already a paravirt hook for this, and if you're looking to work around the code gen mess for that, then doing it like this by doubling up into rdi and rsi isn't great either. My suggestion, not that I've had time to experiment, was to change paravirt to use a non-C ABI and have asm_xen_write_msr() recombine edx:eax into rsi. That way the top level wrmsr() retains sensible codegen for native even when paravirt is active. ~Andrew ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 18:40 ` Andrew Cooper @ 2024-08-16 19:18 ` H. Peter Anvin 2024-08-16 21:45 ` Andrew Cooper 2024-08-16 21:26 ` H. Peter Anvin 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2024-08-16 19:18 UTC (permalink / raw) To: Andrew Cooper, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >On 16/08/2024 6:52 pm, Xin Li wrote: >> On 8/9/2024 4:07 PM, Andrew Cooper wrote: >>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>>> + >>>> +/* Non-serializing WRMSR, when available. Falls back to a >>>> serializing WRMSR. */ >>>> static __always_inline void wrmsrns(u32 msr, u64 val) >>>> { >>>> - __wrmsrns(msr, val, val >> 32); >>>> + /* >>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a >>>> redundant >>>> + * DS prefix to avoid a trailing NOP. >>>> + */ >>>> + asm volatile("1: " >>>> + ALTERNATIVE("ds wrmsr", >>> >>> This isn't the version I presented, and there's no discussion of the >>> alteration. >> >> I'm trying to implement wrmsr() as >> >> static __always_inline void wrmsr(u32 msr, u64 val) >> { >> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, >> X86_FEATURE_WRMSRNS, >> "call asm_xen_write_msr", X86_FEATURE_XENPV) >> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), >> "D" (msr), "S" (val)); >> } >> >> >> As the CALL instruction is 5-byte long, and we need to pad nop for both >> WRMSR and WRMSRNS, what about not using segment prefix at all? > >The prefix was a minor optimisation to avoid having a trailing nop at >the end. > >When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes >on WRMSRNS to make all options be 5 bytes long. > >That said, there's already a paravirt hook for this, and if you're >looking to work around the code gen mess for that, then doing it like >this by doubling up into rdi and rsi isn't great either. > >My suggestion, not that I've had time to experiment, was to change >paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >edx:eax into rsi. That way the top level wrmsr() retains sensible >codegen for native even when paravirt is active. > >~Andrew Heh, that was my suggestion, except that I suggested defining it so rax pass the full value; the generation of edx still is necessary but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern as the CALL instruction is longer.) My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 19:18 ` H. Peter Anvin @ 2024-08-16 21:45 ` Andrew Cooper 0 siblings, 0 replies; 37+ messages in thread From: Andrew Cooper @ 2024-08-16 21:45 UTC (permalink / raw) To: H. Peter Anvin, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc On 16/08/2024 8:18 pm, H. Peter Anvin wrote: > On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 16/08/2024 6:52 pm, Xin Li wrote: >>> On 8/9/2024 4:07 PM, Andrew Cooper wrote: >>>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>>>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>>>> + >>>>> +/* Non-serializing WRMSR, when available. Falls back to a >>>>> serializing WRMSR. */ >>>>> static __always_inline void wrmsrns(u32 msr, u64 val) >>>>> { >>>>> - __wrmsrns(msr, val, val >> 32); >>>>> + /* >>>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a >>>>> redundant >>>>> + * DS prefix to avoid a trailing NOP. >>>>> + */ >>>>> + asm volatile("1: " >>>>> + ALTERNATIVE("ds wrmsr", >>>> This isn't the version I presented, and there's no discussion of the >>>> alteration. >>> I'm trying to implement wrmsr() as >>> >>> static __always_inline void wrmsr(u32 msr, u64 val) >>> { >>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, >>> X86_FEATURE_WRMSRNS, >>> "call asm_xen_write_msr", X86_FEATURE_XENPV) >>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), >>> "D" (msr), "S" (val)); >>> } >>> >>> >>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>> WRMSR and WRMSRNS, what about not using segment prefix at all? >> The prefix was a minor optimisation to avoid having a trailing nop at >> the end. >> >> When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes >> on WRMSRNS to make all options be 5 bytes long. >> >> That said, there's already a paravirt hook for this, and if you're >> looking to work around the code gen mess for that, then doing it like >> this by doubling up into rdi and rsi isn't great either. >> >> My suggestion, not that I've had time to experiment, was to change >> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >> edx:eax into rsi. That way the top level wrmsr() retains sensible >> codegen for native even when paravirt is active. >> >> ~Andrew > Heh, that was my suggestion, except that I suggested defining it so rax pass the full value; the generation of edx still is necessary but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern as the CALL instruction is longer.) Hmm yeah, having %rax be full is likely how the logic is going to look before generating %edx. > My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial. For historical reasons it's a mess. xen_do_write_msr() does several things. * Turns FSBASE/GSBASE/GSKERN into their respective hypercalls (although no error checking at all!) * Discards modifications to the SYSCALL/SYSENTER MSRs Otherwise, uses the native accessor, taking the #GP path to Xen and is emulated (including full decode). Way back in the day, XenPV's paravirt wrmsr would swallow #GP and pretend success, and then started depending on this behaviour in order to boot. 
But times have moved on, and even the normal accessors are really safe+warn, and I'm sure all of this could be cleaned up. For ages I've been wanting to make a single PV_PRIV_OP hypercall so we can skip the x86 emulator, but I've never had time to do that either. ~Andrew ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 18:40 ` Andrew Cooper 2024-08-16 19:18 ` H. Peter Anvin @ 2024-08-16 21:26 ` H. Peter Anvin 2024-08-16 22:27 ` Andrew Cooper 2024-08-16 22:59 ` H. Peter Anvin 1 sibling, 2 replies; 37+ messages in thread From: H. Peter Anvin @ 2024-08-16 21:26 UTC (permalink / raw) To: Andrew Cooper, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc [-- Attachment #1: Type: text/plain, Size: 713 bytes --] On 8/16/24 11:40, Andrew Cooper wrote: >> >> As the CALL instruction is 5-byte long, and we need to pad nop for both >> WRMSR and WRMSRNS, what about not using segment prefix at all? > You can use up to 4 prefixes of any kind (which includes opcode prefixes before 0F) before most decoders start hurting, so we can pad it out to 5 bytes by doing 3f 3f .. .. .. > > My suggestion, not that I've had time to experiment, was to change > paravirt to use a non-C ABI and have asm_xen_write_msr() recombine > edx:eax into rsi. That way the top level wrmsr() retains sensible > codegen for native even when paravirt is active. > I have attached what should be an "obvious" example... famous last words. -hpa [-- Attachment #2: xen.S --] [-- Type: text/plain, Size: 1692 bytes --] /* * Input in %rax, MSR number in %ecx * %rdx is clobbered, as the native stub is assumed to do * * Change xen_do_write_msr to return its error code instead * of passing a pointer - this gives the extra benefit that * we can get the *actual invocation address* in the error * messages. * * ex_handler_msr() should be changed to get the MSR data * from %rax regardless of Xen vs native; alternatively the * Xen handler can set up %edx as well. * * Let the native pattern look like: * * 48 89 c2 mov %rax,%rdx * 48 c1 ea 20 shr $32,%rdx * 3e 0f 30 ds wrmsr <--- trap point * * ... which can be replaced with ... * 48 89 c2 mov %rax,%rdx * 48 c1 ea 20 shr $32,%rdx * 0f 01 c6 wrmsrns <--- trap point * * ... or ... * e8 xx xx xx xx call asm_xen_write_msr * 74 02 jz 1f * 3e 0f 0b ds ud2 <--- trap point * 1: * ds ud2 can of course be replaced with any other 3-byte trapping * instruction. * * This also removes the need for Xen to maintain different safe and * unsafe MSR routines, as the difference is handled by the same * trap handler as is used natively. */ SYM_FUNC_START(asm_xen_write_msr) FRAME_BEGIN push %rax /* Save in case of error */ push %rcx push %rsi push %rdi push %r8 push %r9 push %r10 push %r11 mov %rax,%rdi mov %ecx,%esi call xen_do_write_msr test %eax,%eax /* ZF=0 on error */ pop %r11 pop %r10 pop %r9 pop %r8 pop %rdi pop %rsi pop %rcx #ifdef OPTIONAL mov 4(%rsp),%edx #endif pop %rax FRAME_END RET SYM_FUNC_END(asm_xen_write_msr) ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 21:26 ` H. Peter Anvin @ 2024-08-16 22:27 ` Andrew Cooper 2024-08-16 22:34 ` Andrew Cooper 2024-08-16 22:59 ` H. Peter Anvin 1 sibling, 1 reply; 37+ messages in thread From: Andrew Cooper @ 2024-08-16 22:27 UTC (permalink / raw) To: H. Peter Anvin, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross On 16/08/2024 10:26 pm, H. Peter Anvin wrote: > On 8/16/24 11:40, Andrew Cooper wrote: >>> >>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>> WRMSR and WRMSRNS, what about not using segment prefix at all? >> > > You can use up to 4 prefixes of any kind (which includes opcode > prefixes before 0F) before most decoders start hurting, so we can pad > it out to 5 bytes by doing 3f 3f .. .. .. > >> >> My suggestion, not that I've had time to experiment, was to change >> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >> edx:eax into rsi. That way the top level wrmsr() retains sensible >> codegen for native even when paravirt is active. >> > > I have attached what should be an "obvious" example... famous last words. Ah, now I see what you mean about Xen's #GP semantics. That's a neat way of doing it. It means the faulting path will really take 2 faults on Xen, but it's a faulting path anyway so speed is already out of the window. Do you mind about teaching the #UD handler to deal with WRMSR like that? I ask, because I can't think of anything nicer. There are plenty of 3-byte instructions which #GP in PV guests (CPL3), and LTR is my go-to for debugging purposes, as it's not emulated by Xen. Anything here (and it can't be an actual WRMSR) will be slightly confusing to read in an OOPS, especially #UD for what is logically a #GP. But, a clear UD of some form in the disassembly is probably better than a random other instruction unrelated to the operation. ~Andrew ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 22:27 ` Andrew Cooper @ 2024-08-16 22:34 ` Andrew Cooper 2024-08-16 23:03 ` H. Peter Anvin 0 siblings, 1 reply; 37+ messages in thread From: Andrew Cooper @ 2024-08-16 22:34 UTC (permalink / raw) To: H. Peter Anvin, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross On 16/08/2024 11:27 pm, Andrew Cooper wrote: > On 16/08/2024 10:26 pm, H. Peter Anvin wrote: >> On 8/16/24 11:40, Andrew Cooper wrote: >>>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>>> WRMSR and WRMSRNS, what about not using segment prefix at all? >> You can use up to 4 prefixes of any kind (which includes opcode >> prefixes before 0F) before most decoders start hurting, so we can pad >> it out to 5 bytes by doing 3f 3f .. .. .. >> >>> My suggestion, not that I've had time to experiment, was to change >>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >>> edx:eax into rsi. That way the top level wrmsr() retains sensible >>> codegen for native even when paravirt is active. >>> >> I have attached what should be an "obvious" example... famous last words. > Ah, now I see what you mean about Xen's #GP semantics. > > That's a neat way of doing it. It means the faulting path will really > take 2 faults on Xen, but it's a faulting path anyway so speed is > already out of the window. > > Do you mind about teaching the #UD handler to deal with WRMSR like that? > > I ask, because I can't think of anything nicer. > > There are plenty of 3-byte instructions which #GP in PV guests (CPL3), > and LTR is my go-to for debugging purposes, as it's not emulated by Xen. > > Anything here (and it can't be an actual WRMSR) will be slightly > confusing to read in an OOPS, especially #UD for what is logically a #GP. > > But, a clear UD of some form in the disassembly is probably better than > a random other instruction unrelated to the operation. > > ~Andrew Oh, P.S. We can probably drop most of the register manipulation by making the new xen_do_write_msr be no_caller_saved_registers. As we're intentionally not a C ABI to start with, we might as well not spill registers we don't use either. ~Andrew ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 22:34 ` Andrew Cooper @ 2024-08-16 23:03 ` H. Peter Anvin 0 siblings, 0 replies; 37+ messages in thread From: H. Peter Anvin @ 2024-08-16 23:03 UTC (permalink / raw) To: Andrew Cooper, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc, Juergen Gross On August 16, 2024 3:34:53 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >On 16/08/2024 11:27 pm, Andrew Cooper wrote: >> On 16/08/2024 10:26 pm, H. Peter Anvin wrote: >>> On 8/16/24 11:40, Andrew Cooper wrote: >>>>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>>>> WRMSR and WRMSRNS, what about not using segment prefix at all? >>> You can use up to 4 prefixes of any kind (which includes opcode >>> prefixes before 0F) before most decoders start hurting, so we can pad >>> it out to 5 bytes by doing 3f 3f .. .. .. >>> >>>> My suggestion, not that I've had time to experiment, was to change >>>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >>>> edx:eax into rsi. That way the top level wrmsr() retains sensible >>>> codegen for native even when paravirt is active. >>>> >>> I have attached what should be an "obvious" example... famous last words. >> Ah, now I see what you mean about Xen's #GP semantics. >> >> That's a neat way of doing it. It means the faulting path will really >> take 2 faults on Xen, but it's a faulting path anyway so speed is >> already out of the window. >> >> Do you mind about teaching the #UD handler to deal with WRMSR like that? >> >> I ask, because I can't think of anything nicer. >> >> There are plenty of 3-byte instructions which #GP in PV guests (CPL3), >> and LTR is my go-to for debugging purposes, as it's not emulated by Xen. >> >> Anything here (and it can't be an actual WRMSR) will be slightly >> confusing to read in an OOPS, especially #UD for what is logically a #GP. >> >> But, a clear UD of some form in the disassembly is probably better than >> a random other instruction unrelated to the operation. >> >> ~Andrew > >Oh, P.S. > >We can probably drop most of the register manipulation by making the new >xen_do_write_msr be no_caller_saved_registers. As we're intentionally >not a C ABI to start with, we might as well not spill registers we don't >use either. > >~Andrew If you are willing to require a new enough gcc, certainly. ^ permalink raw reply [flat|nested] 37+ messages in thread
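For context, a declaration along those lines might look like the
following; the attribute is standard gcc/clang syntax (gcc >= 7), while
the prototype is the one proposed earlier in the thread rather than
existing code:

	/*
	 * Hypothetical sketch: the compiler preserves all registers inside
	 * xen_do_write_msr() itself, so the asm stub can drop most of its
	 * push/pop pairs.
	 */
	__attribute__((no_caller_saved_registers))
	bool xen_do_write_msr(uint32_t msr, uint64_t val);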
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 21:26 ` H. Peter Anvin 2024-08-16 22:27 ` Andrew Cooper @ 2024-08-16 22:59 ` H. Peter Anvin 2024-08-17 23:51 ` H. Peter Anvin 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2024-08-16 22:59 UTC (permalink / raw) To: Andrew Cooper, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc [-- Attachment #1: Type: text/plain, Size: 3193 bytes --] On 8/16/24 14:26, H. Peter Anvin wrote: > On 8/16/24 11:40, Andrew Cooper wrote: >>> >>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>> WRMSR and WRMSRNS, what about not using segment prefix at all? >> > > You can use up to 4 prefixes of any kind (which includes opcode prefixes > before 0F) before most decoders start hurting, so we can pad it out to 5 > bytes by doing 3f 3f .. .. .. > >> >> My suggestion, not that I've had time to experiment, was to change >> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine >> edx:eax into rsi. That way the top level wrmsr() retains sensible >> codegen for native even when paravirt is active. >> > > I have attached what should be an "obvious" example... famous last words. > After looking at what xen_do_write_msr looks like, I realized we can do *much* better than that. This means using two alternative sequences, one for native/Xen and the other for native wrmsr/wrmsrns. The call/jz sequence is *exactly* the same length as mov, shr, which means that xen_do_write_msr can simply return a flag if it wants the MSR access to be performed natively. An important point is that in no code path can the error code be set to -EIO except by native MSR invocation, so the only possibilities are "done successfully" or "execute natively." [That being said, that would not be *that* hard to fix if needed.] The new C prototype for xen_do_write_msr() becomes: bool xen_do_write_msr(uint32_t msr, uint64_t val) ... with a true return meaning "execute natively." (I changed the order of the arguments from my last version since it is more consistent with what is used everywhere else.) RDMSR is a bit trickier. I think the best option there is to set up a new permission trap handler type that amounts to "get the address from the stack, apply a specific offset, and invoke the fixup handler for that address: case EX_TYPE_UPLEVEL: { /* Let reg hold the unsigned number of machine * words to pop off the stack before the return * address, and imm the signed offset from the * return address to the desired trap point. * * pointer in units of machine words, and imm the * signed offset from this stack word... */ unsigned long *sp = (unsigned long *)regs->sp + reg; regs->ip = *sp++ + (int16_t)imm; regs->sp = (unsigned long)sp; goto again; /* Loop back to the beginning */ } Again, "obviously correct" code attached. -hpa NOTE: I had to dig several levels down to uncover actual situation, but pmu_msr_write() is actually a completely pointless function: all it does is shuffle some arguments, then calls pmu_msr_chk_emulated() and if it returns true AND the emulated flag is clear then does *exactly the same thing* that the calling code would have done if pmu_msr_write() itself had returned true... 
in other words, the whole function could be replaced by: bool pmu_msr_write(uint32_t msr, uint64_t val) { bool emulated; return pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated; } pmu_msr_read() does the equivalent stupidity; the obvious fix(es) to pmu_msr_chk_emulated() I hope are obvious. [-- Attachment #2: xen.S --] [-- Type: text/plain, Size: 3135 bytes --] /* * Input in %rax, MSR number in %ecx * * %edx is set up to match %rax >> 32 like the native stub * is expected to do * * Change xen_do_write_msr to return a true value if * the MSR access should be executed natively (or vice versa, * if you prefer.) * * bool xen_do_write_msr(uint32_t msr, uint64_t value) * * Let the native pattern look like: * * 48 89 c2 mov %rax,%rdx * 48 c1 ea 20 shr $32,%rdx * 3e 0f 30 ds wrmsr <--- trap point * * ... which can be replaced with ... * 48 89 c2 mov %rax,%rdx * 48 c1 ea 20 shr $32,%rdx * 0f 01 c6 wrmsrns <--- trap point * * FOR XEN, replace the FIRST SEVEN BYTES with: * e8 xx xx xx xx call asm_xen_write_msr * 74 03 jz .+5 * * If ZF=0 then this will fall down to the actual native WRMSR[NS] * instruction. * * This also removes the need for Xen to maintain different safe and * unsafe MSR routines, as the difference is handled by the same * trap handler as is used natively. */ SYM_FUNC_START(asm_xen_write_msr) FRAME_BEGIN push %rax /* Save in case of native fallback */ push %rcx push %rsi push %rdi push %r8 push %r9 push %r10 push %r11 mov %ecx,%edi /* MSR number */ mov %rax,%rsi /* MSR data */ call xen_do_write_msr test %al,%al /* ZF=1 means we are done */ pop %r11 pop %r10 pop %r9 pop %r8 pop %rdi pop %rsi pop %rcx mov 4(%rsp),%edx /* Set up %edx for native execution */ pop %rax FRAME_END RET SYM_FUNC_END(asm_xen_write_msr) /* * RDMSR code. It isn't quite as clean; it requires a new trap handler * type: * * case EX_TYPE_UPLEVEL: { * /* Let reg hold the unsigned number of machine * * words to pop off the stack before the return * * address, and imm the signed offset from the * * return address to the desired trap point. * * * * pointer in units of machine words, and imm the * * signed offset from this stack word... * * / * unsigned long *sp = (unsigned long *)regs->sp + reg; * regs->ip = *sp++ + (int16_t)imm; * regs->sp = (unsigned long)sp; * goto again; /* Loop back to the beginning * / * } * * The prototype of the Xen C code: * struct { uint64_t val, bool native } xen_do_read_msr(uint32_t msr) * * Native code: * 0f 32 rdmsr <--- trap point * 48 c1 e2 20 shl $0x20,%rdx * 48 09 d0 or %rdx,%rax * * Xen code (cs rex is just for padding) * 2e 40 e8 xx xx xx xx cs rex call asm_xen_read_msr */ SYM_FUNC_START(asm_xen_read_msr) FRAME_BEGIN push %rcx push %rsi push %rdi push %r8 push %r9 push %r10 push %r11 mov %ecx,%edi /* MSR number */ call xen_do_read_msr test %dl,%dl /* ZF=1 means we are done */ pop %r11 pop %r10 pop %r9 pop %r8 pop %rdi pop %rsi pop %rcx jz 2f 1: rdmsr _ASM_EXTABLE_TYPE(1b, 2f, \ EX_TYPE_UPLEVEL|EX_DATA_IMM(-7)|EX_DATA_REG(0)) shl $32,%rdx or %rdx,%rax /* * The top of the stack points directly at the return address; * back up by 7 bytes from the return address. */ 2: FRAME_END RET SYM_FUNC_END(asm_xen_read_msr) ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 22:59 ` H. Peter Anvin @ 2024-08-17 23:51 ` H. Peter Anvin 0 siblings, 0 replies; 37+ messages in thread From: H. Peter Anvin @ 2024-08-17 23:51 UTC (permalink / raw) To: Andrew Cooper, Xin Li, linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, peterz, seanjc [-- Attachment #1: Type: text/plain, Size: 1188 bytes --] On 8/16/24 15:59, H. Peter Anvin wrote: > > RDMSR is a bit trickier. I think the best option there is to set up a > new trap fixup handler type that amounts to "get the address from > the stack, apply a specific offset, and invoke the fixup handler for > that address: > > > case EX_TYPE_UPLEVEL: { > /* Let reg hold the unsigned number of machine > * words to pop off the stack before the return > * address, and imm the signed offset from the > * return address to the desired trap point. > * > * pointer in units of machine words, and imm the > * signed offset from this stack word... > */ > unsigned long *sp = (unsigned long *)regs->sp + reg; > regs->ip = *sp++ + (int16_t)imm; > regs->sp = (unsigned long)sp; > goto again; /* Loop back to the beginning */ > } > > Again, "obviously correct" code attached. > Here is an untested patch implemented the above functionality, tidied up and wrapped in a macro as _ASM_EXTABLE_FUNC_REWIND(). -hpa [-- Attachment #2: 0001-x86-extable-implement-EX_TYPE_FUNC_REWIND.patch --] [-- Type: text/x-patch, Size: 6901 bytes --] From 35f99255f2bc0174aba90c44b3e70415f0658257 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" <hpa@zytor.com> Date: Sat, 17 Aug 2024 16:45:24 -0700 Subject: [PATCH] x86/extable: implement EX_TYPE_FUNC_REWIND Add a new exception type, which allows emulating an exception as if it had happened at or near the call site of a function. This allows a function call inside an alternative for instruction emulation to "kick back" the exception into the alternatives pattern, possibly invoking a different exception handling pattern there, or at least indicating the "real" location of the fault. Untested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: H. 
Peter Anvin (Intel) <hpa@zytor.com> --- arch/x86/include/asm/asm.h | 6 + arch/x86/include/asm/extable_fixup_types.h | 1 + arch/x86/mm/extable.c | 136 +++++++++++++-------- 3 files changed, 92 insertions(+), 51 deletions(-) diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 2bec0c89a95c..7398261b0f4a 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -232,5 +232,11 @@ register unsigned long current_stack_pointer asm(_ASM_SP); #define _ASM_EXTABLE_FAULT(from, to) \ _ASM_EXTABLE_TYPE(from, to, EX_TYPE_FAULT) +#define _ASM_EXTABLE_FUNC_REWIND(from, ipdelta, spdelta) \ + _ASM_EXTABLE_TYPE(from, from /* unused */, \ + EX_TYPE_FUNC_REWIND | \ + EX_DATA_REG(spdelta) | \ + EX_DATA_IMM(ipdelta)) + #endif /* __KERNEL__ */ #endif /* _ASM_X86_ASM_H */ diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h index 906b0d5541e8..2eed7f39893f 100644 --- a/arch/x86/include/asm/extable_fixup_types.h +++ b/arch/x86/include/asm/extable_fixup_types.h @@ -67,5 +67,6 @@ #define EX_TYPE_ZEROPAD 20 /* longword load with zeropad on fault */ #define EX_TYPE_ERETU 21 +#define EX_TYPE_FUNC_REWIND 22 #endif diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c index 51986e8a9d35..076f1492b935 100644 --- a/arch/x86/mm/extable.c +++ b/arch/x86/mm/extable.c @@ -290,6 +290,28 @@ static bool ex_handler_eretu(const struct exception_table_entry *fixup, } #endif +/* + * Emulate a fault taken at the call site of a function. + * The combined reg and flags field are used as an unsigned + * number of machine words to pop off the stack before the + * return address, then the signed imm field is used as a delta + * from the return IP address. + */ +static bool ex_handler_func_rewind(struct pt_regs *regs, int data) +{ + const long ipdelta = FIELD_GET(EX_DATA_IMM_MASK, data); + const unsigned long pops = + FIELD_GET(EX_DATA_REG_MASK|EX_DATA_FLAG_MASK, data); + unsigned long *sp; + + sp = (unsigned long *)regs->sp; + sp += pops; + regs->ip = *sp++ + ipdelta; + regs->sp = (unsigned long)sp; + + return true; +} + int ex_get_fixup_type(unsigned long ip) { const struct exception_table_entry *e = search_exception_tables(ip); @@ -302,6 +324,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code, { const struct exception_table_entry *e; int type, reg, imm; + bool again; #ifdef CONFIG_PNPBIOS if (unlikely(SEGMENT_IS_PNP_CODE(regs->cs))) { @@ -317,60 +340,71 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code, } #endif - e = search_exception_tables(regs->ip); - if (!e) - return 0; - - type = FIELD_GET(EX_DATA_TYPE_MASK, e->data); - reg = FIELD_GET(EX_DATA_REG_MASK, e->data); - imm = FIELD_GET(EX_DATA_IMM_MASK, e->data); - - switch (type) { - case EX_TYPE_DEFAULT: - case EX_TYPE_DEFAULT_MCE_SAFE: - return ex_handler_default(e, regs); - case EX_TYPE_FAULT: - case EX_TYPE_FAULT_MCE_SAFE: - return ex_handler_fault(e, regs, trapnr); - case EX_TYPE_UACCESS: - return ex_handler_uaccess(e, regs, trapnr, fault_addr); - case EX_TYPE_CLEAR_FS: - return ex_handler_clear_fs(e, regs); - case EX_TYPE_FPU_RESTORE: - return ex_handler_fprestore(e, regs); - case EX_TYPE_BPF: - return ex_handler_bpf(e, regs); - case EX_TYPE_WRMSR: - return ex_handler_msr(e, regs, true, false, reg); - case EX_TYPE_RDMSR: - return ex_handler_msr(e, regs, false, false, reg); - case EX_TYPE_WRMSR_SAFE: - return ex_handler_msr(e, regs, true, true, reg); - case EX_TYPE_RDMSR_SAFE: - return ex_handler_msr(e, regs, false, true, reg); 
- case EX_TYPE_WRMSR_IN_MCE: - ex_handler_msr_mce(regs, true); - break; - case EX_TYPE_RDMSR_IN_MCE: - ex_handler_msr_mce(regs, false); - break; - case EX_TYPE_POP_REG: - regs->sp += sizeof(long); - fallthrough; - case EX_TYPE_IMM_REG: - return ex_handler_imm_reg(e, regs, reg, imm); - case EX_TYPE_FAULT_SGX: - return ex_handler_sgx(e, regs, trapnr); - case EX_TYPE_UCOPY_LEN: - return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm); - case EX_TYPE_ZEROPAD: - return ex_handler_zeropad(e, regs, fault_addr); + do { + e = search_exception_tables(regs->ip); + if (!e) + return 0; + + again = false; + + type = FIELD_GET(EX_DATA_TYPE_MASK, e->data); + reg = FIELD_GET(EX_DATA_REG_MASK, e->data); + imm = FIELD_GET(EX_DATA_IMM_MASK, e->data); + + switch (type) { + case EX_TYPE_DEFAULT: + case EX_TYPE_DEFAULT_MCE_SAFE: + return ex_handler_default(e, regs); + case EX_TYPE_FAULT: + case EX_TYPE_FAULT_MCE_SAFE: + return ex_handler_fault(e, regs, trapnr); + case EX_TYPE_UACCESS: + return ex_handler_uaccess(e, regs, trapnr, fault_addr); + case EX_TYPE_CLEAR_FS: + return ex_handler_clear_fs(e, regs); + case EX_TYPE_FPU_RESTORE: + return ex_handler_fprestore(e, regs); + case EX_TYPE_BPF: + return ex_handler_bpf(e, regs); + case EX_TYPE_WRMSR: + return ex_handler_msr(e, regs, true, false, reg); + case EX_TYPE_RDMSR: + return ex_handler_msr(e, regs, false, false, reg); + case EX_TYPE_WRMSR_SAFE: + return ex_handler_msr(e, regs, true, true, reg); + case EX_TYPE_RDMSR_SAFE: + return ex_handler_msr(e, regs, false, true, reg); + case EX_TYPE_WRMSR_IN_MCE: + ex_handler_msr_mce(regs, true); + break; + case EX_TYPE_RDMSR_IN_MCE: + ex_handler_msr_mce(regs, false); + break; + case EX_TYPE_POP_REG: + regs->sp += sizeof(long); + fallthrough; + case EX_TYPE_IMM_REG: + return ex_handler_imm_reg(e, regs, reg, imm); + case EX_TYPE_FAULT_SGX: + return ex_handler_sgx(e, regs, trapnr); + case EX_TYPE_UCOPY_LEN: + return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm); + case EX_TYPE_ZEROPAD: + return ex_handler_zeropad(e, regs, fault_addr); #ifdef CONFIG_X86_FRED - case EX_TYPE_ERETU: - return ex_handler_eretu(e, regs, error_code); + case EX_TYPE_ERETU: + return ex_handler_eretu(e, regs, error_code); #endif - } + case EX_TYPE_FUNC_REWIND: + again = ex_handler_func_rewind(regs, e->data); + break; + default: + break; /* Will BUG() */ + } + } while (again); + BUG(); + return 0; } extern unsigned int early_recursion_flag; -- 2.46.0 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-16 17:52 ` Xin Li 2024-08-16 18:40 ` Andrew Cooper @ 2024-08-17 14:23 ` Borislav Petkov 2024-08-17 14:43 ` Borislav Petkov 2024-08-17 19:22 ` H. Peter Anvin 1 sibling, 2 replies; 37+ messages in thread From: Borislav Petkov @ 2024-08-17 14:23 UTC (permalink / raw) To: Xin Li, Andrew Cooper, linux-kernel Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote: >On 8/9/2024 4:07 PM, Andrew Cooper wrote: >> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>> + >>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ >>> static __always_inline void wrmsrns(u32 msr, u64 val) >>> { >>> - __wrmsrns(msr, val, val >> 32); >>> + /* >>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant >>> + * DS prefix to avoid a trailing NOP. >>> + */ >>> + asm volatile("1: " >>> + ALTERNATIVE("ds wrmsr", >> >> This isn't the version I presented, and there's no discussion of the >> alteration. > >I'm trying to implement wrmsr() as > >static __always_inline void wrmsr(u32 msr, u64 val) >{ > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, > "call asm_xen_write_msr", X86_FEATURE_XENPV) > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), > "D" (msr), "S" (val)); >} > > >As the CALL instruction is 5-byte long, and we need to pad nop for both >WRMSR and WRMSRNS, what about not using segment prefix at all? The alternatives macros can handle arbitrary insn sizes - no need for any padding. -- Sent from a small device: formatting sucks and brevity is inevitable. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-17 14:23 ` Borislav Petkov @ 2024-08-17 14:43 ` Borislav Petkov 2024-08-17 15:44 ` Xin Li 2024-08-17 19:22 ` H. Peter Anvin 2024-08-17 19:22 ` H. Peter Anvin 1 sibling, 2 replies; 37+ messages in thread From: Borislav Petkov @ 2024-08-17 14:43 UTC (permalink / raw) To: Xin Li, Andrew Cooper, linux-kernel Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote: > >static __always_inline void wrmsr(u32 msr, u64 val) > >{ > > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, > > "call asm_xen_write_msr", X86_FEATURE_XENPV) > > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) > > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), > > "D" (msr), "S" (val)); > >} > > > > > >As the CALL instruction is 5-byte long, and we need to pad nop for both > >WRMSR and WRMSRNS, what about not using segment prefix at all? > > The alternatives macros can handle arbitrary insn sizes - no need for any padding. What you cannot do is have a CALL insn in only one of the alternative insn groups because the compiler is free to clobber callee regs and those might be live where you place the alternative and thus will have a nice lovely corruption. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-17 14:43 ` Borislav Petkov @ 2024-08-17 15:44 ` Xin Li 2024-08-17 19:22 ` H. Peter Anvin 1 sibling, 0 replies; 37+ messages in thread From: Xin Li @ 2024-08-17 15:44 UTC (permalink / raw) To: Borislav Petkov, Andrew Cooper, linux-kernel Cc: tglx, mingo, dave.hansen, x86, hpa, peterz, seanjc On 8/17/2024 7:43 AM, Borislav Petkov wrote: > On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote: >>> static __always_inline void wrmsr(u32 msr, u64 val) >>> { >>> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, >>> "call asm_xen_write_msr", X86_FEATURE_XENPV) >>> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >>> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), >>> "D" (msr), "S" (val)); >>> } >>> >>> >>> As the CALL instruction is 5-byte long, and we need to pad nop for both >>> WRMSR and WRMSRNS, what about not using segment prefix at all? >> >> The alternatives macros can handle arbitrary insn sizes - no need for any padding. > > What you cannot do is have a CALL insn in only one of the alternative > insn groups because the compiler is free to clobber callee regs and > those might be live where you place the alternative and thus will have > a nice lovely corruption. Yeah, asm_xen_write_msr() is a wrapper function to xen_write_msr() which saves/restores those regs. My first draft patch calls xen_write_msr() directly and works fine. But hpa said the same thing as you said. Thanks! Xin ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-17 14:43 ` Borislav Petkov 2024-08-17 15:44 ` Xin Li @ 2024-08-17 19:22 ` H. Peter Anvin 2024-08-18 5:59 ` Borislav Petkov 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2024-08-17 19:22 UTC (permalink / raw) To: Borislav Petkov, Xin Li, Andrew Cooper, linux-kernel Cc: tglx, mingo, dave.hansen, x86, peterz, seanjc On August 17, 2024 7:43:16 AM PDT, Borislav Petkov <bp@alien8.de> wrote: >On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote: >> >static __always_inline void wrmsr(u32 msr, u64 val) >> >{ >> > asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, >> > "call asm_xen_write_msr", X86_FEATURE_XENPV) >> > "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >> > : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), >> > "D" (msr), "S" (val)); >> >} >> > >> > >> >As the CALL instruction is 5-byte long, and we need to pad nop for both >> >WRMSR and WRMSRNS, what about not using segment prefix at all? >> >> The alternatives macros can handle arbitrary insn sizes - no need for any padding. > >What you cannot do is have a CALL insn in only one of the alternative >insn groups because the compiler is free to clobber callee regs and >those might be live where you place the alternative and thus will have >a nice lovely corruption. > Only if the call goes to a C function. ^ permalink raw reply [flat|nested] 37+ messages in thread
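To make the distinction concrete, a hedged illustration follows; neither fragment is from a posted patch. The point is that the compiler cannot see a CALL hidden inside an alternative, so the call target must preserve every register a call may clobber, as the asm_xen_write_msr stub earlier in the thread does. The helper names are invented for this sketch.

/*
 * Illustration only.  Calling a normal C function from inside an
 * alternative is unsafe: the compiler may keep live values in
 * caller-saved registers (%rdi, %rsi, %r8-%r11, ...) across the asm
 * statement, and the callee is free to clobber them.
 */
static __always_inline void wrmsr_unsafe_sketch(u32 msr, u64 val)
{
	asm volatile(ALTERNATIVE("wrmsr", "call xen_do_write_msr",
				 X86_FEATURE_XENPV)
		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)));
}

/*
 * Calling a hand-written asm stub that saves and restores every
 * call-clobbered register (like asm_xen_write_msr above) avoids the
 * problem, because nothing the compiler relies on can change.
 */
static __always_inline void wrmsr_thunked_sketch(u32 msr, u64 val)
{
	asm volatile(ALTERNATIVE("wrmsr", "call asm_xen_write_msr",
				 X86_FEATURE_XENPV)
		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)));
}
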
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
  2024-08-17 19:22                 ` H. Peter Anvin
@ 2024-08-18  5:59                   ` Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2024-08-18 5:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
	x86, peterz, seanjc

On Sat, Aug 17, 2024 at 12:22:56PM -0700, H. Peter Anvin wrote:
> Only if the call goes to a C function.

Aha, you have a patch in the thread which calls a by-hand asm thing.
I haven't looked in detail but we have a THUNK macro for such stuff.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism 2024-08-17 14:23 ` Borislav Petkov 2024-08-17 14:43 ` Borislav Petkov @ 2024-08-17 19:22 ` H. Peter Anvin 2024-08-18 5:49 ` Borislav Petkov 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2024-08-17 19:22 UTC (permalink / raw) To: Borislav Petkov, Xin Li, Andrew Cooper, linux-kernel Cc: tglx, mingo, dave.hansen, x86, peterz, seanjc On August 17, 2024 7:23:07 AM PDT, Borislav Petkov <bp@alien8.de> wrote: >On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote: >>On 8/9/2024 4:07 PM, Andrew Cooper wrote: >>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote: >>>> From: Andrew Cooper <andrew.cooper3@citrix.com> >>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ >>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) >>>> + >>>> +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ >>>> static __always_inline void wrmsrns(u32 msr, u64 val) >>>> { >>>> - __wrmsrns(msr, val, val >> 32); >>>> + /* >>>> + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant >>>> + * DS prefix to avoid a trailing NOP. >>>> + */ >>>> + asm volatile("1: " >>>> + ALTERNATIVE("ds wrmsr", >>> >>> This isn't the version I presented, and there's no discussion of the >>> alteration. >> >>I'm trying to implement wrmsr() as >> >>static __always_inline void wrmsr(u32 msr, u64 val) >>{ >> asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS, >> "call asm_xen_write_msr", X86_FEATURE_XENPV) >> "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) >> : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)), >> "D" (msr), "S" (val)); >>} >> >> >>As the CALL instruction is 5-byte long, and we need to pad nop for both >>WRMSR and WRMSRNS, what about not using segment prefix at all? > >The alternatives macros can handle arbitrary insn sizes - no need for any padding. > > The padding avoids unnecessary NOPs. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
  2024-08-17 19:22                 ` H. Peter Anvin
@ 2024-08-18  5:49                   ` Borislav Petkov
  2024-08-18  6:16                     ` H. Peter Anvin
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2024-08-18 5:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
	x86, peterz, seanjc

On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
> The padding avoids unnecessary NOPs.

And a NOP vs some other prefix to have one less insn matters practically
how exactly here?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
  2024-08-18  5:49                   ` Borislav Petkov
@ 2024-08-18  6:16                     ` H. Peter Anvin
  0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-18 6:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Xin Li, Andrew Cooper, linux-kernel, tglx, mingo, dave.hansen,
	x86, peterz, seanjc

On 8/17/24 22:49, Borislav Petkov wrote:
> On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
>> The padding avoids unnecessary NOPs.
> 
> And a NOP vs some other prefix to have one less insn matters practically
> how exactly here?

Pretty much nothing, but it is zero cost and possibly some insanely
small gain.

	-hpa

^ permalink raw reply	[flat|nested] 37+ messages in thread
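For reference, a minimal sketch of the non-Xen shape being debated in this sub-thread, assuming the opcode bytes and the EX_TYPE_WRMSR fixup quoted earlier: with a redundant DS prefix both encodings are three bytes, so neither side of the alternative needs NOP padding. The helper name is invented and this is not the posted patch.

/* Sketch only. */
static __always_inline void wrmsrns_sketch(u32 msr, u64 val)
{
	/*
	 * "ds wrmsr"                 -> 3e 0f 30  (3 bytes)
	 * wrmsrns (spelled as bytes) -> 0f 01 c6  (3 bytes)
	 * Equal lengths mean the alternative patches in place without
	 * any trailing NOP.
	 */
	asm volatile("1: "
		     ALTERNATIVE("ds wrmsr",
				 _ASM_BYTES(0x0f,0x01,0xc6), /* wrmsrns */
				 X86_FEATURE_WRMSRNS)
		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)));
}
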
* [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch 2024-08-07 5:47 [PATCH v1 0/3] x86: Write FRED RSP0 on return to userspace Xin Li (Intel) 2024-08-07 5:47 ` [PATCH v1 1/3] x86/entry: Test ti_work for zero before processing individual bits Xin Li (Intel) 2024-08-07 5:47 ` [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism Xin Li (Intel) @ 2024-08-07 5:47 ` Xin Li (Intel) 2024-08-07 18:39 ` Thomas Gleixner 2024-08-09 10:45 ` Nikolay Borisov 2 siblings, 2 replies; 37+ messages in thread From: Xin Li (Intel) @ 2024-08-07 5:47 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc FRED RSP0 is a per task constant pointing to top of its kernel stack for user level event delivery, and needs to be updated when a task is scheduled in. Introduce a new TI flag TIF_LOAD_USER_STATES to track whether FRED RSP0 needs to be loaded, and do the actual load of FRED RSP0 in arch_exit_to_user_mode_prepare() if the TI flag is set, thus to avoid a fair number of WRMSRs in both KVM and the kernel. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> --- arch/x86/include/asm/entry-common.h | 5 +++++ arch/x86/include/asm/switch_to.h | 3 +-- arch/x86/include/asm/thread_info.h | 2 ++ arch/x86/kernel/cpu/cpuid-deps.c | 1 - 4 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 4c78b99060b5..ae365579efb3 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, if (ti_work & _TIF_USER_RETURN_NOTIFY) fire_user_return_notifiers(); + if (cpu_feature_enabled(X86_FEATURE_FRED) && + (ti_work & _TIF_LOAD_USER_STATES)) + wrmsrns(MSR_IA32_FRED_RSP0, + (unsigned long)task_stack_page(current) + THREAD_SIZE); + if (unlikely(ti_work & _TIF_IO_BITMAP)) tss_update_io_bitmap(); diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index c3bd0c0758c9..a31ea544cc0e 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task) this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0); #else if (cpu_feature_enabled(X86_FEATURE_FRED)) { - /* WRMSRNS is a baseline feature for FRED. */ - wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE); + set_thread_flag(TIF_LOAD_USER_STATES); } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) { /* Xen PV enters the kernel on the thread stack. 
*/ load_sp0(task_top_of_stack(task)); diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 12da7dfd5ef1..fb51904651c0 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -106,6 +106,7 @@ struct thread_info { #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */ #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */ #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */ +#define TIF_LOAD_USER_STATES 30 /* Load user level states */ #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) @@ -128,6 +129,7 @@ struct thread_info { #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP) #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES) #define _TIF_ADDR32 (1 << TIF_ADDR32) +#define _TIF_LOAD_USER_STATES (1 << TIF_LOAD_USER_STATES) /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c index b7d9f530ae16..8bd84114c2d9 100644 --- a/arch/x86/kernel/cpu/cpuid-deps.c +++ b/arch/x86/kernel/cpu/cpuid-deps.c @@ -83,7 +83,6 @@ static const struct cpuid_dep cpuid_deps[] = { { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD }, { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES }, { X86_FEATURE_FRED, X86_FEATURE_LKGS }, - { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS }, {} }; -- 2.45.2 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
  2024-08-07  5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel)
@ 2024-08-07 18:39   ` Thomas Gleixner
  2024-08-07 21:55     ` Sean Christopherson
  2024-08-09 10:45   ` Nikolay Borisov
  2 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 18:39 UTC (permalink / raw)
  To: Xin Li (Intel), linux-kernel
  Cc: mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc

On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
>  	if (ti_work & _TIF_USER_RETURN_NOTIFY)
>  		fire_user_return_notifiers();
>  
> +	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> +	    (ti_work & _TIF_LOAD_USER_STATES))
> +		wrmsrns(MSR_IA32_FRED_RSP0,
> +			(unsigned long)task_stack_page(current) + THREAD_SIZE);

Eyes are bleeding.

If you do the inline in patch 1 then this becomes

	if (cpu_feature_enabled(X86_FEATURE_FRED) && (ti_work & _TIF_LOAD_USER_STATES))
		wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(current) + THREAD_SIZE);

which is in the 100 character limit and readable.

Though I wonder if this should not have a per CPU cache for that.

	if (cpu_feature_enabled(X86_FEATURE_FRED) {
		unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;

		if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
			wrmsrns(MSR_IA32_FRED_RSP0, fred_rsp0);
			this_cpu_write(cpu_fred_rsp0, rsp0);
		}
	}

Hmm?

>  	if (unlikely(ti_work & _TIF_IO_BITMAP))
>  		tss_update_io_bitmap();
>  
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index c3bd0c0758c9..a31ea544cc0e 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task)
>  	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
>  #else
>  	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> -		/* WRMSRNS is a baseline feature for FRED. */

This comment is wrong when using the alternative WRMSR fallback and
should be remove by the alternatives patch.

> -		wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
> +		set_thread_flag(TIF_LOAD_USER_STATES);

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread
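Fleshed out slightly, the suggested cache might look like the sketch below. The per-CPU variable name cpu_fred_rsp0 follows the snippet above and does not exist in the posted series, and the stray fred_rsp0 identifier from that snippet is read as rsp0 here. Compared with an unconditional write, the MSR is then only touched when the value actually changes, e.g. after switching to a different task.

/* Sketch of the per-CPU cache suggested above; names are illustrative. */
DEFINE_PER_CPU(unsigned long, cpu_fred_rsp0);

static __always_inline void fred_update_rsp0_sketch(void)
{
	unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

	if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
		wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
		__this_cpu_write(cpu_fred_rsp0, rsp0);
	}
}
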
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
  2024-08-07 18:39   ` Thomas Gleixner
@ 2024-08-07 21:55     ` Sean Christopherson
  2024-08-07 23:04       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2024-08-07 21:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Xin Li (Intel), linux-kernel, mingo, bp, dave.hansen, x86, hpa,
	peterz, andrew.cooper3

On Wed, Aug 07, 2024, Thomas Gleixner wrote:
> Though I wonder if this should not have a per CPU cache for that.
> 
> 	if (cpu_feature_enabled(X86_FEATURE_FRED) {
> 		unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;
> 
> 		if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
> 			wrmsrns(MSR_IA32_FRED_RSP0, fred_rsp0);
> 			this_cpu_write(cpu_fred_rsp0, rsp0);
> 		}
> 	}
> 
> Hmm?

A per-CPU cache would work for KVM too, KVM would just need to stuff the cache
with the guest's value instead of setting _TIF_LOAD_USER_STATES.

^ permalink raw reply	[flat|nested] 37+ messages in thread
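A hedged sketch of that idea: after running the guest with the guest's RSP0 value loaded, KVM would record that value in the hypothetical per-CPU cache instead of setting the TIF flag, so the next exit to userspace sees a mismatch and lazily restores the host value. The function name and argument below are invented for illustration.

/* Illustration only; not KVM code from this series. */
static void kvm_put_fred_rsp0_sketch(u64 guest_rsp0)
{
	/*
	 * The MSR currently holds the guest's value; remember that in the
	 * cache so the exit-to-user path rewrites it only if it differs
	 * from the host task's RSP0.
	 */
	__this_cpu_write(cpu_fred_rsp0, guest_rsp0);
}
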
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
  2024-08-07 21:55     ` Sean Christopherson
@ 2024-08-07 23:04       ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2024-08-07 23:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xin Li (Intel), linux-kernel, mingo, bp, dave.hansen, x86, hpa,
	peterz, andrew.cooper3

On Wed, Aug 07 2024 at 14:55, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Thomas Gleixner wrote:
>> Though I wonder if this should not have a per CPU cache for that.
>>
>> 	if (cpu_feature_enabled(X86_FEATURE_FRED) {
>> 		unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE;
>>
>> 		if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
>> 			wrmsrns(MSR_IA32_FRED_RSP0, fred_rsp0);
>> 			this_cpu_write(cpu_fred_rsp0, rsp0);
>> 		}
>> 	}
>>
>> Hmm?
>
> A per-CPU cache would work for KVM too, KVM would just need to stuff the cache
> with the guest's value instead of setting _TIF_LOAD_USER_STATES.

There are two options vs. the cache:

1) Use a cache only

   static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
						     unsigned long ti_work)
   {
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work))
		arch_exit_work(ti_work);

	fred_rsp0_magic();

2) Use both the TIF bit and the cache

   static inline void arch_exit_work(unsigned long ti_work)
   {
	if (....)
		...

	fred_rsp0_magic();
   }

#1 has the charm that it avoids arch_exit_work() completely when ti_work
   is 0, but has the unconditional check on the per CPU cache variable
   and the extra conditional on every return

#2 has the charm that it avoids touching the per CPU cache variable on
   return when the TIF bit is not set, but comes with the downside of
   going through the ti_work chain to update RSP0

I'm inclined to claim that #1 wins. Why?

The probability for a RSP0 update is higher than the probability for
ti_work != 0, and for syscall heavy workloads the extra per CPU read and
the conditional is not really relevant when we stick this into struct
pcpu_hot. That cacheline is hot anyway in that code path due to
'current' and other members there.

But that's a problem for micro benchmark experts to figure out. I'm not
one of them :)

In any case the cache should be a win over an unconditionl MSR write
independent of MRMSRNS.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread
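Put together, option #1 might look roughly like the sketch below; arch_exit_work() and cpu_fred_rsp0 are the names used in the sketches above, not names from the posted patches, and the tail of the function is elided. Option #2 would instead keep the TIF bit and do the same cached write from inside arch_exit_work().

/* Sketch of option #1: cache only, no TIF bit. */
static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
						  unsigned long ti_work)
{
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work))
		arch_exit_work(ti_work);

	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
		unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

		if (__this_cpu_read(cpu_fred_rsp0) != rsp0) {
			wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
			__this_cpu_write(cpu_fred_rsp0, rsp0);
		}
	}

	/* ... existing CONFIG_COMPAT handling and debug checks continue here ... */
}
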
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch 2024-08-07 5:47 ` [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch Xin Li (Intel) 2024-08-07 18:39 ` Thomas Gleixner @ 2024-08-09 10:45 ` Nikolay Borisov 2024-08-09 17:37 ` Xin Li 1 sibling, 1 reply; 37+ messages in thread From: Nikolay Borisov @ 2024-08-09 10:45 UTC (permalink / raw) To: Xin Li (Intel), linux-kernel Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3, seanjc On 7.08.24 г. 8:47 ч., Xin Li (Intel) wrote: > FRED RSP0 is a per task constant pointing to top of its kernel stack > for user level event delivery, and needs to be updated when a task is > scheduled in. > > Introduce a new TI flag TIF_LOAD_USER_STATES to track whether FRED RSP0 > needs to be loaded, and do the actual load of FRED RSP0 in > arch_exit_to_user_mode_prepare() if the TI flag is set, thus to avoid a > fair number of WRMSRs in both KVM and the kernel. > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Xin Li (Intel) <xin@zytor.com> > --- > arch/x86/include/asm/entry-common.h | 5 +++++ > arch/x86/include/asm/switch_to.h | 3 +-- > arch/x86/include/asm/thread_info.h | 2 ++ > arch/x86/kernel/cpu/cpuid-deps.c | 1 - > 4 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h > index 4c78b99060b5..ae365579efb3 100644 > --- a/arch/x86/include/asm/entry-common.h > +++ b/arch/x86/include/asm/entry-common.h > @@ -51,6 +51,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, > if (ti_work & _TIF_USER_RETURN_NOTIFY) > fire_user_return_notifiers(); > > + if (cpu_feature_enabled(X86_FEATURE_FRED) && > + (ti_work & _TIF_LOAD_USER_STATES)) > + wrmsrns(MSR_IA32_FRED_RSP0, > + (unsigned long)task_stack_page(current) + THREAD_SIZE); > + > if (unlikely(ti_work & _TIF_IO_BITMAP)) > tss_update_io_bitmap(); > > diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h > index c3bd0c0758c9..a31ea544cc0e 100644 > --- a/arch/x86/include/asm/switch_to.h > +++ b/arch/x86/include/asm/switch_to.h > @@ -71,8 +71,7 @@ static inline void update_task_stack(struct task_struct *task) > this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0); > #else > if (cpu_feature_enabled(X86_FEATURE_FRED)) { > - /* WRMSRNS is a baseline feature for FRED. */ > - wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE); > + set_thread_flag(TIF_LOAD_USER_STATES); > } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) { > /* Xen PV enters the kernel on the thread stack. */ > load_sp0(task_top_of_stack(task)); > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h > index 12da7dfd5ef1..fb51904651c0 100644 > --- a/arch/x86/include/asm/thread_info.h > +++ b/arch/x86/include/asm/thread_info.h > @@ -106,6 +106,7 @@ struct thread_info { > #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */ > #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */ > #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */ > +#define TIF_LOAD_USER_STATES 30 /* Load user level states */ Wouldn't something along the l ines of TIF_LOAD_FRED_RSP be more descriptive, or it's expected that this flag can cover more state in the future? 
> > #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) > #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) > @@ -128,6 +129,7 @@ struct thread_info { > #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP) > #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES) > #define _TIF_ADDR32 (1 << TIF_ADDR32) > +#define _TIF_LOAD_USER_STATES (1 << TIF_LOAD_USER_STATES) > > /* flags to check in __switch_to() */ > #define _TIF_WORK_CTXSW_BASE \ > diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c > index b7d9f530ae16..8bd84114c2d9 100644 > --- a/arch/x86/kernel/cpu/cpuid-deps.c > +++ b/arch/x86/kernel/cpu/cpuid-deps.c > @@ -83,7 +83,6 @@ static const struct cpuid_dep cpuid_deps[] = { > { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD }, > { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES }, > { X86_FEATURE_FRED, X86_FEATURE_LKGS }, > - { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS }, > {} > }; > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
  2024-08-09 10:45   ` Nikolay Borisov
@ 2024-08-09 17:37     ` Xin Li
  2024-08-09 22:43       ` H. Peter Anvin
  0 siblings, 1 reply; 37+ messages in thread
From: Xin Li @ 2024-08-09 17:37 UTC (permalink / raw)
  To: Nikolay Borisov, linux-kernel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, andrew.cooper3,
	seanjc

On 8/9/2024 3:45 AM, Nikolay Borisov wrote:
>> +#define TIF_LOAD_USER_STATES	30	/* Load user level states */
> 
> Wouldn't something along the l ines of TIF_LOAD_FRED_RSP be more
> descriptive, or it's expected that this flag can cover more state in the
> future?

Sean mentioned TIF_LOAD_FRED_RSP, however we also have FRED SSP0, which
needs to be handled similarly.

And we also want to use it to cover future states.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH v1 3/3] x86/entry: Set FRED RSP0 on return to userspace instead of context switch
  2024-08-09 17:37     ` Xin Li
@ 2024-08-09 22:43       ` H. Peter Anvin
  0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2024-08-09 22:43 UTC (permalink / raw)
  To: Xin Li, Nikolay Borisov, linux-kernel
  Cc: tglx, mingo, bp, dave.hansen, x86, peterz, andrew.cooper3, seanjc

On August 9, 2024 10:37:08 AM PDT, Xin Li <xin@zytor.com> wrote:
>On 8/9/2024 3:45 AM, Nikolay Borisov wrote:
>>> +#define TIF_LOAD_USER_STATES	30	/* Load user level states */
>>
>> Wouldn't something along the l ines of TIF_LOAD_FRED_RSP be more descriptive, or it's expected that this flag can cover more state in the future?
>
>Sean mentioned TIF_LOAD_FRED_RSP, however we also have FRED SSP0, which
>needs to be handled similarly.
>
>And we also want to use it to cover future states.

It is very likely that there will be additional uses of this beyond FRED.

^ permalink raw reply	[flat|nested] 37+ messages in thread