linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/1] KVM: Add support for the ERAPS feature
@ 2025-05-15 15:26 Amit Shah
  2025-05-15 15:26 ` [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
  0 siblings, 1 reply; 9+ messages in thread
From: Amit Shah @ 2025-05-15 15:26 UTC (permalink / raw)
  To: linux-kernel, kvm, x86, linux-doc
  Cc: amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe,
	pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, seanjc,
	pbonzini, daniel.sneddon, kai.huang, sandipan.das,
	boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3,
	Amit Shah

Zen5+ AMD CPUs have a larger RSB (64 entries on Zen5), and use all of it in
the host context.  The hypervisor needs to set up a couple of things before the
feature is exposed to guests.  This patch adds that support, enabling guests to
drop their software mitigations in favour of the hardware one.

The feature isn't yet part of an APM update detailing how it works, and
previous versions of the patch were RFC for that reason.  I'm now dropping the
RFC tag, because these patches have been out for a while, and I also have this
writeup:

https://amitshah.net/2024/11/eraps-reduces-software-tax-for-hardware-bugs/

which may be more detailed than what will eventually appear in the APM anyway.

v5:
* Drop RFC tag
* Add separate VMCB01/VMCB02 handling to ensure both L1 and L2 guests are not
  affected by each other's RSB entries
* Rename vmcb_flush_guest_rap() back to vmcb_set_flush_guest_rap().  The
  previous name did not feel right because the call to the function only sets
  a bit in the VMCB which the CPU acts on much later (at VMRUN).

v4:
* Address Sean's comments from v3
  * remove a bunch of comments in favour of a better commit message
* Drop patch 1 from the series - Josh's patches handle the most common case,
  and the AutoIBRS-disabled case can be tackled later if required after Josh's
  patches have been merged upstream.

v3:
* rebase on top of Josh's RSB tweaks series
  * with that rebase, only the non-AutoIBRS case needs special ERAPS support.
    AutoIBRS is currently disabled when SEV-SNP is active (commit acaa4b5c4c8)

* remove comment about RSB_CLEAR_LOOPS and the size of the RSB -- it's not
  necessary anymore with the rework

* remove comment from patch 2 in svm.c in favour of the commit message

v2:
* reword comments to highlight context switch as the main trigger for RSB
  flushes in hardware (Dave Hansen)
* Split out outdated comment updates in (v1) patch1 to be a standalone
  patch1 in this series, to reinforce RSB filling is only required for RSB
  poisoning cases for AMD
  * Remove mentions of BTC/BTC_NO (Andrew Cooper)
* Add braces in case stmt (kernel test robot)
* s/boot_cpu_has/cpu_feature_enabled (Boris Petkov)


Amit Shah (1):
  x86: kvm: svm: set up ERAPS support for guests

 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/svm.h         |  6 +++++-
 arch/x86/kvm/cpuid.c               | 10 +++++++++-
 arch/x86/kvm/svm/svm.c             | 14 ++++++++++++++
 arch/x86/kvm/svm/svm.h             | 20 ++++++++++++++++++++
 5 files changed, 49 insertions(+), 2 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-05-15 15:26 [PATCH v5 0/1] KVM: Add support for the ERAPS feature Amit Shah
@ 2025-05-15 15:26 ` Amit Shah
  2025-05-19 21:22   ` Tom Lendacky
  2025-08-20 17:10   ` Sean Christopherson
  0 siblings, 2 replies; 9+ messages in thread
From: Amit Shah @ 2025-05-15 15:26 UTC (permalink / raw)
  To: linux-kernel, kvm, x86, linux-doc
  Cc: amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe,
	pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, seanjc,
	pbonzini, daniel.sneddon, kai.huang, sandipan.das,
	boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3

From: Amit Shah <amit.shah@amd.com>

AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature
(Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
VMEXITs.  The feature adds guest/host tags to entries in the RSB (a.k.a.
RAP).  This helps with speculation protection across the VM boundary,
and it also preserves host and guest entries in the RSB that can improve
software performance (which would otherwise be flushed due to the
FILL_RETURN_BUFFER sequences).  This feature also extends the size of
the RSB from the older standard (of 32 entries) to a new default
enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
entries in Zen5 CPUs.

In addition to flushing the RSB across VMEXIT boundaries, CPUs with
this feature also flush the RSB when the CR3 is updated (i.e. whenever
there's a context switch), to prevent one userspace process poisoning
the RSB that may affect another process.  The relevance of this for KVM
is explained below in caveat 2.

The hardware feature is always-on, and the host context uses the full
default RSB size without any software changes necessary.  The presence
of this feature allows software (both in host and guest contexts) to
drop all RSB filling routines in favour of the hardware doing it.

For guests to observe and use this feature, the hypervisor needs to
expose the CPUID bit, and also set a VMCB bit.  Without one or both of
those, guests continue to use the older default RSB size and behaviour
for backwards compatibility.  This means the hardware RSB size is
limited to 32 entries for guests that do not have this feature exposed
to them.

There are two guest/host configurations that need to be addressed before
allowing a guest to use this feature: nested guests, and hosts using
shadow paging (or when NPT is disabled):

1. Nested guests: the ERAPS feature adds host/guest tagging to entries
   in the RSB, but does not distinguish between the guest ASIDs.  To
   prevent the case of an L2 guest poisoning the RSB to attack the L1
   guest, the CPU exposes a new VMCB bit (FLUSH_RAP_ON_VMRUN).  The next
   VMRUN with a VMCB that has this bit set causes the CPU to flush the
   RSB before entering the guest context.  In this patch, we set the bit
   in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest
   runs, its RSB contents aren't polluted by the L2's contents.
   Similarly, when an exit from L1 to the hypervisor happens, we set
   that bit for VMCB02, so that the L1 guest's RSB contents are not
   leaked/used in the L2 context.

2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
   entries when the CR3 is updated.  When using shadow paging, CR3
   updates within the guest do not update the CPU's CR3 register.  In
   this case, do not expose the ERAPS feature to guests, and the guests
   continue with existing mitigations to fill the RSB.

This patch to KVM ensures both those caveats are addressed, and sets the
new ALLOW_LARGER_RAP VMCB bit that exposes this feature to the guest.
That allows the new default RSB size to be used in guest contexts as
well, and allows the guest to drop its RSB flushing routines.

Signed-off-by: Amit Shah <amit.shah@amd.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/svm.h         |  6 +++++-
 arch/x86/kvm/cpuid.c               | 10 +++++++++-
 arch/x86/kvm/svm/svm.c             | 14 ++++++++++++++
 arch/x86/kvm/svm/svm.h             | 20 ++++++++++++++++++++
 5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 39e61212ac9a..57264d5ab162 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -457,6 +457,7 @@
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
 #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
 
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 9b7fa99ae951..cf6a94e64e58 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -130,7 +130,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -176,6 +177,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID 3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
+#define ERAP_CONTROL_ALLOW_LARGER_RAP BIT(0)
+#define ERAP_CONTROL_FLUSH_RAP BIT(1)
+
 #define V_TPR_MASK 0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 571c906ffcbf..0cca1865826e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1187,6 +1187,9 @@ void kvm_set_cpu_caps(void)
 		F(SRSO_USER_KERNEL_NO),
 	);
 
+	if (tdp_enabled)
+		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
+
 	kvm_cpu_cap_init(CPUID_8000_0022_EAX,
 		F(PERFMON_V2),
 	);
@@ -1756,8 +1759,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->ecx = entry->edx = 0;
+		entry->ecx = entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
+
 		break;
 	/* AMD Extended Performance Monitoring and Debug */
 	case 0x80000022: {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a89c271a1951..a2b075ed4133 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1363,6 +1363,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
 		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 
+	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
+		vmcb_enable_extended_rap(svm->vmcb);
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 
 		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
 
+		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
+			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
+			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
+		}
+
 		vmexit = nested_svm_exit_special(svm);
 
 		if (vmexit == NESTED_EXIT_CONTINUE)
@@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 
 		if (vmexit == NESTED_EXIT_DONE)
 			return 1;
+	} else {
+		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
+			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
+			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
+		}
 	}
 
 	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f16b068c4228..7f44f7c9b1d5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -493,6 +493,26 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
 	return vmcb_is_intercept(&svm->vmcb->control, bit);
 }
 
+static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
+{
+	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
+}
+
+static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
+{
+	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
+}
+
+static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
+{
+	vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
+}
+
+static inline bool vmcb_is_extended_rap(struct vmcb *vmcb)
+{
+	return !!(vmcb->control.erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP);
+}
+
 static inline bool nested_vgif_enabled(struct vcpu_svm *svm)
 {
 	return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_VGIF) &&
-- 
2.49.0



* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-05-15 15:26 ` [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
@ 2025-05-19 21:22   ` Tom Lendacky
  2025-05-28 12:49     ` Amit Shah
  2025-08-20 17:10   ` Sean Christopherson
  1 sibling, 1 reply; 9+ messages in thread
From: Tom Lendacky @ 2025-05-19 21:22 UTC (permalink / raw)
  To: Amit Shah, linux-kernel, kvm, x86, linux-doc
  Cc: amit.shah, bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet,
	mingo, dave.hansen, hpa, seanjc, pbonzini, daniel.sneddon,
	kai.huang, sandipan.das, boris.ostrovsky, Babu.Moger,
	david.kaplan, dwmw, andrew.cooper3

On 5/15/25 10:26, Amit Shah wrote:
> From: Amit Shah <amit.shah@amd.com>
> 
> AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature
> Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
> VMEXITs.  The feature adds guest/host tags to entries in the RSB (a.k.a.
> RAP).  This helps with speculation protection across the VM boundary,
> and it also preserves host and guest entries in the RSB that can improve
> software performance (which would otherwise be flushed due to the
> FILL_RETURN_BUFFER sequences).  This feature also extends the size of
> the RSB from the older standard (of 32 entries) to a new default
> enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
> entries in Zen5 CPUs.
> 
> In addition to flushing the RSB across VMEXIT boundaries, CPUs with
> this feature also flush the RSB when the CR3 is updated (i.e. whenever
> there's a context switch), to prevent one userspace process poisoning
> the RSB that may affect another process.  The relevance of this for KVM
> is explained below in caveat 2.
> 
> The hardware feature is always-on, and the host context uses the full
> default RSB size without any software changes necessary.  The presence
> of this feature allows software (both in host and guest contexts) to
> drop all RSB filling routines in favour of the hardware doing it.
> 
> For guests to observe and use this feature, the hypervisor needs to
> expose the CPUID bit, and also set a VMCB bit.  Without one or both of
> those, guests continue to use the older default RSB size and behaviour
> for backwards compatibility.  This means the hardware RSB size is
> limited to 32 entries for guests that do not have this feature exposed
> to them.
> 
> There are two guest/host configurations that need to be addressed before
> allowing a guest to use this feature: nested guests, and hosts using
> shadow paging (or when NPT is disabled):
> 
> 1. Nested guests: the ERAPS feature adds host/guest tagging to entries
>    in the RSB, but does not distinguish between the guest ASIDs.  To
>    prevent the case of an L2 guest poisoning the RSB to attack the L1
>    guest, the CPU exposes a new VMCB bit (FLUSH_RAP_ON_VMRUN).  The next
>    VMRUN with a VMCB that has this bit set causes the CPU to flush the
>    RSB before entering the guest context.  In this patch, we set the bit
>    in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest
>    runs, its RSB contents aren't polluted by the L2's contents.
>    Similarly, when an exit from L1 to the hypervisor happens, we set
>    that bit for VMCB02, so that the L1 guest's RSB contents are not
>    leaked/used in the L2 context.
> 
> 2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
>    entries when the CR3 is updated.  When using shadow paging, CR3
>    updates within the guest do not update the CPU's CR3 register.  In
>    this case, do not expose the ERAPS feature to guests, and the guests
>    continue with existing mitigations to fill the RSB.
> 
> This patch to KVM ensures both those caveats are addressed, and sets the
> new ALLOW_LARGER_RAP VMCB bit that exposes this feature to the guest.
> That allows the new default RSB size to be used in guest contexts as
> well, and allows the guest to drop its RSB flushing routines.
> 
> Signed-off-by: Amit Shah <amit.shah@amd.com>
> ---
>  arch/x86/include/asm/cpufeatures.h |  1 +
>  arch/x86/include/asm/svm.h         |  6 +++++-
>  arch/x86/kvm/cpuid.c               | 10 +++++++++-
>  arch/x86/kvm/svm/svm.c             | 14 ++++++++++++++
>  arch/x86/kvm/svm/svm.h             | 20 ++++++++++++++++++++
>  5 files changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 39e61212ac9a..57264d5ab162 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -457,6 +457,7 @@
>  #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
>  #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
>  
> +#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
>  #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
>  #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
>  #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 9b7fa99ae951..cf6a94e64e58 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -130,7 +130,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  	u64 tsc_offset;
>  	u32 asid;
>  	u8 tlb_ctl;
> -	u8 reserved_2[3];
> +	u8 erap_ctl;
> +	u8 reserved_2[2];
>  	u32 int_ctl;
>  	u32 int_vector;
>  	u32 int_state;
> @@ -176,6 +177,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  #define TLB_CONTROL_FLUSH_ASID 3
>  #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
>  
> +#define ERAP_CONTROL_ALLOW_LARGER_RAP BIT(0)
> +#define ERAP_CONTROL_FLUSH_RAP BIT(1)
> +
>  #define V_TPR_MASK 0x0f
>  
>  #define V_IRQ_SHIFT 8
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 571c906ffcbf..0cca1865826e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1187,6 +1187,9 @@ void kvm_set_cpu_caps(void)
>  		F(SRSO_USER_KERNEL_NO),
>  	);
>  
> +	if (tdp_enabled)
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);

Should this be moved to svm_set_cpu_caps() ? And there it can be

	if (npt_enabled)
		kvm_cpu_cap...

> +
>  	kvm_cpu_cap_init(CPUID_8000_0022_EAX,
>  		F(PERFMON_V2),
>  	);
> @@ -1756,8 +1759,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
>  		break;
>  	case 0x80000021:
> -		entry->ebx = entry->ecx = entry->edx = 0;
> +		entry->ecx = entry->edx = 0;
>  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> +			entry->ebx &= GENMASK(23, 16);
> +		else
> +			entry->ebx = 0;
> +

Extra blank line.

>  		break;
>  	/* AMD Extended Performance Monitoring and Debug */
>  	case 0x80000022: {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a89c271a1951..a2b075ed4133 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1363,6 +1363,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
>  		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
>  
> +	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)

Should this be:

	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))

?

> +		vmcb_enable_extended_rap(svm->vmcb);
> +
>  	if (kvm_vcpu_apicv_active(vcpu))
>  		avic_init_vmcb(svm, vmcb);
>  
> @@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
>  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
>  	pr_err("%-20s%d\n", "asid:", control->asid);
>  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
>  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
>  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
>  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> @@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
>  
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
> +			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
> +			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
> +		}
> +
>  		vmexit = nested_svm_exit_special(svm);
>  
>  		if (vmexit == NESTED_EXIT_CONTINUE)
> @@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		if (vmexit == NESTED_EXIT_DONE)
>  			return 1;
> +	} else {
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
> +			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
> +			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
> +		}
>  	}
>  
>  	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f16b068c4228..7f44f7c9b1d5 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -493,6 +493,26 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
>  	return vmcb_is_intercept(&svm->vmcb->control, bit);
>  }
>  
> +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)

s/extended/larger/ to match the bit name ?

> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
> +}
> +
> +static inline bool vmcb_is_extended_rap(struct vmcb *vmcb)

s/is_extended/has_larger/

Thanks,
Tom

> +{
> +	return !!(vmcb->control.erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP);
> +}
> +
>  static inline bool nested_vgif_enabled(struct vcpu_svm *svm)
>  {
>  	return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_VGIF) &&


* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-05-19 21:22   ` Tom Lendacky
@ 2025-05-28 12:49     ` Amit Shah
  2025-08-20 16:18       ` Sean Christopherson
  0 siblings, 1 reply; 9+ messages in thread
From: Amit Shah @ 2025-05-28 12:49 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, kvm, x86, linux-doc
  Cc: bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo,
	dave.hansen, hpa, seanjc, pbonzini, daniel.sneddon, kai.huang,
	sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw,
	andrew.cooper3, amit.shah

On Mon, 2025-05-19 at 16:22 -0500, Tom Lendacky wrote:
> On 5/15/25 10:26, Amit Shah wrote:
> 

[...]

> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 571c906ffcbf..0cca1865826e 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1187,6 +1187,9 @@ void kvm_set_cpu_caps(void)
> >  		F(SRSO_USER_KERNEL_NO),
> >  	);
> >  
> > +	if (tdp_enabled)
> > +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
> 
> Should this be moved to svm_set_cpu_caps() ? And there it can be
> 
> 	if (npt_enabled)
> 		kvm_cpu_cap...

Yea, I don't mind moving that to svm-only code.  Will do.

> >  	case 0x80000021:
> > -		entry->ebx = entry->ecx = entry->edx = 0;
> > +		entry->ecx = entry->edx = 0;
> >  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> > +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> > +			entry->ebx &= GENMASK(23, 16);
> > +		else
> > +			entry->ebx = 0;
> > +
> 
> Extra blank line.

Hm, helps with visual separation of the if-else and the break.  I
prefer to keep it, unless it breaks style guidelines.

> >  		break;
> >  	/* AMD Extended Performance Monitoring and Debug */
> >  	case 0x80000022: {
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index a89c271a1951..a2b075ed4133 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1363,6 +1363,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> >  	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
> >  		set_msr_interception(vcpu, svm->msrpm,
> > MSR_IA32_SPEC_CTRL, 1, 1);
> >  
> > +	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
> 
> Should this be:
> 
> 	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> 
> ?

Indeed this is better.  There was some case I wanted to cover
initially, but I don't think it needs to only depend on the host caps
in the current version at least.

[...]
 
> > +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> > +{
> > +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> > +}
> > +
> > +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> > +{
> > +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> > +}
> > +
> > +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
> 
> s/extended/larger/ to match the bit name ?

I also prefer it with the "larger" name, but that is a confusing bit
name -- so after the last round of review, I renamed the accessor
functions to be "better", while leaving the bit defines match what the
CPU has.

I don't mind switching this back - anyone else have any other opinions?

> 
> > +{
> > +	vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
> > +}
> > +
> > +static inline bool vmcb_is_extended_rap(struct vmcb *vmcb)
> 
> s/is_extended/has_larger/
> 
> Thanks,
> Tom

Thanks for the review!

		Amit


* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-05-28 12:49     ` Amit Shah
@ 2025-08-20 16:18       ` Sean Christopherson
  2025-08-21  9:05         ` Amit Shah
  0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2025-08-20 16:18 UTC (permalink / raw)
  To: Amit Shah
  Cc: Tom Lendacky, linux-kernel, kvm, x86, linux-doc, bp, tglx, peterz,
	jpoimboe, pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa,
	pbonzini, daniel.sneddon, kai.huang, sandipan.das,
	boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3,
	amit.shah

On Wed, May 28, 2025, Amit Shah wrote:
> On Mon, 2025-05-19 at 16:22 -0500, Tom Lendacky wrote:
> > > +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> > > +{
> > > +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> > > +}
> > > +
> > > +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> > > +{
> > > +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> > > +}
> > > +
> > > +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
> > 
> > s/extended/larger/ to match the bit name ?
> 
> I also prefer it with the "larger" name, but that is a confusing bit
> name -- so after the last round of review, I renamed the accessor
> functions to be "better", while leaving the bit defines match what the
> CPU has.
> 
> I don't mind switching this back - anyone else have any other opinions?

They both suck equally?  :-)

My dislike of "larger" is that it's a relative and intermediate term.  What is
the "smaller" size?  Is there an "even larger" or "largest size"?

"extended" doesn't help in any way, because that simply "solves" the problem of
size ambiguity by saying absolutely nothing about the size.

I also dislike "allow", because in virtualization context, "allow" usually refers
to what the _guest_ can do, but in this case "allow" refers to what the CPU can
do.

If we want to diverge from the APM, my vote would be for something like

  ERAP_CONTROL_FULL_SIZE_RAP


* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-05-15 15:26 ` [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
  2025-05-19 21:22   ` Tom Lendacky
@ 2025-08-20 17:10   ` Sean Christopherson
  2025-08-27  9:17     ` Amit Shah
  2025-08-28  9:40     ` Amit Shah
  1 sibling, 2 replies; 9+ messages in thread
From: Sean Christopherson @ 2025-08-20 17:10 UTC (permalink / raw)
  To: Amit Shah
  Cc: linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp,
	tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo,
	dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang,
	sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw,
	andrew.cooper3

On Thu, May 15, 2025, Amit Shah wrote:
> From: Amit Shah <amit.shah@amd.com>
> 
> AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature
> Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
> VMEXITs.  The feature adds guest/host tags to entries in the RSB (a.k.a.
> RAP).  This helps with speculation protection across the VM boundary,
> and it also preserves host and guest entries in the RSB that can improve
> software performance (which would otherwise be flushed due to the
> FILL_RETURN_BUFFER sequences).  This feature also extends the size of
> the RSB from the older standard (of 32 entries) to a new default
> enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
> entries in Zen5 CPUs.
> 
> In addition to flushing the RSB across VMEXIT boundaries, CPUs with
> this feature also flush the RSB when the CR3 is updated (i.e. whenever
> there's a context switch), to prevent one userspace process poisoning
> the RSB that may affect another process.  The relevance of this for KVM
> is explained below in caveat 2.
> 
> The hardware feature is always-on, and the host context uses the full
> default RSB size without any software changes necessary.  The presence
> of this feature allows software (both in host and guest contexts) to
> drop all RSB filling routines in favour of the hardware doing it.
> 
> For guests to observe and use this feature, 

Guests don't necessarily "use" this feature.  It's something that's enabled by
KVM and affects harware behavior regardless of whether or not the guest is even
aware ERAPS is a thing.

> the hypervisor needs to expose the CPUID bit, and also set a VMCB bit.
> Without one or both of those, 

No?  If there's no enabling for bare metal usage, I don't see how emulation of
CPUID can possibly impact usage of RAP size.  The only thing that matters is the
VMCB bit.  And nothing in this patch queries guest CPUID.

Observing ERAPS _might_ cause the guest to forego certain mitigations, but KVM
has zero visibility into whether or not such mitigations exist, if the guest will
care about ERAPS, etc.

> guests continue to use the older default RSB size and behaviour for backwards
> compatibility.  This means the hardware RSB size is limited to 32 entries for
> guests that do not have this feature exposed to them.
> 
> There are two guest/host configurations that need to be addressed before
> allowing a guest to use this feature: nested guests, and hosts using
> shadow paging (or when NPT is disabled):
> 
> 1. Nested guests: the ERAPS feature adds host/guest tagging to entries
>    in the RSB, but does not distinguish between the guest ASIDs.  To
>    prevent the case of an L2 guest poisoning the RSB to attack the L1
>    guest, the CPU exposes a new VMCB bit (FLUSH_RAP_ON_VMRUN).  The next
>    VMRUN with a VMCB that has this bit set causes the CPU to flush the
>    RSB before entering the guest context.  In this patch, we set the bit

"this patch", "we".

>    in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest
>    runs, its RSB contents aren't polluted by the L2's contents.
>    Similarly, when an exit from L1 to the hypervisor happens, we set
>    that bit for VMCB02, so that the L1 guest's RSB contents are not
>    leaked/used in the L2 context.
> 
> 2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
>    entries when the CR3 is updated.  When using shadow paging, CR3
>    updates within the guest do not update the CPU's CR3 register.

Yes they do, just indirectly.  KVM changes the effective CR3 in reaction to the
guest's new CR3.  If hardware doesn't flush in that situation, then it's trivially
easy to set ERAP_CONTROL_FLUSH_RAP on writes to CR3.

>    In this case, do not expose the ERAPS feature to guests,
>    and the guests continue with existing mitigations to fill the RSB.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 571c906ffcbf..0cca1865826e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1187,6 +1187,9 @@ void kvm_set_cpu_caps(void)
>  		F(SRSO_USER_KERNEL_NO),
>  	);
>  
> +	if (tdp_enabled)
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);

_If_ ERAPS is conditionally enabled, then it probably makes sense to do this in
svm_set_cpu_caps().  But I think we can just support ERAPS unconditionally.

> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a89c271a1951..a2b075ed4133 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1363,6 +1363,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
>  		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
>  
> +	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
	
Don't regurgitate the same check in multiple places.

	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))

> +		vmcb_enable_extended_rap(svm->vmcb);
> +
>  	if (kvm_vcpu_apicv_active(vcpu))
>  		avic_init_vmcb(svm, vmcb);
>  
> @@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
>  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
>  	pr_err("%-20s%d\n", "asid:", control->asid);
>  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
>  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
>  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
>  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> @@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
>  
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
> +			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
> +			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
> +		}
> +
>  		vmexit = nested_svm_exit_special(svm);
>  
>  		if (vmexit == NESTED_EXIT_CONTINUE)
> @@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		if (vmexit == NESTED_EXIT_DONE)
>  			return 1;
> +	} else {
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
> +			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
> +			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
> +		}

Handling this in the common exit path is confusing, inefficient, and lacking.

Assuming hardware doesn't automatically clear ERAP_CONTROL_FLUSH_RAP, then KVM
should clear the flag after _any_ exit, not just exits that reach this point,
e.g. if KVM stays in the fast path.

And IIUC, ERAP_CONTROL_FLUSH_RAP needs to be done on _every_ nested transition,
not just those that occur in direct response to a hardware #VMEXIT. So, hook
nested_vmcb02_prepare_control() for nested VMRUN and nested_svm_vmexit() for
nested #VMEXIT.

Side topic, the changelog should call out that KVM deliberately ignores guest
CPUID, and instead unconditionally enables the full size RAP when ERAPS is
supported.  I.e. KVM _could_ check guest_cpu_cap_has() instead of kvm_cpu_cap_has()
in all locations, to avoid having to flush the RAP on nested transitions when
ERAPS isn't enumerated to the guest, but presumably using the full size RAP is
better for overall performance.

The changelog should also call out that if the full size RAP is enabled, then
it's KVM's responsibility to flush the RAP on nested transitions irrespective
of whether or not ERAPS is advertised to the guest.  Because if ERAPS isn't
advertised, then the guest's mitigations will likely be insufficient.

>  	}
>  
>  	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f16b068c4228..7f44f7c9b1d5 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -493,6 +493,26 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
>  	return vmcb_is_intercept(&svm->vmcb->control, bit);
>  }
>  
> +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
> +}
> +
> +static inline bool vmcb_is_extended_rap(struct vmcb *vmcb)
> +{
> +	return !!(vmcb->control.erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP);
> +}

Eh, just drop all of these wrappers.

With the caveat that I'm taking a wild guess on the !npt behavior, something
like this?

---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/svm.h         | 6 +++++-
 arch/x86/kvm/cpuid.c               | 7 ++++++-
 arch/x86/kvm/svm/nested.c          | 6 ++++++
 arch/x86/kvm/svm/svm.c             | 9 +++++++++
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index eb859299d514..87d9284c166a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -466,6 +466,7 @@
 #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
 
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index ffc27f676243..58a079d6c3ef 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID 3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
+#define ERAP_CONTROL_FULL_SIZE_RAP BIT(0)
+#define ERAP_CONTROL_FLUSH_RAP BIT(1)
+
 #define V_TPR_MASK 0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ad6cadf09930..184a810b53e3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void)
 		F(AUTOIBRS),
 		F(PREFETCHI),
 		EMULATED_F(NO_SMM_CTL_MSR),
+		F(ERAPS),
 		/* PrefetchCtlMsr */
 		/* GpOnUserCpuid */
 		/* EPSF */
@@ -1760,9 +1761,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
 		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
 		break;
 	/* AMD Extended Performance Monitoring and Debug */
 	case 0x80000022: {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index b7fd2e869998..77794fd809e1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -861,6 +861,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 		}
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb02->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
+
 	/*
 	 * Merge guest and host intercepts - must be called with vcpu in
 	 * guest-mode to take effect.
@@ -1144,6 +1147,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	kvm_nested_vmexit_handle_ibrs(vcpu);
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb01->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
+
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	/*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7e7821ee8ee1..501596e56d39 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1205,6 +1205,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm_clr_intercept(svm, INTERCEPT_PAUSE);
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FULL_SIZE_RAP;
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -2560,6 +2563,8 @@ static int cr_interception(struct kvm_vcpu *vcpu)
 			break;
 		case 3:
 			err = kvm_set_cr3(vcpu, val);
+			if (!err && nested && kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+				svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
 			break;
 		case 4:
 			err = kvm_set_cr4(vcpu, val);
@@ -3322,6 +3327,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -4366,6 +4372,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
+
 	vmcb_mark_all_clean(svm->vmcb);
 
 	/* if exit due to PF check for async PF */

base-commit: 91b392ada892a2e8b1c621b9493c50f6fb49880f
--

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-08-20 16:18       ` Sean Christopherson
@ 2025-08-21  9:05         ` Amit Shah
  0 siblings, 0 replies; 9+ messages in thread
From: Amit Shah @ 2025-08-21  9:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Tom Lendacky, linux-kernel, kvm, x86, linux-doc, bp, tglx, peterz,
	jpoimboe, pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa,
	pbonzini, daniel.sneddon, kai.huang, sandipan.das,
	boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3,
	amit.shah

On (Wed) 20 Aug 2025 [09:18:29], Sean Christopherson wrote:
> On Wed, May 28, 2025, Amit Shah wrote:
> > On Mon, 2025-05-19 at 16:22 -0500, Tom Lendacky wrote:
> > > > +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> > > > +{
> > > > +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> > > > +}
> > > > +
> > > > +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> > > > +{
> > > > +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> > > > +}
> > > > +
> > > > +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
> > > 
> > > s/extended/larger/ to match the bit name ?
> > 
> > I also prefer it with the "larger" name, but that is a confusing bit
> > name -- so after the last round of review, I renamed the accessor
> > functions to be "better", while leaving the bit defines match what the
> > CPU has.
> > 
> > I don't mind switching this back - anyone else have any other opinions?
> 
> They both suck equally?  :-)
> 
> My dislike of "larger" is that it's a relative and intermediate term.  What is
> the "smaller" size?  Is there an "even larger" or "largest size"?
> 
> "extended" doesn't help in any way, because that simply "solves" the problem of
> size ambiguity by saying absolutely nothing about the size.

I agree with that; but it's just how the bit is named in the APM, so...

> I also dislike "allow", because in virtualization context, "allow" usually refers
> to what the _guest_ can do, but in this case "allow" refers to what the CPU can
> do.

(same as above)

> If we want to diverge from the APM, my vote would be for something like
> 
>   ERAP_CONTROL_FULL_SIZE_RAP

Oh I def didn't want to diverge from the APM (for the name of the bit).  I
only want to check what's palatable for the accessors -- but a quick read of
the followup reply shows you're asking to drop them and just use the checks
directly, that's fine - no need for these accessors and this renaming.

	  	 Amit

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-08-20 17:10   ` Sean Christopherson
@ 2025-08-27  9:17     ` Amit Shah
  2025-08-28  9:40     ` Amit Shah
  1 sibling, 0 replies; 9+ messages in thread
From: Amit Shah @ 2025-08-27  9:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp,
	tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo,
	dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang,
	sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw,
	andrew.cooper3

On (Wed) 20 Aug 2025 [10:10:16], Sean Christopherson wrote:
> On Thu, May 15, 2025, Amit Shah wrote:

[...]

> > For guests to observe and use this feature, 
> 
> Guests don't necessarily "use" this feature.  It's something that's enabled by
> KVM and affects hardware behavior regardless of whether or not the guest is even
> aware ERAPS is a thing.

OK, wording it is tricky.  "Use" in the sense that the entire RSB gets
utilized within guest context -- not "use" as in the guest needs enablement
or needs to do anything special.

"For the extended size to also be utilized when the CPU is in guest context,
the hypervisor needs to..." ?

> > the hypervisor needs to expose the CPUID bit, and also set a VMCB bit.
> > Without one or both of those, 
> 
> No?  If there's no enabling for bare metal usage, I don't see how emulation of
> CPUID can possibly impact usage of RAP size.  The only thing that matters is the
> VMCB bit.  And nothing in this patch queries guest CPUID.

True.

> Observing ERAPS _might_ cause the guest to forego certain mitigations, but KVM
> has zero visibility into whether or not such mitigations exist, if the guest will
> care about ERAPS, etc.

Sure, there's nothing guest-specific about this; any OS, when it detects
ERAPS, may or may not want to adapt to its existence.  (As it turns out, for
Linux, no adaptation is necessary.)

> > guests continue to use the older default RSB size and behaviour for backwards
> > compatibility.  This means the hardware RSB size is limited to 32 entries for
> > guests that do not have this feature exposed to them.

[...]

> > 2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
> >    entries when the CR3 is updated.  When using shadow paging, CR3
> >    updates within the guest do not update the CPU's CR3 register.
> 
> Yes they do, just indirectly.  KVM changes the effective CR3 in reaction to the
> guest's new CR3.  If hardware doesn't flush in that situation, then it's trivially
> easy to set ERAP_CONTROL_FLUSH_RAP on writes to CR3.

Yea, that's right - since it doesn't happen in-guest (i.e. there's an exit
instead), it needs KVM to set that bit.

[...]

> > @@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
> >  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
> >  	pr_err("%-20s%d\n", "asid:", control->asid);
> >  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> > +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
> >  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
> >  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
> >  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> > @@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  
> >  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
> >  
> > +		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
> > +			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
> > +			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
> > +		}
> > +
> >  		vmexit = nested_svm_exit_special(svm);
> >  
> >  		if (vmexit == NESTED_EXIT_CONTINUE)
> > @@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  
> >  		if (vmexit == NESTED_EXIT_DONE)
> >  			return 1;
> > +	} else {
> > +		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
> > +			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
> > +			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
> > +		}
> 
> Handling this in the common exit path is confusing, inefficient, and lacking.

Heh, I agree.  I toyed with doing this just before VMRUN.  But I can't recall
why I disliked that more.

> Assuming hardware doesn't automatically clear ERAP_CONTROL_FLUSH_RAP, then KVM

That's right - it doesn't.

> should clear the flag after _any_ exit, not just exits that reach this point,
> e.g. if KVM stays in the fast path.

(or just before VMRUN).  Right.

> And IIUC, ERAP_CONTROL_FLUSH_RAP needs to be done on _every_ nested transition,
> not just those that occur in direct response to a hardware #VMEXIT. So, hook
> nested_vmcb02_prepare_control() for nested VMRUN and nested_svm_vmexit() for
> nested #VMEXIT.

Does sound better.  I think the case I wanted to preserve with this complex
logic was an L2->exit->L2 transition, where I didn't want to set the FLUSH
bit.

> Side topic, the changelog should call out that KVM deliberately ignores guest
> CPUID, and instead unconditionally enables the full size RAP when ERAPS is
> supported.  I.e. KVM _could_ check guest_cpu_cap_has() instead of kvm_cpu_cap_has()
> in all locations, to avoid having to flush the RAP on nested transitions when
> ERAPS isn't enumerated to the guest, but presumably using the full size RAP is
> better for overall performance.

Yea.

> The changelog should also call out that if the full size RAP is enabled, then
> it's KVM's responsibility to flush the RAP on nested transitions irrespective
> of whether or not ERAPS is advertised to the guest.  Because if ERAPS isn't
> advertised, then the guest's mitigations will likely be insufficient.

You mean the L2 guest?  ACK on the update.

> With the caveat that I'm taking a wild guess on the !npt behavior, something
> like this?

[...]

> +#define ERAP_CONTROL_FULL_SIZE_RAP BIT(0)
> +#define ERAP_CONTROL_FLUSH_RAP BIT(1)

Oh I def prefer to keep the APM-specified names.

[...]

Patch looks good!

I'll test it a bit and repost.

Thanks,

		Amit


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-08-20 17:10   ` Sean Christopherson
  2025-08-27  9:17     ` Amit Shah
@ 2025-08-28  9:40     ` Amit Shah
  1 sibling, 0 replies; 9+ messages in thread
From: Amit Shah @ 2025-08-28  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp,
	tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo,
	dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang,
	sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw,
	andrew.cooper3

On (Wed) 20 Aug 2025 [10:10:16], Sean Christopherson wrote:

[...]

> > +	if (tdp_enabled)
> > +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
> 
> _If_ ERAPS is conditionally enabled, then it probably makes sense to do this in
> svm_set_cpu_caps().  But I think we can just support ERAPS unconditionally.

[...]

> @@ -2560,6 +2563,8 @@ static int cr_interception(struct kvm_vcpu *vcpu)
>  			break;
>  		case 3:
>  			err = kvm_set_cr3(vcpu, val);
> +			if (!err && nested && kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> +				svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;

I missed this part in my reply earlier.

Enabling ERAPS for a guest is trickier in the no-NPT case: *if* the guest is
enlightened, notices the CPUID bit for ERAPS, and drops its mitigations, KVM
needs to ensure the FLUSH_RAP bit is set in the VMCB on every guest CR3
change.  Your change adds it - but only for the nested case.  It needs to be
done for the non-nested case as well.

I steered away from enabling it in the !npt case in the first place because
it's such a niche configuration that it didn't seem worth the bother of
getting it right -- but since you've added it here, I'll go with it and drop
the '&& nested' part of the hunk above.

     	Amit

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-08-28  9:40 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-15 15:26 [PATCH v5 0/1] KVM: Add support for the ERAPS feature Amit Shah
2025-05-15 15:26 ` [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
2025-05-19 21:22   ` Tom Lendacky
2025-05-28 12:49     ` Amit Shah
2025-08-20 16:18       ` Sean Christopherson
2025-08-21  9:05         ` Amit Shah
2025-08-20 17:10   ` Sean Christopherson
2025-08-27  9:17     ` Amit Shah
2025-08-28  9:40     ` Amit Shah
