* [PATCH v6 0/1] KVM: Add support for the ERAPS feature
@ 2025-11-07 9:32 Amit Shah
2025-11-07 9:32 ` [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
0 siblings, 1 reply; 11+ messages in thread
From: Amit Shah @ 2025-11-07 9:32 UTC (permalink / raw)
To: linux-kernel, kvm, x86, linux-doc
Cc: amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe,
pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, seanjc,
pbonzini, daniel.sneddon, kai.huang, sandipan.das,
boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3,
Amit Shah
Zen5+ AMD CPUs have a larger RSB (64 entries on Zen5), and use all of it in
the host context. The hypervisor needs to set up a couple of things before
the extra 32 entries are exposed to guests. Add hypervisor support to let
the hardware use the entire RSB in VM contexts as well.
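[Editor's note: for readers unfamiliar with the enumeration mentioned in the
patch, here is a minimal sketch — not part of the series, helper name invented
— of extracting the default RAP size from CPUID 0x80000021:EBX bits 23:16.
The caller is assumed to have executed CPUID leaf 0x80000021 already.]

```c
#include <stdint.h>

/*
 * Editor's illustration, not from the series: pull the default RSB/RAP
 * size that ERAPS enumerates in CPUID 0x80000021:EBX bits 23:16 out of
 * a previously-read EBX value.  A result of 0 means the field is not
 * enumerated (e.g. pre-Zen5 parts); Zen5 reports 64.
 */
static inline unsigned int eraps_rap_size(uint32_t ebx)
{
	return (ebx >> 16) & 0xff;
}
```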
The APM has now been published with details of this feature - and I finally
got around to sending this updated version based on the previous
round. Apologies for the long delays in getting this out; I ended up spending
a bunch of time looking at the NPT=off case:
In the previous round, Sean suggested some emulation to also handle the
NPT=off case. After discussions on the PUCK call (and some tracing to confirm
what we had wasn't sufficient), I decided to drop that emulation entirely;
this version simply does not expose ERAPS to guests when NPT=off.
Amit
v6:
* APM update is out as of July 2025. Reference it in the commit msg.
* Update commit msg from review comments (Sean)
* Move cpuid enablement to svm.c from x86.c (Tom Lendacky)
* Update bitfield names to reflect what's in the APM
* Update VMCB bits for all nested exits (Sean)
* Drop helper functions and set bitfields directly instead (Sean)
v5:
* Drop RFC tag
* Add separate VMCB01/VMCB02 handling to ensure both L1 and L2 guests are not
affected by each other's RSB entries
* Rename vmcb_flush_guest_rap() back to vmcb_set_flush_guest_rap(). The
previous name did not feel right because the call to the function only sets
a bit in the VMCB which the CPU acts on much later (at VMRUN).
v4:
* Address Sean's comments from v3
* remove a bunch of comments in favour of a better commit message
* Drop patch 1 from the series - Josh's patches handle the most common case,
and the AutoIBRS-disabled case can be tackled later if required after Josh's
patches have been merged upstream.
v3:
* rebase on top of Josh's RSB tweaks series
* with that rebase, only the non-AutoIBRS case needs special ERAPS support.
AutoIBRS is currently disabled when SEV-SNP is active (commit acaa4b5c4c8)
* remove comment about RSB_CLEAR_LOOPS and the size of the RSB -- it's not
necessary anymore with the rework
* remove comment from patch 2 in svm.c in favour of the commit message
v2:
* reword comments to highlight context switch as the main trigger for RSB
flushes in hardware (Dave Hansen)
* Split out outdated comment updates in (v1) patch1 to be a standalone
patch1 in this series, to reinforce RSB filling is only required for RSB
poisoning cases for AMD
* Remove mentions of BTC/BTC_NO (Andrew Cooper)
* Add braces in case stmt (kernel test robot)
* s/boot_cpu_has/cpu_feature_enabled (Boris Petkov)
Amit Shah (1):
x86: kvm: svm: set up ERAPS support for guests
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 6 +++++-
arch/x86/kvm/cpuid.c | 8 +++++++-
arch/x86/kvm/svm/nested.c | 6 ++++++
arch/x86/kvm/svm/svm.c | 11 +++++++++++
5 files changed, 30 insertions(+), 2 deletions(-)
--
2.51.1
^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-11-07  9:32 [PATCH v6 0/1] KVM: Add support for the ERAPS feature Amit Shah
@ 2025-11-07  9:32 ` Amit Shah
  2025-11-20 20:11   ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Amit Shah @ 2025-11-07  9:32 UTC (permalink / raw)
  To: linux-kernel, kvm, x86, linux-doc
  Cc: amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe,
	pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, seanjc,
	pbonzini, daniel.sneddon, kai.huang, sandipan.das,
	boris.ostrovsky, Babu.Moger, david.kaplan, dwmw, andrew.cooper3

From: Amit Shah <amit.shah@amd.com>

AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature
Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
VMEXITs.  The feature adds guest/host tags to entries in the RSB (a.k.a.
RAP).  This helps with speculation protection across the VM boundary,
and it also preserves host and guest entries in the RSB that can improve
software performance (which would otherwise be flushed due to the
FILL_RETURN_BUFFER sequences).  This feature also extends the size of
the RSB from the older standard (of 32 entries) to a new default
enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
entries in Zen5 CPUs.

The hardware feature is always-on, and the host context uses the full
default RSB size without any software changes necessary.  The presence
of this feature allows software (both in host and guest contexts) to
drop all RSB filling routines in favour of the hardware doing it.

There are two guest/host configurations that need to be addressed before
allowing a guest to use this feature: nested guests, and hosts using
shadow paging (or when NPT is disabled):

1. Nested guests: the ERAPS feature adds host/guest tagging to entries
   in the RSB, but does not distinguish between the guest ASIDs.
   To prevent the case of an L2 guest poisoning the RSB to attack the L1
   guest, the CPU exposes a new VMCB bit (CLEAR_RAP).  The next VMRUN
   with a VMCB that has this bit set causes the CPU to flush the RSB
   before entering the guest context.  Set the bit in VMCB01 after a
   nested #VMEXIT to ensure the next time the L1 guest runs, its RSB
   contents aren't polluted by the L2's contents.  Similarly, before
   entry into a nested guest, set the bit for VMCB02, so that the L1
   guest's RSB contents are not leaked/used in the L2 context.

2. Hosts that disable NPT: the ERAPS feature flushes the RSB entries on
   several conditions, including CR3 updates.  Emulating hardware
   behaviour on RSB flushes is not worth the effort for NPT=off case,
   nor is it worthwhile to enumerate and emulate every trigger the
   hardware uses to flush RSB entries.  Instead of identifying and
   replicating RSB flushes that hardware would have performed had NPT
   been ON, do not let NPT=off VMs use the ERAPS features.

This patch to KVM ensures both those caveats are addressed, and sets the
new ALLOW_LARGER_RAP VMCB bit that allows the CPU to operate with ERAPS
enabled in guest contexts.

This feature is documented in AMD APM Vol 2 (Pub 24593), in revisions
3.43 and later.
Signed-off-by: Amit Shah <amit.shah@amd.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/svm.h         |  6 +++++-
 arch/x86/kvm/cpuid.c               |  8 +++++++-
 arch/x86/kvm/svm/nested.c          |  6 ++++++
 arch/x86/kvm/svm/svm.c             | 11 +++++++++++
 5 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4091a776e37a..edc76a489aae 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -467,6 +467,7 @@
 #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
 
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 17f6c3fedeee..d4602ee4cf1f 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID		3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL	7
 
+#define ERAP_CONTROL_ALLOW_LARGER_RAP	BIT(0)
+#define ERAP_CONTROL_CLEAR_RAP		BIT(1)
+
 #define V_TPR_MASK   0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 52524e0ca97f..93934d4f8f11 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1795,8 +1795,14 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
+
 		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
 		break;
 	/* AMD Extended Performance Monitoring and Debug */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a6443feab252..de51595e875c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -869,6 +869,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 		}
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb02->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	/*
 	 * Merge guest and host intercepts - must be called with vcpu in
 	 * guest-mode to take effect.
@@ -1164,6 +1167,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	kvm_nested_vmexit_handle_ibrs(vcpu);
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	/*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 153c12dbf3eb..ff110a1fb5f0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1147,6 +1147,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event)
 		svm_clr_intercept(svm, INTERCEPT_PAUSE);
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS) && npt_enabled)
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -3267,6 +3270,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -4321,6 +4325,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl &= ~ERAP_CONTROL_CLEAR_RAP;
+
 	vmcb_mark_all_clean(svm->vmcb);
 
 	/* if exit due to PF check for async PF */
@@ -5265,6 +5272,10 @@ static __init void svm_set_cpu_caps(void)
 	/* CPUID 0x8000001F (SME/SEV features) */
 	sev_set_cpu_caps();
 
+	/* CPUID 0x80000021 */
+	if (npt_enabled)
+		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
+
 	/*
 	 * Clear capabilities that are automatically configured by common code,
 	 * but that require explicit SVM support (that isn't yet implemented).
-- 
2.51.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
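[Editor's note: the CLEAR_RAP handling in the patch above is a one-shot
protocol — software sets the bit so the CPU flushes the RAP at the next
VMRUN, then masks it off after the run so later VMRUNs don't re-flush. A toy
stand-alone model of that protocol; the struct and helper names are invented,
only the bit layout matches the patch:]

```c
#include <stdint.h>

/* Bit layout matching the patch's erap_ctl byte in the VMCB control area. */
#define ERAP_CONTROL_ALLOW_LARGER_RAP	(1u << 0)
#define ERAP_CONTROL_CLEAR_RAP		(1u << 1)

/* Toy stand-in for the VMCB control area; not the real structure. */
struct toy_vmcb {
	uint8_t erap_ctl;
};

/*
 * Request a one-shot RAP flush at the next VMRUN, e.g. on a nested
 * VMRUN or nested #VMEXIT as the patch does for VMCB02/VMCB01.
 */
static void toy_request_rap_clear(struct toy_vmcb *vmcb)
{
	vmcb->erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
}

/*
 * After VMRUN the CPU has consumed the request; drop the bit so
 * subsequent VMRUNs don't needlessly flush the guest's RAP.
 */
static void toy_complete_vmrun(struct toy_vmcb *vmcb)
{
	vmcb->erap_ctl &= ~ERAP_CONTROL_CLEAR_RAP;
}
```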
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests
  2025-11-07  9:32 ` [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
@ 2025-11-20 20:11   ` Sean Christopherson
  2025-11-21  2:40     ` Andrew Cooper
  2025-11-24 16:15     ` Shah, Amit
  0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-11-20 20:11 UTC (permalink / raw)
  To: Amit Shah
  Cc: linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky,
	bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo,
	dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang,
	sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw,
	andrew.cooper3

KVM: SVM:

On Fri, Nov 07, 2025, Amit Shah wrote:
> From: Amit Shah <amit.shah@amd.com>
> 
> AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature

Enhanced Return Address Predictor Security.  The 'S' matters.

> Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
> VMEXITs. The feature adds guest/host tags to entries in the RSB (a.k.a.
> RAP). This helps with speculation protection across the VM boundary,
> and it also preserves host and guest entries in the RSB that can improve
> software performance (which would otherwise be flushed due to the
> FILL_RETURN_BUFFER sequences). This feature also extends the size of
> the RSB from the older standard (of 32 entries) to a new default
> enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
> entries in Zen5 CPUs.
> 
> The hardware feature is always-on, and the host context uses the full
> default RSB size without any software changes necessary. The presence
> of this feature allows software (both in host and guest contexts) to
> drop all RSB filling routines in favour of the hardware doing it.
> 
> There are two guest/host configurations that need to be addressed before
> allowing a guest to use this feature: nested guests, and hosts using
> shadow paging (or when NPT is disabled):
> 
> 1. Nested guests: the ERAPS feature adds host/guest tagging to entries
>    in the RSB, but does not distinguish between the guest ASIDs. To
>    prevent the case of an L2 guest poisoning the RSB to attack the L1
>    guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next
>    VMRUN with a VMCB that has this bit set causes the CPU to flush the
>    RSB before entering the guest context. Set the bit in VMCB01 after a
>    nested #VMEXIT to ensure the next time the L1 guest runs, its RSB
>    contents aren't polluted by the L2's contents. Similarly, before
>    entry into a nested guest, set the bit for VMCB02, so that the L1
>    guest's RSB contents are not leaked/used in the L2 context.
> 
> 2. Hosts that disable NPT: the ERAPS feature flushes the RSB entries on
>    several conditions, including CR3 updates. Emulating hardware
>    behaviour on RSB flushes is not worth the effort for NPT=off case,
>    nor is it worthwhile to enumerate and emulate every trigger the
>    hardware uses to flush RSB entries. Instead of identifying and
>    replicating RSB flushes that hardware would have performed had NPT
>    been ON, do not let NPT=off VMs use the ERAPS features.

The emulation requirements are not limited to shadow paging.  From the APM:

  The ERAPS feature eliminates the need to execute CALL instructions to
  clear the return address predictor in most cases. On processors that
  support ERAPS, return addresses from CALL instructions executed in host
  mode are not used in guest mode, and vice versa. Additionally, the
  return address predictor is cleared in all cases when the TLB is
  implicitly invalidated (see Section 5.5.3 “TLB
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Management,” on page 159) and in the following cases:

  • MOV CR3 instruction
  • INVPCID other than single address invalidation (operation type 0)

Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or INVPCID
is unsupported per guest CPUID), but that is an implementation detail, the
instructions are still reachable via the emulator, and KVM needs to emulate
implicit TLB flush behavior.  So punting on emulating RAP clearing because
it's too hard is not an option.  And AFAICT, it's not even that hard.

The changelog also needs to include the architectural behavior, otherwise
"is not worth the effort" is even more subjective since there's no
documentation of what the effort would actually be.

As for emulating the RAP clears, a clever idea is to piggyback and alias
dirty tracking for VCPU_EXREG_CR3, as VCPU_EXREG_ERAPS.  I.e. mark the vCPU
as needing a RAP clear if CR3 is written to, and then let common x86 also
set VCPU_EXREG_ERAPS as needed, e.g. when handling INVPCID.

> This patch to KVM ensures both those caveats are addressed, and sets the

No "This patch".

> new ALLOW_LARGER_RAP VMCB bit that allows the CPU to operate with ERAPS
> enabled in guest contexts.
> 
> This feature is documented in AMD APM Vol 2 (Pub 24593), in revisions
> 3.43 and later.
> 
> Signed-off-by: Amit Shah <amit.shah@amd.com>
> ---
>  arch/x86/include/asm/cpufeatures.h |  1 +
>  arch/x86/include/asm/svm.h         |  6 +++++-
>  arch/x86/kvm/cpuid.c               |  8 +++++++-
>  arch/x86/kvm/svm/nested.c          |  6 ++++++
>  arch/x86/kvm/svm/svm.c             | 11 +++++++++++
>  5 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 4091a776e37a..edc76a489aae 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -467,6 +467,7 @@
>  #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
>  
>  #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
> +#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
>  #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
>  #define X86_FEATURE_IBPB_BRTYPE	(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
>  #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 17f6c3fedeee..d4602ee4cf1f 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  	u64 tsc_offset;
>  	u32 asid;
>  	u8 tlb_ctl;
> -	u8 reserved_2[3];
> +	u8 erap_ctl;
> +	u8 reserved_2[2];
>  	u32 int_ctl;
>  	u32 int_vector;
>  	u32 int_state;
> @@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  #define TLB_CONTROL_FLUSH_ASID		3
>  #define TLB_CONTROL_FLUSH_ASID_LOCAL	7
>  
> +#define ERAP_CONTROL_ALLOW_LARGER_RAP	BIT(0)
> +#define ERAP_CONTROL_CLEAR_RAP		BIT(1)
> +
>  #define V_TPR_MASK   0x0f
>  
>  #define V_IRQ_SHIFT 8
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 52524e0ca97f..93934d4f8f11 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1795,8 +1795,14 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
>  		break;
>  	case 0x80000021:
> -		entry->ebx = entry->edx = 0;
> +		entry->edx = 0;
>  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> +
> +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> +			entry->ebx &= GENMASK(23, 16);
> +		else
> +			entry->ebx = 0;
> +
>  		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
>  		break;
>  	/* AMD Extended Performance Monitoring and Debug */
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index a6443feab252..de51595e875c 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -869,6 +869,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>  		}
>  	}
>  
> +	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))

My bad, this should be

	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))

because KVM doesn't need to flush the RAP if the virtual CPU isn't
ERAPS-capable (L1 is responsible for manually clearing the RAP).  Ditto for
setting ERAP_CONTROL_ALLOW_LARGER_RAP in init_vmcb() and below.  If userspace
decides not to expose ERAPS to L1 for whatever reason, then KVM should honor
that.

> +		vmcb02->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;

Suspiciously absent is support for running L2 with
ERAP_CONTROL_ALLOW_LARGER_RAP.  init_vmcb() always operates on vmcb01.  In
theory, we could set ALLOW_LARGER_RAP in vmcb02 if it's supported in the
virtual CPU, regardless of what vmcb12 says.  But given that this is security
related, I think it makes sense to honor L1's wishes, even though that means
L1 also needs to be updated to fully benefit from ERAPS.

Compile tested only at this point, but this?
--
From: Amit Shah <amit.shah@amd.com>
Date: Fri, 7 Nov 2025 10:32:39 +0100
Subject: [PATCH] KVM: SVM: Virtualize and advertise support for ERAPS
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS)
feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER
sequences right after VMEXITs.  ERAPS adds guest/host tags to entries in
the RSB (a.k.a. RAP).  This helps with speculation protection across the
VM boundary, and it also preserves host and guest entries in the RSB
that can improve software performance (which would otherwise be flushed
due to the FILL_RETURN_BUFFER sequences).

Importantly, ERAPS also improves cross-domain security by clearing the
RAP in certain situations.  Specifically, the RAP is cleared in response
to actions that are typically tied to software context switching between
tasks.  Per the APM:

  The ERAPS feature eliminates the need to execute CALL instructions to
  clear the return address predictor in most cases. On processors that
  support ERAPS, return addresses from CALL instructions executed in
  host mode are not used in guest mode, and vice versa. Additionally,
  the return address predictor is cleared in all cases when the TLB is
  implicitly invalidated and in the following cases:

  • MOV CR3 instruction
  • INVPCID other than single address invalidation (operation type 0)

ERAPS also allows CPUs to extend the size of the RSB/RAP from the older
standard (of 32 entries) to a new size, enumerated in CPUID leaf
0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs).

In hardware, ERAPS is always-on; when running in host context, the CPU
uses the full RSB/RAP size without any software changes necessary.
However, when running in guest context, the CPU utilizes the full size
of the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in
the VMCB; if the flag is not set, the CPU limits itself to the
historical size of 32 entries.

Requiring software to opt-in for guest usage of RAPs larger than 32
entries allows hypervisors, i.e. KVM, to emulate the aforementioned
conditions in which the RAP is cleared as well as the guest/host split.
E.g. if the CPU unconditionally used the full RAP for guests, failure to
clear the RAP on transitions between L1 or L2, or on emulated guest TLB
flushes, would expose the guest to RAP-based attacks as a guest without
support for ERAPS wouldn't know that its FILL_RETURN_BUFFER sequence is
insufficient.

Address the ~two broad categories of ERAPS emulation, and advertise
ERAPS support to userspace, along with the RAP size enumerated in CPUID.

1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP
   entries on several conditions, including CR3 updates.  To handle
   scenarios where a relevant operation is handled in common code
   (emulation of INVPCID and to a lesser extent MOV CR3), piggyback
   VCPU_EXREG_CR3 and create an alias, VCPU_EXREG_ERAPS.  SVM doesn't
   utilize CR3 dirty tracking, and so for all intents and purposes
   VCPU_EXREG_CR3 is unused.  Aliasing VCPU_EXREG_ERAPS ensures that any
   flow that writes CR3 will also clear the guest's RAP, and allows
   common x86 to mark ERAPS vCPUs as needing a RAP clear without having
   to add a new request (or other mechanism).

2. Nested guests: the ERAPS feature adds host/guest tagging to entries
   in the RSB, but does not distinguish between the guest ASIDs.  To
   prevent the case of an L2 guest poisoning the RSB to attack the L1
   guest, the CPU exposes a new VMCB bit (CLEAR_RAP).  The next VMRUN
   with a VMCB that has this bit set causes the CPU to flush the RSB
   before entering the guest context.  Set the bit in VMCB01 after a
   nested #VMEXIT to ensure the next time the L1 guest runs, its RSB
   contents aren't polluted by the L2's contents.  Similarly, before
   entry into a nested guest, set the bit for VMCB02, so that the L1
   guest's RSB contents are not leaked/used in the L2 context.

Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is
exposed to the guest.  Enabling ALLOW_LARGER_RAP unconditionally
wouldn't cause any functional issues, but ignoring userspace's (and
L1's) desires would put KVM into a grey area, which is especially
undesirable due to the potential security implications.  E.g. if a use
case wants to have L1 do manual RAP clearing even when ERAPS is present
in hardware, enabling ALLOW_LARGER_RAP could result in L1 leaving stale
entries in the RAP.

ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and
later.

Signed-off-by: Amit Shah <amit.shah@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  8 ++++++++
 arch/x86/include/asm/svm.h         |  6 +++++-
 arch/x86/kvm/cpuid.c               |  9 ++++++++-
 arch/x86/kvm/svm/nested.c          | 18 ++++++++++++++++++
 arch/x86/kvm/svm/svm.c             | 25 ++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.h             |  1 +
 arch/x86/kvm/x86.c                 | 12 ++++++++++++
 8 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 646d2a77a2e2..31ab1e4e70f3 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -468,6 +468,7 @@
 #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
 
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5a3bfa293e8b..0353d8b6988c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -195,7 +195,15 @@ enum kvm_reg {
 	VCPU_EXREG_PDPTR = NR_VCPU_REGS,
 	VCPU_EXREG_CR0,
+	/*
+	 * Alias AMD's ERAPS (not a real register) to CR3 so that common code
+	 * can trigger emulation of the RAP (Return Address Predictor) with
+	 * minimal support required in common code.  Piggyback CR3 as the RAP
+	 * is cleared on writes to CR3, i.e. marking CR3 dirty will naturally
+	 * mark ERAPS dirty as well.
+	 */
 	VCPU_EXREG_CR3,
+	VCPU_EXREG_ERAPS = VCPU_EXREG_CR3,
 	VCPU_EXREG_CR4,
 	VCPU_EXREG_RFLAGS,
 	VCPU_EXREG_SEGMENTS,
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index e69b6d0dedcf..348957bda488 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID		3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL	7
 
+#define ERAP_CONTROL_ALLOW_LARGER_RAP	BIT(0)
+#define ERAP_CONTROL_CLEAR_RAP		BIT(1)
+
 #define V_TPR_MASK   0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index d563a948318b..8bed1224635d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1216,6 +1216,7 @@ void kvm_set_cpu_caps(void)
 		/* PrefetchCtlMsr */
 		/* GpOnUserCpuid */
 		/* EPSF */
+		F(ERAPS),
 		SYNTHESIZED_F(SBPB),
 		SYNTHESIZED_F(IBPB_BRTYPE),
 		SYNTHESIZED_F(SRSO_NO),
@@ -1796,8 +1797,14 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
+
 		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
 		break;
 	/* AMD Extended Performance Monitoring and Debug */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index c81005b24522..5dc6b8915809 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -417,6 +417,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 	to->msrpm_base_pa	= from->msrpm_base_pa;
 	to->tsc_offset		= from->tsc_offset;
 	to->tlb_ctl		= from->tlb_ctl;
+	to->erap_ctl		= from->erap_ctl;
 	to->int_ctl		= from->int_ctl;
 	to->int_vector		= from->int_vector;
 	to->int_state		= from->int_state;
@@ -866,6 +867,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 		}
 	}
 
+	/*
+	 * Take ALLOW_LARGER_RAP from vmcb12 even though it should be safe to
+	 * let L2 use a larger RAP since KVM will emulate the necessary clears,
+	 * as it's possible L1 deliberately wants to restrict L2 to the legacy
+	 * RAP size.  Unconditionally clear the RAP on nested VMRUN, as KVM is
+	 * responsible for emulating the host vs. guest tags (L1 is the "host",
+	 * L2 is the "guest").
+	 */
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		vmcb02->control.erap_ctl = (svm->nested.ctl.erap_ctl &
+					    ERAP_CONTROL_ALLOW_LARGER_RAP) |
+					   ERAP_CONTROL_CLEAR_RAP;
+
 	/*
 	 * Merge guest and host intercepts - must be called with vcpu in
 	 * guest-mode to take effect.
@@ -1161,6 +1175,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	kvm_nested_vmexit_handle_ibrs(vcpu);
 
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	/*
@@ -1667,6 +1684,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
 	dst->tsc_offset           = from->tsc_offset;
 	dst->asid                 = from->asid;
 	dst->tlb_ctl              = from->tlb_ctl;
+	dst->erap_ctl             = from->erap_ctl;
 	dst->int_ctl              = from->int_ctl;
 	dst->int_vector           = from->int_vector;
 	dst->int_state            = from->int_state;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f56c2d895011..4f1407b9d0a2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1141,6 +1141,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event)
 		svm_clr_intercept(svm, INTERCEPT_PAUSE);
 	}
 
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -3271,6 +3274,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -3982,6 +3986,13 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 	invlpga(gva, svm->vmcb->control.asid);
 }
 
+static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	kvm_register_mark_dirty(vcpu, VCPU_EXREG_ERAPS);
+
+	svm_flush_tlb_asid(vcpu);
+}
+
 static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4240,6 +4251,10 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 	svm->vmcb->save.cr2 = vcpu->arch.cr2;
 
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS) &&
+	    kvm_register_is_dirty(vcpu, VCPU_EXREG_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	svm_hv_update_vp_id(svm->vmcb, vcpu);
 
 	/*
@@ -4317,6 +4332,14 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
+	/*
+	 * Unconditionally mask off the CLEAR_RAP bit, the AND is just as cheap
+	 * as the TEST+Jcc to avoid it.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl &= ~ERAP_CONTROL_CLEAR_RAP;
+
 	vmcb_mark_all_clean(svm->vmcb);
 
 	/* if exit due to PF check for async PF */
@@ -5071,7 +5094,7 @@ struct kvm_x86_ops svm_x86_ops __initdata = {
 	.flush_tlb_all = svm_flush_tlb_all,
 	.flush_tlb_current = svm_flush_tlb_current,
 	.flush_tlb_gva = svm_flush_tlb_gva,
-	.flush_tlb_guest = svm_flush_tlb_asid,
+	.flush_tlb_guest = svm_flush_tlb_guest,
 
 	.vcpu_pre_run = svm_vcpu_pre_run,
 	.vcpu_run = svm_vcpu_run,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9e151dbdef25..96eab58830ce 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -156,6 +156,7 @@ struct vmcb_ctrl_area_cached {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
+	u8 erap_ctl;
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c6d899d53dd..98c177ebf2a2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14123,6 +14123,13 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)
 			return 1;
 		}
 
+		/*
+		 * When ERAPS is supported, invalidating a specific PCID clears
+		 * the RAP (Return Address Predictor).
+		 */
+		if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+			kvm_register_mark_dirty(vcpu, VCPU_EXREG_ERAPS);
+
 		kvm_invalidate_pcid(vcpu, operand.pcid);
 		return kvm_skip_emulated_instruction(vcpu);
 
@@ -14136,6 +14143,11 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)
 		fallthrough;
 	case INVPCID_TYPE_ALL_INCL_GLOBAL:
+		/*
+		 * Don't bother marking VCPU_EXREG_ERAPS dirty, SVM will take
+		 * care of doing so when emulating the full guest TLB flush
+		 * (the RAP is cleared on all implicit TLB flushes).
+		 */
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 		return kvm_skip_emulated_instruction(vcpu);
 

base-commit: 0c3b67dddd1051015f5504389a551ecd260488a5
-- 

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-20 20:11 ` Sean Christopherson @ 2025-11-21 2:40 ` Andrew Cooper 2025-11-21 14:58 ` Sean Christopherson 2025-11-24 16:15 ` Shah, Amit 1 sibling, 1 reply; 11+ messages in thread From: Andrew Cooper @ 2025-11-21 2:40 UTC (permalink / raw) To: Sean Christopherson, Amit Shah Cc: linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang, sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw On 20/11/2025 8:11 pm, Sean Christopherson wrote: > KVM: SVM: > > On Fri, Nov 07, 2025, Amit Shah wrote: >> From: Amit Shah <amit.shah@amd.com> >> >> AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature > Enhanced Return Address Predictor Security. The 'S' matters. > >> Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after >> VMEXITs. The feature adds guest/host tags to entries in the RSB (a.k.a. >> RAP). This helps with speculation protection across the VM boundary, >> and it also preserves host and guest entries in the RSB that can improve >> software performance (which would otherwise be flushed due to the >> FILL_RETURN_BUFFER sequences). This feature also extends the size of >> the RSB from the older standard (of 32 entries) to a new default >> enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64 >> entries in Zen5 CPUs. >> >> The hardware feature is always-on, and the host context uses the full >> default RSB size without any software changes necessary. The presence >> of this feature allows software (both in host and guest contexts) to >> drop all RSB filling routines in favour of the hardware doing it. >> >> There are two guest/host configurations that need to be addressed before >> allowing a guest to use this feature: nested guests, and hosts using >> shadow paging (or when NPT is disabled): >> >> 1. 
Nested guests: the ERAPS feature adds host/guest tagging to entries >> in the RSB, but does not distinguish between the guest ASIDs. To >> prevent the case of an L2 guest poisoning the RSB to attack the L1 >> guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next >> VMRUN with a VMCB that has this bit set causes the CPU to flush the >> RSB before entering the guest context. Set the bit in VMCB01 after a >> nested #VMEXIT to ensure the next time the L1 guest runs, its RSB >> contents aren't polluted by the L2's contents. Similarly, before >> entry into a nested guest, set the bit for VMCB02, so that the L1 >> guest's RSB contents are not leaked/used in the L2 context. >> >> 2. Hosts that disable NPT: the ERAPS feature flushes the RSB entries on >> several conditions, including CR3 updates. Emulating hardware >> behaviour on RSB flushes is not worth the effort for NPT=off case, >> nor is it worthwhile to enumerate and emulate every trigger the >> hardware uses to flush RSB entries. Instead of identifying and >> replicating RSB flushes that hardware would have performed had NPT >> been ON, do not let NPT=off VMs use the ERAPS features. > The emulation requirements are not limited to shadow paging. From the APM: > > The ERAPS feature eliminates the need to execute CALL instructions to clear > the return address predictor in most cases. On processors that support ERAPS, > return addresses from CALL instructions executed in host mode are not used in > guest mode, and vice versa. Additionally, the return address predictor is > cleared in all cases when the TLB is implicitly invalidated (see Section 5.5.3 “TLB > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Management,” on page 159) and in the following cases: > > • MOV CR3 instruction > • INVPCID other than single address invalidation (operation type 0) I already asked AMD for clarification here. 
AIUI, INVLPGB should be included in this list, and that begs the question what else is missed from the documentation. > > Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or INVPCID is > unsupported per guest CPUID), but that is an implementation detail, the instructions > are still reachable via emulator, and KVM needs to emulate implicit TLB flush > behavior. The Implicit flushes cover CR0.PG, CR4.{PSE,PGE,PCIDE,PKE}, SMI, RSM, writes to MTRR MSR, #INIT, A20M, and "other model specific MSRs, see NDA docs". The final part is very unhelpful in practice, and necessitates a RAS flush on any emulated WRMSR, unless AMD are going to start handing out the multi-coloured documents... The really fastpath MSRs are unintercepted and won't suffer this overhead. ~Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-21 2:40 ` Andrew Cooper @ 2025-11-21 14:58 ` Sean Christopherson 2025-11-21 15:21 ` Andrew Cooper 0 siblings, 1 reply; 11+ messages in thread From: Sean Christopherson @ 2025-11-21 14:58 UTC (permalink / raw) To: Andrew Cooper Cc: Amit Shah, linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang, sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw On Fri, Nov 21, 2025, Andrew Cooper wrote: > On 20/11/2025 8:11 pm, Sean Christopherson wrote: > > The emulation requirements are not limited to shadow paging. From the APM: > > > > The ERAPS feature eliminates the need to execute CALL instructions to clear > > the return address predictor in most cases. On processors that support ERAPS, > > return addresses from CALL instructions executed in host mode are not used in > > guest mode, and vice versa. Additionally, the return address predictor is > > cleared in all cases when the TLB is implicitly invalidated (see Section 5.5.3 “TLB > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Management,” on page 159) and in the following cases: > > > > • MOV CR3 instruction > > • INVPCID other than single address invalidation (operation type 0) > > I already asked AMD for clarification here. AIUI, INVLPGB should be > included in this list, and that begs the question what else is missed > from the documentation. > > > > > Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or INVPCID is > > unsupported per guest CPUID), but that is an implementation detail, the instructions > > are still reachable via emulator, and KVM needs to emulate implicit TLB flush > > behavior. > > The Implicit flushes cover CR0.PG, CR4.{PSE,PGE,PCIDE,PKE}, SMI, RSM, > writes to MTRR MSR, #INIT, A20M, and "other model specific MSRs, see NDA > docs". 
> > The final part is very unhelpful in practice, and necessitates a RAS > flush on any emulated WRMSR, unless AMD are going to start handing out > the multi-coloured documents... Does Xen actually emulate guest TLB flushes on all emulated WRMSRs? A RAS flush seems like small peanuts compared to a TLB flush. > The really fastpath MSRs are unintercepted and won't suffer this overhead. Heh, if an unintercepted MSR is on the "naughty list", wouldn't that break shadow paging schemes that rely on intercepting architectural TLB flushes to synchronize shadow PTEs? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-21 14:58 ` Sean Christopherson @ 2025-11-21 15:21 ` Andrew Cooper 0 siblings, 0 replies; 11+ messages in thread From: Andrew Cooper @ 2025-11-21 15:21 UTC (permalink / raw) To: Sean Christopherson Cc: Amit Shah, linux-kernel, kvm, x86, linux-doc, amit.shah, thomas.lendacky, bp, tglx, peterz, jpoimboe, pawan.kumar.gupta, corbet, mingo, dave.hansen, hpa, pbonzini, daniel.sneddon, kai.huang, sandipan.das, boris.ostrovsky, Babu.Moger, david.kaplan, dwmw On 21/11/2025 2:58 pm, Sean Christopherson wrote: > On Fri, Nov 21, 2025, Andrew Cooper wrote: >> On 20/11/2025 8:11 pm, Sean Christopherson wrote: >>> The emulation requirements are not limited to shadow paging. From the APM: >>> >>> The ERAPS feature eliminates the need to execute CALL instructions to clear >>> the return address predictor in most cases. On processors that support ERAPS, >>> return addresses from CALL instructions executed in host mode are not used in >>> guest mode, and vice versa. Additionally, the return address predictor is >>> cleared in all cases when the TLB is implicitly invalidated (see Section 5.5.3 “TLB >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>> Management,” on page 159) and in the following cases: >>> >>> • MOV CR3 instruction >>> • INVPCID other than single address invalidation (operation type 0) >> I already asked AMD for clarification here. AIUI, INVLPGB should be >> included in this list, and that begs the question what else is missed >> from the documentation. >> >>> Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or INVPCID is >>> unsupported per guest CPUID), but that is an implementation detail, the instructions >>> are still reachable via emulator, and KVM needs to emulate implicit TLB flush >>> behavior. >> The Implicit flushes cover CR0.PG, CR4.{PSE,PGE,PCIDE,PKE}, SMI, RSM, >> writes to MTRR MSR, #INIT, A20M, and "other model specific MSRs, see NDA >> docs". 
>> >> The final part is very unhelpful in practice, and necessitates a RAS >> flush on any emulated WRMSR, unless AMD are going to start handing out >> the multi-coloured documents... > Does Xen actually emulate guest TLB flushes on all emulated WRMSRs? Not currently. I need to reassess in light of this conversation. > A RAS flush seems like small peanuts compared to a TLB flush. Workload dependent, but in the common case, I'd expect so. > >> The really fastpath MSRs are unintercepted and won't suffer this overhead. > Heh, if an unintercepted MSR is on the "naughty list", wouldn't that break shadow > paging schemes that rely on intercepting architectural TLB flushes to synchronize > shadow PTEs? Hmm. Yes it would, if (and only if) the OS is aware of and depending on the WRMSR for TLB flushing. I doubt OSes are depending on model specific side effects such as this, but we have no way to know for sure. ~Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-20 20:11 ` Sean Christopherson 2025-11-21 2:40 ` Andrew Cooper @ 2025-11-24 16:15 ` Shah, Amit 2025-11-24 16:40 ` Andrew Cooper 1 sibling, 1 reply; 11+ messages in thread From: Shah, Amit @ 2025-11-24 16:15 UTC (permalink / raw) To: seanjc@google.com Cc: corbet@lwn.net, pawan.kumar.gupta@linux.intel.com, kai.huang@intel.com, jpoimboe@kernel.org, andrew.cooper3@citrix.com, dave.hansen@linux.intel.com, daniel.sneddon@linux.intel.com, Lendacky, Thomas, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com, tglx@linutronix.de, Moger, Babu, Das1, Sandipan, linux-doc@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, bp@alien8.de, boris.ostrovsky@oracle.com, Kaplan, David, x86@kernel.org On Thu, 2025-11-20 at 12:11 -0800, Sean Christopherson wrote: > > > 2. Hosts that disable NPT: the ERAPS feature flushes the RSB > > entries on > > several conditions, including CR3 updates. Emulating hardware > > behaviour on RSB flushes is not worth the effort for NPT=off > > case, > > nor is it worthwhile to enumerate and emulate every trigger the > > hardware uses to flush RSB entries. Instead of identifying and > > replicating RSB flushes that hardware would have performed had > > NPT > > been ON, do not let NPT=off VMs use the ERAPS features. > > The emulation requirements are not limited to shadow paging. From > the APM: > > The ERAPS feature eliminates the need to execute CALL instructions > to clear > the return address predictor in most cases. On processors that > support ERAPS, > return addresses from CALL instructions executed in host mode are > not used in > guest mode, and vice versa. 
Additionally, the return address > predictor is > cleared in all cases when the TLB is implicitly invalidated (see > Section 5.5.3 “TLB > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Management,” on page 159) and in the following cases: > > • MOV CR3 instruction > • INVPCID other than single address invalidation (operation type 0) > > Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or > INVPCID is > unsupported per guest CPUID), but that is an implementation detail, > the instructions > are still reachable via emulator, and KVM needs to emulate implicit > TLB flush > behavior. > > So punting on emulating RAP clearing because it's too hard is not an > option. And > AFAICT, it's not even that hard. I didn't mean on punting it in the "it's too hard" sense, but in the sense that we don't know all the details of when hardware decides to do a flush; and even if triggers are mentioned in this APM today, future changes to microcode or APM docs might reveal more triggers that we need to emulate and account for. There's no way to track such changes, so my thinking is that we should be conservative and not assume anything. > The changelog also needs to include the architectural behavior, > otherwise "is not > worth the effort" is even more subjective since there's no > documentation of what > the effort would actually be. > As for emulating the RAP clears, a clever idea is to piggyback and > alias dirty > tracking for VCPU_EXREG_CR3, as VCPU_EXREG_ERAPS. I.e. mark the vCPU > as needing > a RAP clear if CR3 is written to, and then let common x86 also set > VCPU_EXREG_ERAPS > as needed, e.g. when handling INVPCID. > Compile tested only at this point, but this? I'll run this on my hardware and check for anything obvious. Since you're also saying the npt=on and npt=off cases aren't very different, I'll check with the hardware architects to confirm we can indeed go with the RAP clearing triggers as presented. 
Amit ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-24 16:15 ` Shah, Amit @ 2025-11-24 16:40 ` Andrew Cooper 2025-11-25 14:41 ` Shah, Amit 2025-12-11 16:09 ` Shah, Amit 0 siblings, 2 replies; 11+ messages in thread From: Andrew Cooper @ 2025-11-24 16:40 UTC (permalink / raw) To: Shah, Amit, seanjc@google.com Cc: corbet@lwn.net, pawan.kumar.gupta@linux.intel.com, kai.huang@intel.com, jpoimboe@kernel.org, dave.hansen@linux.intel.com, daniel.sneddon@linux.intel.com, Lendacky, Thomas, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com, tglx@linutronix.de, Moger, Babu, Das1, Sandipan, linux-doc@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, bp@alien8.de, boris.ostrovsky@oracle.com, Kaplan, David, x86@kernel.org On 24/11/2025 4:15 pm, Shah, Amit wrote: > On Thu, 2025-11-20 at 12:11 -0800, Sean Christopherson wrote: >>> 2. Hosts that disable NPT: the ERAPS feature flushes the RSB >>> entries on >>> several conditions, including CR3 updates. Emulating hardware >>> behaviour on RSB flushes is not worth the effort for NPT=off >>> case, >>> nor is it worthwhile to enumerate and emulate every trigger the >>> hardware uses to flush RSB entries. Instead of identifying and >>> replicating RSB flushes that hardware would have performed had >>> NPT >>> been ON, do not let NPT=off VMs use the ERAPS features. >> The emulation requirements are not limited to shadow paging. From >> the APM: >> >> The ERAPS feature eliminates the need to execute CALL instructions >> to clear >> the return address predictor in most cases. On processors that >> support ERAPS, >> return addresses from CALL instructions executed in host mode are >> not used in >> guest mode, and vice versa. 
Additionally, the return address >> predictor is >> cleared in all cases when the TLB is implicitly invalidated (see >> Section 5.5.3 “TLB >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> Management,” on page 159) and in the following cases: >> >> • MOV CR3 instruction >> • INVPCID other than single address invalidation (operation type 0) >> >> Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled (or >> INVPCID is >> unsupported per guest CPUID), but that is an implementation detail, >> the instructions >> are still reachable via emulator, and KVM needs to emulate implicit >> TLB flush >> behavior. >> >> So punting on emulating RAP clearing because it's too hard is not an >> option. And >> AFAICT, it's not even that hard. > I didn't mean on punting it in the "it's too hard" sense, but in the > sense that we don't know all the details of when hardware decides to do > a flush; and even if triggers are mentioned in this APM today, future > changes to microcode or APM docs might reveal more triggers that we > need to emulate and account for. There's no way to track such changes, > so my thinking is that we should be conservative and not assume > anything. But this *is* the problem. The APM says that OSes can depend on this property for safety, and does not provide enough information for Hypervisors to make it safe. ERAPS is a bad spec. It should not have gotten out of the door. A better spec would say "clears the RAP on any MOV to CR3" and nothing else. The fact that it might happen microarchitecturally in other cases doesn't matter; what matters is what OSes can architecturally depend on, and right now that explicitly includes "unspecified cases in NDA documents". ~Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-24 16:40 ` Andrew Cooper @ 2025-11-25 14:41 ` Shah, Amit 2025-11-25 14:54 ` Sean Christopherson 2025-12-11 16:09 ` Shah, Amit 1 sibling, 1 reply; 11+ messages in thread From: Shah, Amit @ 2025-11-25 14:41 UTC (permalink / raw) To: seanjc@google.com, andrew.cooper3@citrix.com Cc: corbet@lwn.net, pawan.kumar.gupta@linux.intel.com, kai.huang@intel.com, jpoimboe@kernel.org, dave.hansen@linux.intel.com, daniel.sneddon@linux.intel.com, Lendacky, Thomas, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com, tglx@linutronix.de, Moger, Babu, Das1, Sandipan, linux-doc@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, bp@alien8.de, boris.ostrovsky@oracle.com, Kaplan, David, x86@kernel.org On Mon, 2025-11-24 at 16:40 +0000, Andrew Cooper wrote: > On 24/11/2025 4:15 pm, Shah, Amit wrote: > > On Thu, 2025-11-20 at 12:11 -0800, Sean Christopherson wrote: > > > > 2. Hosts that disable NPT: the ERAPS feature flushes the RSB > > > > entries on > > > > several conditions, including CR3 updates. Emulating > > > > hardware > > > > behaviour on RSB flushes is not worth the effort for NPT=off > > > > case, > > > > nor is it worthwhile to enumerate and emulate every trigger > > > > the > > > > hardware uses to flush RSB entries. Instead of identifying > > > > and > > > > replicating RSB flushes that hardware would have performed > > > > had > > > > NPT > > > > been ON, do not let NPT=off VMs use the ERAPS features. > > > The emulation requirements are not limited to shadow paging. > > > From > > > the APM: > > > > > > The ERAPS feature eliminates the need to execute CALL > > > instructions > > > to clear > > > the return address predictor in most cases. On processors that > > > support ERAPS, > > > return addresses from CALL instructions executed in host mode > > > are > > > not used in > > > guest mode, and vice versa. 
Additionally, the return address > > > predictor is > > > cleared in all cases when the TLB is implicitly invalidated > > > (see > > > Section 5.5.3 “TLB > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > Management,” on page 159) and in the following cases: > > > > > > • MOV CR3 instruction > > > • INVPCID other than single address invalidation (operation > > > type 0) > > > > > > Yes, KVM only intercepts MOV CR3 and INVPCID when NPT is disabled > > > (or > > > INVPCID is > > > unsupported per guest CPUID), but that is an implementation > > > detail, > > > the instructions > > > are still reachable via emulator, and KVM needs to emulate > > > implicit > > > TLB flush > > > behavior. > > > > > > So punting on emulating RAP clearing because it's too hard is not > > > an > > > option. And > > > AFAICT, it's not even that hard. > > I didn't mean on punting it in the "it's too hard" sense, but in > > the > > sense that we don't know all the details of when hardware decides > > to do > > a flush; and even if triggers are mentioned in this APM today, > > future > > changes to microcode or APM docs might reveal more triggers that we > > need to emulate and account for. There's no way to track such > > changes, > > so my thinking is that we should be conservative and not assume > > anything. > > But this *is* the problem. The APM says that OSes can depend on this > property for safety, and does not provide enough information for > Hypervisors to make it safe. That's certainly true - that's driving my reluctance to perform the emulation or in enabling it for cases that aren't completely clear. > ERAPS is a bad spec. It should not have gotten out of the door. > > A better spec would say "clears the RAP on any MOV to CR3" and > nothing else. 
> > The fact that it might happen microarchitecturally in other cases > doesn't matter; what matters is what OSes can architecturally depend > on, > and right now that that explicitly includes "unspecified cases in NDA > documents". To be honest, I haven't seen the mention of those unspecified cases or NDA documents. However, at least for the case of an NPT guest, the hypervisor does not need to do anything special (other than handle nested guests as this patch does). Amit ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-25 14:41 ` Shah, Amit @ 2025-11-25 14:54 ` Sean Christopherson 0 siblings, 0 replies; 11+ messages in thread From: Sean Christopherson @ 2025-11-25 14:54 UTC (permalink / raw) To: Amit Shah Cc: andrew.cooper3@citrix.com, corbet@lwn.net, pawan.kumar.gupta@linux.intel.com, kai.huang@intel.com, jpoimboe@kernel.org, dave.hansen@linux.intel.com, daniel.sneddon@linux.intel.com, Thomas Lendacky, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com, tglx@linutronix.de, Babu Moger, Sandipan Das1, linux-doc@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, bp@alien8.de, boris.ostrovsky@oracle.com, David Kaplan, x86@kernel.org On Tue, Nov 25, 2025, Amit Shah wrote: > On Mon, 2025-11-24 at 16:40 +0000, Andrew Cooper wrote: > > > > So punting on emulating RAP clearing because it's too hard is not > > > > an option. And AFAICT, it's not even that hard. > > > I didn't mean on punting it in the "it's too hard" sense, but in the > > > sense that we don't know all the details of when hardware decides to do a > > > flush; and even if triggers are mentioned in this APM today, future > > > changes to microcode or APM docs might reveal more triggers that we need > > > to emulate and account for. There's no way to track such changes, so my > > > thinking is that we should be conservative and not assume anything. > > > > But this *is* the problem. The APM says that OSes can depend on this > > property for safety, and does not provide enough information for > > Hypervisors to make it safe. > > That's certainly true - that's driving my reluctance to perform the > emulation or in enabling it for cases that aren't completely clear. Uh, I think you're misunderstanding what Andrew and I are saying. Doing nothing is the worst option. > > ERAPS is a bad spec. It should not have gotten out of the door. 
> > > > A better spec would say "clears the RAP on any MOV to CR3" and > > nothing else. > > > > The fact that it might happen microarchitecturally in other cases doesn't > > matter; what matters is what OSes can architecturally depend on, and right > > now that that explicitly includes "unspecified cases in NDA documents". > > To be honest, I haven't seen the mention of those unspecified cases or > NDA documents. > > However, at least for the case of an NPT guest, the hypervisor does not > need to do anything special (other than handle nested guests as this > patch does). How on earth do you come to that conclusion? I'm genuinely baffled as to why you think it's safe to completely ignore RAP clears that are architecturally supposed to happen from the guest's perspective. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests 2025-11-24 16:40 ` Andrew Cooper 2025-11-25 14:41 ` Shah, Amit @ 2025-12-11 16:09 ` Shah, Amit 1 sibling, 0 replies; 11+ messages in thread From: Shah, Amit @ 2025-12-11 16:09 UTC (permalink / raw) To: seanjc@google.com, andrew.cooper3@citrix.com Cc: corbet@lwn.net, pawan.kumar.gupta@linux.intel.com, kai.huang@intel.com, jpoimboe@kernel.org, dave.hansen@linux.intel.com, daniel.sneddon@linux.intel.com, Lendacky, Thomas, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com, tglx@linutronix.de, Moger, Babu, Das1, Sandipan, linux-doc@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, bp@alien8.de, boris.ostrovsky@oracle.com, Kaplan, David, x86@kernel.org On Mon, 2025-11-24 at 16:40 +0000, Andrew Cooper wrote: [...] > But this *is* the problem. The APM says that OSes can depend on this > property for safety, and does not provide enough information for > Hypervisors to make it safe. > > ERAPS is a bad spec. It should not have gotten out of the door. > > A better spec would say "clears the RAP on any MOV to CR3" and > nothing else. > > The fact that it might happen microarchitecturally in other cases > doesn't matter; what matters is what OSes can architecturally depend > on, > and right now that that explicitly includes "unspecified cases in NDA > documents". I'd like to clarify and confirm the details around TLB flushes and their effect on the RAP here as an official AMD statement. First, I'd like to clarify that INVLPGB does not flush the RAP. Second, referring to the APM at https://docs.amd.com/v/u/en-US/24593_3.43 Section 3.2.9: a move to CR3 and the execution of INVPCID are the instructions that result in the flushing of the RAP. The reference to section 5.5.3 - TLB Management - and the implicit TLB flushes there was unclear, which led to most of the speculation in this discussion earlier in this thread. 
For section 5.5.3, we will update the APM to clarify that the "writes to certain MSRs" relates to microarchitectural behavior. The updated wording for section 5.5.3 will make its way into a future APM update. (For the curious, the list of those MSRs currently is APIC_BASE, PREFETCH_CONTROL, SYSCFG, IORRs, TOM/TOM2, SMMADDR/MASK, ECS_BASE, but of course this is subject to change.) Coming back to the patch here with these clarifications from AMD architects - I am happy with Sean's update to the patch. I've also tested the patch and it works as expected on a Zen5 CPU. Reviewed-by: Amit Shah <amit.shah@amd.com> Thanks, Amit ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-12-11 16:09 UTC | newest]

Thread overview: 11+ messages
2025-11-07  9:32 [PATCH v6 0/1] KVM: Add support for the ERAPS feature Amit Shah
2025-11-07  9:32 ` [PATCH v6 1/1] x86: kvm: svm: set up ERAPS support for guests Amit Shah
2025-11-20 20:11   ` Sean Christopherson
2025-11-21  2:40     ` Andrew Cooper
2025-11-21 14:58       ` Sean Christopherson
2025-11-21 15:21         ` Andrew Cooper
2025-11-24 16:15       ` Shah, Amit
2025-11-24 16:40         ` Andrew Cooper
2025-11-25 14:41           ` Shah, Amit
2025-11-25 14:54             ` Sean Christopherson
2025-12-11 16:09           ` Shah, Amit