[RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)

* [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
@ 2025-03-13 20:36 Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 01/18] KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines Jon Kohler
                   ` (20 more replies)
  0 siblings, 21 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler, Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Mickaël Salaün, Tao Su,
	Xiaoyao Li, Zhao Liu

## Summary
This series introduces support for Intel Mode-Based Execute Control
(MBEC) to KVM and nested VMX virtualization, aiming to significantly
reduce VMexits and improve performance for Windows guests running with
Hypervisor-Protected Code Integrity (HVCI).

## What?
Intel MBEC is a hardware feature, introduced in the Kaby Lake
generation, that allows for more granular control over execution
permissions. MBEC enables the separation and tracking of execution
permissions for supervisor (kernel) and user-mode code. It is used as
an accelerator for Microsoft's Memory Integrity [1] (also known as
hypervisor-protected code integrity or HVCI).

## Why?
The primary reason for this feature is performance.

Without hardware-level MBEC, enabling Windows HVCI runs a 'software
MBEC' known as Restricted User Mode, which imposes a runtime overhead
due to increased state transitions between the guest's L2 root
partition and the L2 secure partition for running kernel mode code
integrity operations.

In practice, this results in a significant number of exits. For
example, playing a YouTube video within the Edge Browser produces
roughly 1.2 million VMexits/second across an 8 vCPU Windows 11 guest.

Most of these exits are VMREAD/VMWRITE operations, which can be
emulated with Enlightened VMCS (eVMCS) [2]. However, even with eVMCS,
this configuration still produces around 200,000 VMexits/second.

With MBEC exposed to the L1 Windows Hypervisor, the same scenario
results in approximately 50,000 VMexits/second, a *24x* reduction from
the baseline.

Not a typo, 24x reduction in VMexits.

## How?
This series implements core KVM support for exposing the MBEC bit in
secondary execution controls (bit 22) to L1 and L2, based on
configuration from user space and a module parameter
'enable_pt_guest_exec_control'. The inspiration for this series
started with Mickaël's series for Heki [3], where we've extracted,
refactored, and extended the MBEC-specific use case to be
general-purpose.

MBEC, which appears in Linux /proc/cpuinfo as ept_mode_based_exec,
splits the EPT exec bit (bit 2 in PTE) into two bits. When secondary
execution control bit 22 is set, PTE bit 2 reflects supervisor mode
executable, and PTE bit 10 reflects user mode executable.
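
As a rough illustration of the split described above, the following
user-space sketch decides whether an instruction fetch is allowed by
an EPT entry. The macro and function names are made up for this
example and are not taken from the series; only the bit positions
(2 and 10) come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the description above. */
#define EPT_EXEC_BIT		(1ULL << 2)	/* exec, or supervisor exec with MBEC */
#define EPT_USER_EXEC_BIT	(1ULL << 10)	/* user exec, only used with MBEC */

static_assert(EPT_EXEC_BIT != EPT_USER_EXEC_BIT, "distinct permission bits");

/*
 * Return true if this EPT entry permits an instruction fetch.
 * mbec_enabled mirrors secondary execution control bit 22, and
 * user_mode says whether the fetch is treated as user mode.
 */
static bool ept_fetch_allowed(uint64_t epte, bool mbec_enabled, bool user_mode)
{
	if (mbec_enabled && user_mode)
		return (epte & EPT_USER_EXEC_BIT) != 0;

	/* Without MBEC, bit 2 covers both supervisor and user fetches. */
	return (epte & EPT_EXEC_BIT) != 0;
}
```

With MBEC off, bit 10 is ignored by this helper, so an entry with only
bit 10 set is treated as non-executable.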

The semantics for EPT violation qualifications also change when MBEC
is enabled, with bit 5 reflecting supervisor/kernel mode execute
permissions and bit 6 reflecting user mode execute permissions.
This ultimately serves to expose this feature to the L1 hypervisor,
which consumes MBEC and informs the L2 partitions not to use the
software MBEC by removing bit 14 in 0x40000004 EAX [4].
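
The exit qualification change can be sketched the same way. This is a
user-space approximation, not code from the series:
EPT_VIOLATION_PROT_EXEC matches the name used in patch 02, while
EPT_VIOLATION_PROT_USER_EXEC and the helper name are assumptions made
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EPT_VIOLATION_PROT_EXEC		(1ULL << 5)	/* exec, or supervisor exec with MBEC */
#define EPT_VIOLATION_PROT_USER_EXEC	(1ULL << 6)	/* assumed name; user exec with MBEC */

static_assert(EPT_VIOLATION_PROT_USER_EXEC == (EPT_VIOLATION_PROT_EXEC << 1),
	      "user exec bit directly follows the exec bit");

/*
 * Return true if the exit qualification reports that the faulting
 * page was executable for the given mode.
 */
static bool ept_violation_exec_permitted(uint64_t exit_qual, bool mbec_enabled,
					 bool user_mode)
{
	if (mbec_enabled && user_mode)
		return (exit_qual & EPT_VIOLATION_PROT_USER_EXEC) != 0;

	return (exit_qual & EPT_VIOLATION_PROT_EXEC) != 0;
}
```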

## Where?
Enablement spans both VMX code and MMU code to teach the shadow MMU
about the different execution modes, as well as user space VMM to pass
secondary execution control bit 22. A patch for QEMU enablement is
available [5].

## Testing
Initial testing has been done on 6.12-based code with:
  Guests:
    - Windows 11 24H2 26100.2894
    - Windows Server 2025 24H2 26100.2894
    - Windows Server 2022 21H2 20348.825
  Processors:
    - Intel Skylake 6154
    - Intel Sapphire Rapids 6444Y

## Acknowledgements
Special thanks to all contributors and reviewers who have provided
valuable feedback and support for this patch series.

[1] https://learn.microsoft.com/en-us/windows/security/hardware-security/enable-virtualization-based-protection-of-code-integrity
[2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization#enlightened-vmcs-intel
[3] https://patchwork.kernel.org/project/kvm/patch/20231113022326.24388-6-mic@digikod.net/
[4] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/feature-discovery#implementation-recommendations---0x40000004
[5] https://github.com/JonKohler/qemu/tree/mbec-rfc-v1

Cc: Alexander Grest <Alexander.Grest@microsoft.com>
Cc: Nicolas Saenz Julienne <nsaenz@amazon.es>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Mickaël Salaün <mic@digikod.net>
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Zhao Liu <zhao1.liu@intel.com>

Jon Kohler (11):
  KVM: x86: Add module parameter for Intel MBEC
  KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
  KVM: VMX: Wire up Intel MBEC enable/disable logic
  KVM: x86/mmu: Remove SPTE_PERM_MASK
  KVM: VMX: Extend EPT Violation protection bits
  KVM: x86/mmu: Introduce shadow_ux_mask
  KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
  KVM: x86/mmu: Extend make_spte to understand MBEC
  KVM: nVMX: Setup Intel MBEC in nested secondary controls
  KVM: VMX: Allow MBEC with EVMCS
  KVM: x86: Enable module parameter for MBEC

Mickaël Salaün (5):
  KVM: VMX: add cpu_has_vmx_mbec helper
  KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK
  KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
  KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC
  KVM: x86/mmu: Extend is_executable_pte to understand MBEC

Nikolay Borisov (1):
  KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines

Sean Christopherson (1):
  KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits

 arch/x86/include/asm/kvm_host.h | 13 +++++----
 arch/x86/include/asm/vmx.h      | 45 ++++++++++++++++++++---------
 arch/x86/kvm/mmu.h              |  3 +-
 arch/x86/kvm/mmu/mmu.c          | 13 +++++----
 arch/x86/kvm/mmu/mmutrace.h     | 23 ++++++++++-----
 arch/x86/kvm/mmu/paging_tmpl.h  | 19 +++++++++---
 arch/x86/kvm/mmu/spte.c         | 51 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/mmu/spte.h         | 36 +++++++++++++++--------
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 arch/x86/kvm/vmx/capabilities.h |  6 ++++
 arch/x86/kvm/vmx/hyperv.c       |  5 +++-
 arch/x86/kvm/vmx/hyperv_evmcs.h |  1 +
 arch/x86/kvm/vmx/nested.c       |  4 +++
 arch/x86/kvm/vmx/vmx.c          | 21 ++++++++++++--
 arch/x86/kvm/vmx/vmx.h          |  7 +++++
 arch/x86/kvm/x86.c              |  4 +++
 16 files changed, 192 insertions(+), 61 deletions(-)

-- 
2.43.0

* [RFC PATCH 01/18] KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 02/18] KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits Jon Kohler
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Nikolay Borisov

From: Nikolay Borisov <nik.borisov@suse.com>

Those defines are only used in the definition of the various
EPT_VIOLATION_ACC_* macros, which are then used to extract the
respective bits from EPT violation exit qualifications. Remove the
_BIT defines and redefine the remaining macros via the BIT() macro.
No functional changes.
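
A minimal user-space sketch of why the conversion is value-preserving
is shown below. BIT() here is a stand-in for the kernel macro, and
ept_violation_is_fetch() is a hypothetical helper added only to show
how the masks are used.

```c
#include <assert.h>

/* User-space stand-in for the kernel's BIT() macro. */
#define BIT(nr)				(1UL << (nr))

#define EPT_VIOLATION_ACC_READ		BIT(0)
#define EPT_VIOLATION_ACC_WRITE		BIT(1)
#define EPT_VIOLATION_ACC_INSTR		BIT(2)
#define EPT_VIOLATION_GVA_IS_VALID	BIT(7)
#define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)

/* The BIT() form yields the same values as the old (1 << n) form. */
static_assert(EPT_VIOLATION_ACC_READ == (1 << 0), "read");
static_assert(EPT_VIOLATION_ACC_WRITE == (1 << 1), "write");
static_assert(EPT_VIOLATION_ACC_INSTR == (1 << 2), "instruction fetch");
static_assert(EPT_VIOLATION_GVA_IS_VALID == (1 << 7), "GVA valid");
static_assert(EPT_VIOLATION_GVA_TRANSLATED == (1 << 8), "GVA translated");

/* Hypothetical helper: was the faulting access an instruction fetch? */
static int ept_violation_is_fetch(unsigned long exit_qualification)
{
	return (exit_qualification & EPT_VIOLATION_ACC_INSTR) != 0;
}
```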

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250227000705.3199706-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit fa6c8fc2d2673dcaf7333bc35eb759ab7c39b81f)
(cherry picked from commit b55fd5c48d3ec1dbf566937a377817b390ec0768)

---
 arch/x86/include/asm/vmx.h | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f7fd4369b821..aabc223c6498 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -580,18 +580,13 @@ enum vm_entry_failure_code {
 /*
  * Exit Qualifications for EPT Violations
  */
-#define EPT_VIOLATION_ACC_READ_BIT	0
-#define EPT_VIOLATION_ACC_WRITE_BIT	1
-#define EPT_VIOLATION_ACC_INSTR_BIT	2
 #define EPT_VIOLATION_RWX_SHIFT		3
-#define EPT_VIOLATION_GVA_IS_VALID_BIT	7
-#define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
-#define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
-#define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
-#define EPT_VIOLATION_ACC_INSTR		(1 << EPT_VIOLATION_ACC_INSTR_BIT)
+#define EPT_VIOLATION_ACC_READ		BIT(0)
+#define EPT_VIOLATION_ACC_WRITE		BIT(1)
+#define EPT_VIOLATION_ACC_INSTR		BIT(2)
 #define EPT_VIOLATION_RWX_MASK		(VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)
-#define EPT_VIOLATION_GVA_IS_VALID	(1 << EPT_VIOLATION_GVA_IS_VALID_BIT)
-#define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
+#define EPT_VIOLATION_GVA_IS_VALID	BIT(7)
+#define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)
 
 /*
  * Exit Qualifications for NOTIFY VM EXIT
-- 
2.43.0


* [RFC PATCH 02/18] KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 01/18] KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC Jon Kohler
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler, Nikolay Borisov

From: Sean Christopherson <seanjc@google.com>

Define independent macros for the RWX protection bits that are enumerated
via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in
EPT entries via compile-time asserts.  Piggybacking the EPTE defines works
for now, but it creates holes in the EPT_VIOLATION_xxx macros and will
cause headaches if/when KVM emulates Mode-Based Execute Control (MBEC),
or any other feature that introduces additional protection information.
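
The compile-time tie between the two layouts can be tried out in user
space. The sketch below copies the macro values from the diff in this
patch; BIT() is a stand-in for the kernel macro and
ept_violation_prot_bits() is a hypothetical wrapper used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(nr)				(1ULL << (nr))

/* RWX bits as they appear in an EPT entry. */
#define VMX_EPT_READABLE_MASK		BIT(0)
#define VMX_EPT_WRITABLE_MASK		BIT(1)
#define VMX_EPT_EXECUTABLE_MASK		BIT(2)
#define VMX_EPT_RWX_MASK		(VMX_EPT_READABLE_MASK | \
					 VMX_EPT_WRITABLE_MASK | \
					 VMX_EPT_EXECUTABLE_MASK)

/* Protection bits as reported in the EPT violation exit qualification. */
#define EPT_VIOLATION_PROT_READ		BIT(3)
#define EPT_VIOLATION_PROT_WRITE	BIT(4)
#define EPT_VIOLATION_PROT_EXEC		BIT(5)
#define EPT_VIOLATION_PROT_MASK		(EPT_VIOLATION_PROT_READ  | \
					 EPT_VIOLATION_PROT_WRITE | \
					 EPT_VIOLATION_PROT_EXEC)

#define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)

/* Build fails if the shift ever stops mapping RWX onto the PROT bits. */
static_assert(EPT_VIOLATION_RWX_TO_PROT(VMX_EPT_RWX_MASK) ==
	      EPT_VIOLATION_PROT_MASK, "RWX bits must map onto PROT bits");

/* Hypothetical wrapper: protection bits to report for a given EPT entry. */
static uint64_t ept_violation_prot_bits(uint64_t epte)
{
	return EPT_VIOLATION_RWX_TO_PROT(epte);
}
```

Keeping the PROT macros independent means a later bit that does not
follow the "shift by 3" pattern can be added without touching the
RWX mask.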

Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK
so that it doesn't become stale if/when MBEC support is added.

No functional change intended.

Cc: Jon Kohler <jon@nutanix.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 61146f67e4cb67064ce3003d94ee19302d314fff)
(cherry picked from commit 8cddacdb6a6a459c9425b4abd4c982cec89c25e4)

---
 arch/x86/include/asm/vmx.h     | 13 +++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h |  3 +--
 arch/x86/kvm/vmx/vmx.c         |  2 +-
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index aabc223c6498..8707361b24da 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -580,14 +580,23 @@ enum vm_entry_failure_code {
 /*
  * Exit Qualifications for EPT Violations
  */
-#define EPT_VIOLATION_RWX_SHIFT		3
 #define EPT_VIOLATION_ACC_READ		BIT(0)
 #define EPT_VIOLATION_ACC_WRITE		BIT(1)
 #define EPT_VIOLATION_ACC_INSTR		BIT(2)
-#define EPT_VIOLATION_RWX_MASK		(VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)
+#define EPT_VIOLATION_PROT_READ		BIT(3)
+#define EPT_VIOLATION_PROT_WRITE	BIT(4)
+#define EPT_VIOLATION_PROT_EXEC		BIT(5)
+#define EPT_VIOLATION_PROT_MASK		(EPT_VIOLATION_PROT_READ  | \
+					 EPT_VIOLATION_PROT_WRITE | \
+					 EPT_VIOLATION_PROT_EXEC)
 #define EPT_VIOLATION_GVA_IS_VALID	BIT(7)
 #define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)
 
+#define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)
+
+static_assert(EPT_VIOLATION_RWX_TO_PROT(VMX_EPT_RWX_MASK) ==
+	      (EPT_VIOLATION_PROT_READ | EPT_VIOLATION_PROT_WRITE | EPT_VIOLATION_PROT_EXEC));
+
 /*
  * Exit Qualifications for NOTIFY VM EXIT
  */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index ae7d39ff2d07..9bc3fc4a238b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -510,8 +510,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		 * Note, pte_access holds the raw RWX bits from the EPTE, not
 		 * ACC_*_MASK flags!
 		 */
-		walker->fault.exit_qualification |= (pte_access & VMX_EPT_RWX_MASK) <<
-						     EPT_VIOLATION_RWX_SHIFT;
+		walker->fault.exit_qualification |= EPT_VIOLATION_RWX_TO_PROT(pte_access);
 	}
 #endif
 	walker->fault.address = addr;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 87206dabf020..7a98f03ef146 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5831,7 +5831,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
 		      ? PFERR_FETCH_MASK : 0;
 	/* ept page table entry is present? */
-	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+	error_code |= (exit_qualification & EPT_VIOLATION_PROT_MASK)
 		      ? PFERR_PRESENT_MASK : 0;
 
 	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
-- 
2.43.0


* [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 01/18] KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 02/18] KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 18:08   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper Jon Kohler
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Add 'enable_pt_guest_exec_control' module parameter to x86 code, with
default value false. This parameter will control enablement for
exposing Intel Mode-Based Execute Control (aka MBEC).

Place parameter in x86 common code as, notionally, AMD has a similar
feature called Guest Mode Execute Trap (GMET), which may want to build
off of this parameter in the future, similar to how 'enable_apicv' is
shared across both Intel APICv and AMD AVIC.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/x86.c              | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7cf2025a64a0..fd37dad38670 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1883,6 +1883,7 @@ struct kvm_arch_async_pf {
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
+extern bool __read_mostly enable_pt_guest_exec_control;
 extern struct kvm_x86_ops kvm_x86_ops;
 
 #define kvm_x86_call(func) static_call(kvm_x86_##func)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7bae9e9cc14e..4b2fbb9088ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -197,6 +197,10 @@ module_param(eager_page_split, bool, 0644);
 static bool __read_mostly mitigate_smt_rsb;
 module_param(mitigate_smt_rsb, bool, 0444);
 
+bool __read_mostly enable_pt_guest_exec_control;
+EXPORT_SYMBOL_GPL(enable_pt_guest_exec_control);
+module_param(enable_pt_guest_exec_control, bool, 0444);
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
-- 
2.43.0


* [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (2 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 18:14   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch Jon Kohler
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Mickaël Salaün, Jon Kohler

From: Mickaël Salaün <mic@digikod.net>

Add 'cpu_has_vmx_mbec' helper to determine whether the secondary
processor-based execution controls reported by hardware expose Intel
Mode-Based Execute Control, which is secondary execution control
bit 22.
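
For readers without the kernel tree at hand, the check boils down to
testing bit 22 of the allowed-1 settings. The sketch below assumes the
raw value of the IA32_VMX_PROCBASED_CTLS2 capability MSR is already
available, with the allowed-1 settings in its upper 32 bits;
ctls2_allows_mbec() is a hypothetical name and is not part of this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	(1u << 22)

static_assert(SECONDARY_EXEC_MODE_BASED_EPT_EXEC == 0x400000u, "bit 22");

/*
 * Hypothetical helper: given the raw IA32_VMX_PROCBASED_CTLS2 value,
 * report whether the CPU allows the MBEC control to be set.
 */
static bool ctls2_allows_mbec(uint64_t procbased_ctls2_msr)
{
	/* Upper half: controls that are allowed to be 1. */
	uint32_t allowed1 = (uint32_t)(procbased_ctls2_msr >> 32);

	return (allowed1 & SECONDARY_EXEC_MODE_BASED_EPT_EXEC) != 0;
}
```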

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/vmx/capabilities.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..f83592272920 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -253,6 +253,12 @@ static inline bool cpu_has_vmx_xsaves(void)
 		SECONDARY_EXEC_ENABLE_XSAVES;
 }
 
+static inline bool cpu_has_vmx_mbec(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+}
+
 static inline bool cpu_has_vmx_waitpkg(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
-- 
2.43.0


* [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (3 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-22  6:27   ` Chao Gao
  2025-05-12 18:15   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic Jon Kohler
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Add bool for pt_guest_exec_control to kvm_vcpu_arch, to be used for
runtime checks for Intel Mode-Based Execute Control (MBEC) and
AMD Guest Mode Execute Trap (GMET).

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/include/asm/kvm_host.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fd37dad38670..192233eb557a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -856,6 +856,8 @@ struct kvm_vcpu_arch {
 	struct kvm_hypervisor_cpuid kvm_cpuid;
 	bool is_amd_compatible;
 
+	bool pt_guest_exec_control;
+
 	/*
 	 * FIXME: Drop this macro and use KVM_NR_GOVERNED_FEATURES directly
 	 * when "struct kvm_vcpu_arch" is no longer defined in an
-- 
2.43.0


* [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (4 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-22  7:06   ` Chao Gao
  2025-05-12 18:23   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 07/18] KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK Jon Kohler
                   ` (14 subsequent siblings)
  20 siblings, 2 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Add logic to enable / disable Intel Mode-Based Execute Control (MBEC)
based on the conditions listed below.

MBEC depends on:
- User space exposing secondary execution control bit 22
- Extended Page Tables (EPT)
- The KVM module parameter `enable_pt_guest_exec_control`

If any of these conditions are not met, MBEC will be disabled
accordingly.

Store runtime enablement within `kvm_vcpu_arch.pt_guest_exec_control`.
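
Taken together, the conditions above and the hardware check in
vmx_hardware_setup() amount to a logical AND. The following is a
simplified user-space model, not the code in this patch; the function
name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	(1u << 22)

static_assert(SECONDARY_EXEC_MODE_BASED_EPT_EXEC == 0x400000u, "bit 22");

/*
 * Hypothetical model of the enablement decision:
 *   hw_has_mbec  - CPU reports the MBEC control as settable
 *   ept_enabled  - EPT is in use
 *   module_param - enable_pt_guest_exec_control as set by the admin
 *   user_ctls2   - secondary execution controls exposed by user space
 */
static bool mbec_effective(bool hw_has_mbec, bool ept_enabled,
			   bool module_param, uint32_t user_ctls2)
{
	/* Hardware setup clears the parameter without MBEC or EPT. */
	bool param = module_param && hw_has_mbec && ept_enabled;

	return param && (user_ctls2 & SECONDARY_EXEC_MODE_BASED_EPT_EXEC);
}
```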

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
 arch/x86/kvm/vmx/vmx.h |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7a98f03ef146..116910159a3f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			return -EIO;
 
 		vmx_cap->ept = 0;
+		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
 		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
 	}
 	if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
@@ -4641,11 +4642,15 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
 	if (!enable_ept) {
 		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
 		exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
 		enable_unrestricted_guest = 0;
 	}
 	if (!enable_unrestricted_guest)
 		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+	if (!enable_pt_guest_exec_control)
+		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+
 	if (kvm_pause_in_guest(vmx->vcpu.kvm))
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 	if (!kvm_vcpu_apicv_active(vcpu))
@@ -4770,6 +4775,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		if (vmx->ve_info)
 			vmcs_write64(VE_INFORMATION_ADDRESS,
 				     __pa(vmx->ve_info));
+
+		vmx->vcpu.arch.pt_guest_exec_control =
+			enable_pt_guest_exec_control && vmx_has_mbec(vmx);
 	}
 
 	if (cpu_has_tertiary_exec_ctrls())
@@ -8472,6 +8480,9 @@ __init int vmx_hardware_setup(void)
 	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
 		enable_unrestricted_guest = 0;
 
+	if (!cpu_has_vmx_mbec() || !enable_ept)
+		enable_pt_guest_exec_control = false;
+
 	if (!cpu_has_vmx_flexpriority())
 		flexpriority_enabled = 0;
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d1e537bf50ea..9f4ae3139a90 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -580,6 +580,7 @@ static inline u8 vmx_get_rvi(void)
 	 SECONDARY_EXEC_ENABLE_VMFUNC |					\
 	 SECONDARY_EXEC_BUS_LOCK_DETECTION |				\
 	 SECONDARY_EXEC_NOTIFY_VM_EXITING |				\
+	 SECONDARY_EXEC_MODE_BASED_EPT_EXEC |				\
 	 SECONDARY_EXEC_ENCLS_EXITING |					\
 	 SECONDARY_EXEC_EPT_VIOLATION_VE)
 
@@ -721,6 +722,12 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
 		SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
 }
 
+static inline bool vmx_has_mbec(struct vcpu_vmx *vmx)
+{
+	return secondary_exec_controls_get(vmx) &
+		SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+}
+
 static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
 {
 	if (!enable_ept)
-- 
2.43.0


* [RFC PATCH 07/18] KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (5 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 08/18] KVM: x86/mmu: Remove SPTE_PERM_MASK Jon Kohler
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Mickaël Salaün, Jon Kohler

From: Mickaël Salaün <mic@digikod.net>

EPT bit 10 is used to denote user executable pages, for use with Intel
MBEC.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/include/asm/vmx.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8707361b24da..d7ab0ad63be6 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -537,6 +537,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT			(1ull << 8)
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
+#define VMX_EPT_USER_EXECUTABLE_MASK		(1ull << 10)
 #define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
 #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
 						 VMX_EPT_WRITABLE_MASK |       \
-- 
2.43.0


* [RFC PATCH 08/18] KVM: x86/mmu: Remove SPTE_PERM_MASK
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (6 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 07/18] KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-03-13 20:36 ` [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role Jon Kohler
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

SPTE_PERM_MASK is no longer referenced by anything in the kernel.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/mmu/spte.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 2cb816ea2430..71d6fe28fafc 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -42,9 +42,6 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
 #define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #endif
 
-#define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
-			| shadow_x_mask | shadow_nx_mask | shadow_me_mask)
-
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
 #define ACC_USER_MASK    PT_USER_MASK
-- 
2.43.0


* [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (7 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 08/18] KVM: x86/mmu: Remove SPTE_PERM_MASK Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 18:32   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits Jon Kohler
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Mickaël Salaün, Jon Kohler, Sergey Dyasli

From: Mickaël Salaün <mic@digikod.net>

Extend the access bitfield in kvm_mmu_page_role from 3 bits to 4 bits,
where the 4th bit will be used to track user executable pages with
Intel MBEC.
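
To make the new 4-bit layout concrete, here is a small user-space
sketch that renders an access value using the same letters as the
tracepoint table in this patch ('X' for ACC_EXEC_MASK, 'x' for
ACC_USER_EXEC_MASK). The mask values mirror spte.h; access_to_str() is
a hypothetical helper and does not exist in the kernel.

```c
#include <assert.h>
#include <string.h>

#define ACC_EXEC_MASK		(1u << 0)
#define ACC_WRITE_MASK		(1u << 1)
#define ACC_USER_MASK		(1u << 2)
#define ACC_USER_EXEC_MASK	(1u << 3)
#define ACC_ALL			(ACC_EXEC_MASK | ACC_WRITE_MASK | \
				 ACC_USER_MASK | ACC_USER_EXEC_MASK)

static_assert(ACC_ALL == 0xf, "access must fit in the 4-bit role field");

/* Hypothetical helper: format access bits as "xwuX" with '-' for unset. */
static void access_to_str(unsigned int access, char out[5])
{
	out[0] = (access & ACC_USER_EXEC_MASK) ? 'x' : '-';
	out[1] = (access & ACC_WRITE_MASK) ? 'w' : '-';
	out[2] = (access & ACC_USER_MASK) ? 'u' : '-';
	out[3] = (access & ACC_EXEC_MASK) ? 'X' : '-';
	out[4] = '\0';
}
```

Generating the string from the bits avoids having to keep a 16-entry
lookup table in sync by hand.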

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Signed-off-by: Sergey Dyasli <sergey.dyasli@nutanix.com>

---
 arch/x86/include/asm/kvm_host.h | 10 +++++-----
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/mmu/mmutrace.h     |  8 +++++++-
 arch/x86/kvm/mmu/spte.h         |  4 +++-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 192233eb557a..e8193de802a7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -312,10 +312,10 @@ struct kvm_kernel_irq_routing_entry;
  * the number of unique SPs that can theoretically be created is 2^n, where n
  * is the number of bits that are used to compute the role.
  *
- * But, even though there are 19 bits in the mask below, not all combinations
+ * But, even though there are 20 bits in the mask below, not all combinations
  * of modes and flags are possible:
  *
- *   - invalid shadow pages are not accounted, so the bits are effectively 18
+ *   - invalid shadow pages are not accounted, so the bits are effectively 19
  *
  *   - quadrant will only be used if has_4_byte_gpte=1 (non-PAE paging);
  *     execonly and ad_disabled are only used for nested EPT which has
@@ -330,7 +330,7 @@ struct kvm_kernel_irq_routing_entry;
  *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
  * Therefore, the maximum number of possible upper-level shadow pages for a
- * single gfn is a bit less than 2^13.
+ * single gfn is a bit less than 2^14.
  */
 union kvm_mmu_page_role {
 	u32 word;
@@ -339,7 +339,7 @@ union kvm_mmu_page_role {
 		unsigned has_4_byte_gpte:1;
 		unsigned quadrant:2;
 		unsigned direct:1;
-		unsigned access:3;
+		unsigned access:4;
 		unsigned invalid:1;
 		unsigned efer_nx:1;
 		unsigned cr0_wp:1;
@@ -348,7 +348,7 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned passthrough:1;
-		unsigned :5;
+		unsigned:4;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e853a5fc867..791413b93589 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1915,7 +1915,7 @@ static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	 */
 	const union kvm_mmu_page_role sync_role_ign = {
 		.level = 0xf,
-		.access = 0x7,
+		.access = 0xf,
 		.quadrant = 0x3,
 		.passthrough = 0x1,
 	};
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index f35a830ce469..2511fe64ca01 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -22,10 +22,16 @@
 	__entry->root_count = sp->root_count;		\
 	__entry->unsync = sp->unsync;
 
+/*
+ * X == ACC_EXEC_MASK: executable without guest_exec_control and only
+ *                     supervisor execute with guest exec control
+ * x == ACC_USER_EXEC_MASK: user execute with guest exec control
+ */
 #define KVM_MMU_PAGE_PRINTK() ({				        \
 	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
 	static const char *access_str[] = {			        \
-		"---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"  \
+		"----", "---X", "-w--", "-w-X", "--u-", "--uX", "-wu-", "-wuX", \
+		"x---", "x--X", "xw--", "xw-X", "x-u-", "x-uX", "xwu-", "xwuX"	\
 	};							        \
 	union kvm_mmu_page_role role;				        \
 								        \
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 71d6fe28fafc..d9e22133b6d0 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -45,7 +45,9 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
 #define ACC_USER_MASK    PT_USER_MASK
-#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
+#define ACC_USER_EXEC_MASK (1ULL << 3)
+#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK | \
+			  ACC_USER_EXEC_MASK)
 
 /* The mask for the R/X bits in EPT PTEs */
 #define SPTE_EPT_READABLE_MASK			0x1ull
-- 
2.43.0
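
Not part of the patch: as a sanity check on the 16-entry access_str
table, the string for any 4-bit role.access value can be derived
directly from the bit layout. This is a minimal userspace sketch,
assuming bit 0 is ACC_EXEC_MASK ('X'), bit 1 is write, bit 2 is user,
and bit 3 is ACC_USER_EXEC_MASK ('x'). The SK_* names and
access_to_str() are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <string.h>

/* Local stand-ins for the ACC_* bits (assumed values, see lead-in). */
#define SK_ACC_EXEC      (1u << 0)  /* printed as 'X' */
#define SK_ACC_WRITE     (1u << 1)  /* printed as 'w' */
#define SK_ACC_USER      (1u << 2)  /* printed as 'u' */
#define SK_ACC_USER_EXEC (1u << 3)  /* printed as 'x' */

/* Hypothetical helper: render a 4-bit access value in "xwuX" order. */
static const char *access_to_str(unsigned int access, char out[5])
{
	out[0] = (access & SK_ACC_USER_EXEC) ? 'x' : '-';
	out[1] = (access & SK_ACC_WRITE)     ? 'w' : '-';
	out[2] = (access & SK_ACC_USER)      ? 'u' : '-';
	out[3] = (access & SK_ACC_EXEC)      ? 'X' : '-';
	out[4] = '\0';
	return out;
}
```

Each table entry at index i should equal access_to_str(i, buf), which
makes a single mistyped entry easy to spot.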



* [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (8 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 18:37   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC Jon Kohler
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Define macros for READ, WRITE, EXEC protection bits, to be used by
MBEC-enabled systems.

No functional change intended.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
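Not part of the patch: a userspace sketch of what the per-bit macros
compute. EPTE permission bits 0-2 (R/W/X) are shifted left by 3 to land
on exit-qualification bits 3-5. The SK_* names and rwx_to_prot() are
hypothetical stand-ins for the VMX_EPT_*_MASK and EPT_VIOLATION_PROT_*
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins: EPTE permission bits and exit-qualification bits. */
#define SK_EPT_R  (1ull << 0)
#define SK_EPT_W  (1ull << 1)
#define SK_EPT_X  (1ull << 2)
#define SK_PROT_R (1ull << 3)
#define SK_PROT_W (1ull << 4)
#define SK_PROT_X (1ull << 5)

/* OR of the three single-bit conversions; other EPTE bits are ignored. */
static uint64_t rwx_to_prot(uint64_t epte)
{
	return ((epte & SK_EPT_R) << 3) |
	       ((epte & SK_EPT_W) << 3) |
	       ((epte & SK_EPT_X) << 3);
}
```

Because each bit is masked before the shift, unrelated EPTE bits never
leak into the exit qualification.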
 arch/x86/include/asm/vmx.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index d7ab0ad63be6..ffc90d672b5d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -593,8 +593,17 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_GVA_IS_VALID	BIT(7)
 #define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)
 
+#define EPT_VIOLATION_READ_TO_PROT(__epte) (((__epte) & VMX_EPT_READABLE_MASK) << 3)
+#define EPT_VIOLATION_WRITE_TO_PROT(__epte) (((__epte) & VMX_EPT_WRITABLE_MASK) << 3)
+#define EPT_VIOLATION_EXEC_TO_PROT(__epte) (((__epte) & VMX_EPT_EXECUTABLE_MASK) << 3)
 #define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)
 
+static_assert(EPT_VIOLATION_READ_TO_PROT(VMX_EPT_READABLE_MASK) ==
+	      (EPT_VIOLATION_PROT_READ));
+static_assert(EPT_VIOLATION_WRITE_TO_PROT(VMX_EPT_WRITABLE_MASK) ==
+	      (EPT_VIOLATION_PROT_WRITE));
+static_assert(EPT_VIOLATION_EXEC_TO_PROT(VMX_EPT_EXECUTABLE_MASK) ==
+	      (EPT_VIOLATION_PROT_EXEC));
 static_assert(EPT_VIOLATION_RWX_TO_PROT(VMX_EPT_RWX_MASK) ==
 	      (EPT_VIOLATION_PROT_READ | EPT_VIOLATION_PROT_WRITE | EPT_VIOLATION_PROT_EXEC));
 
-- 
2.43.0



* [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (9 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 18:54   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask Jon Kohler
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Mickaël Salaün, Jon Kohler

From: Mickaël Salaün <mic@digikod.net>

Add EPT_VIOLATION_PROT_USER_EXEC (6) to reflect the user executable
permissions of a given address when Intel MBEC is enabled.

Refactor usage of EPT_VIOLATION_RWX_TO_PROT to understand all of the
specific bits that are now possible with MBEC.

Intel SDM 'Exit Qualification for EPT Violations' states the following
for Bit 6.
  If the “mode-based execute control” VM-execution control is 0, the
  value of this bit is undefined. If that control is 1, this bit is
  the logical-AND of bit 10 in the EPT paging-structure entries used
  to translate the guest-physical address of the access causing the
  EPT violation. In this case, it indicates whether the guest-physical
  address was executable for user-mode linear addresses.

  Bit 6 is cleared to 0 if (1) the “mode-based execute control”
  VM-execution control is 1; and (2) either (a) any of EPT
  paging-structure entries used to translate the guest-physical address
  of the access causing the EPT violation is not present; or
  (b) 4-level EPT is in use and the guest-physical address sets any
  bits in the range 51:48.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>

---
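Not part of the patch: a userspace sketch of the two behaviours
described above, assuming EPTE bit 10 is the user-execute bit and
exit-qualification bit 6 reports it. EPTE bit 10 shifted right by 4
lands on bit 6, and bit 6 only takes part in the "entry present" check
when MBEC is enabled, because the SDM leaves it undefined otherwise.
All SK_* names and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_EPT_RWX   (7ull << 0)   /* EPTE bits 0-2 */
#define SK_EPT_UX    (1ull << 10)  /* EPTE bit 10: user-mode execute */
#define SK_PROT_RWX  (7ull << 3)   /* exit-qualification bits 3-5 */
#define SK_PROT_UX   (1ull << 6)   /* exit-qualification bit 6 */

/* Build exit-qualification bits 3-6 from raw EPTE permission bits. */
static uint64_t epte_to_prot(uint64_t epte, bool mbec)
{
	uint64_t prot = (epte & SK_EPT_RWX) << 3;

	if (mbec)
		prot |= (epte & SK_EPT_UX) >> 4;
	return prot;
}

/* Was the faulting translation present? Bit 6 only counts with MBEC. */
static bool ept_entry_present(uint64_t exit_qual, bool mbec)
{
	uint64_t mask = SK_PROT_RWX;

	if (mbec)
		mask |= SK_PROT_UX;
	return (exit_qual & mask) != 0;
}
```

A user-execute-only mapping is therefore treated as present only when
MBEC is on; with MBEC off, bit 6 is ignored entirely.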
 arch/x86/include/asm/vmx.h     |  7 ++++---
 arch/x86/kvm/mmu/paging_tmpl.h | 15 ++++++++++++---
 arch/x86/kvm/vmx/vmx.c         |  7 +++++--
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index ffc90d672b5d..84c5be416f5c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -587,6 +587,7 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_PROT_READ		BIT(3)
 #define EPT_VIOLATION_PROT_WRITE	BIT(4)
 #define EPT_VIOLATION_PROT_EXEC		BIT(5)
+#define EPT_VIOLATION_PROT_USER_EXEC	BIT(6)
 #define EPT_VIOLATION_PROT_MASK		(EPT_VIOLATION_PROT_READ  | \
 					 EPT_VIOLATION_PROT_WRITE | \
 					 EPT_VIOLATION_PROT_EXEC)
@@ -596,7 +597,7 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_READ_TO_PROT(__epte) (((__epte) & VMX_EPT_READABLE_MASK) << 3)
 #define EPT_VIOLATION_WRITE_TO_PROT(__epte) (((__epte) & VMX_EPT_WRITABLE_MASK) << 3)
 #define EPT_VIOLATION_EXEC_TO_PROT(__epte) (((__epte) & VMX_EPT_EXECUTABLE_MASK) << 3)
-#define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)
+#define EPT_VIOLATION_USER_EXEC_TO_PROT(__epte) (((__epte) & VMX_EPT_USER_EXECUTABLE_MASK) >> 4)
 
 static_assert(EPT_VIOLATION_READ_TO_PROT(VMX_EPT_READABLE_MASK) ==
 	      (EPT_VIOLATION_PROT_READ));
@@ -604,8 +605,8 @@ static_assert(EPT_VIOLATION_WRITE_TO_PROT(VMX_EPT_WRITABLE_MASK) ==
 	      (EPT_VIOLATION_PROT_WRITE));
 static_assert(EPT_VIOLATION_EXEC_TO_PROT(VMX_EPT_EXECUTABLE_MASK) ==
 	      (EPT_VIOLATION_PROT_EXEC));
-static_assert(EPT_VIOLATION_RWX_TO_PROT(VMX_EPT_RWX_MASK) ==
-	      (EPT_VIOLATION_PROT_READ | EPT_VIOLATION_PROT_WRITE | EPT_VIOLATION_PROT_EXEC));
+static_assert(EPT_VIOLATION_USER_EXEC_TO_PROT(VMX_EPT_USER_EXECUTABLE_MASK) ==
+	      (EPT_VIOLATION_PROT_USER_EXEC));
 
 /*
  * Exit Qualifications for NOTIFY VM EXIT
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 9bc3fc4a238b..a3a5cacda614 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -181,8 +181,9 @@ static inline unsigned FNAME(gpte_access)(u64 gpte)
 	unsigned access;
 #if PTTYPE == PTTYPE_EPT
 	access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
-		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
-		((gpte & VMX_EPT_READABLE_MASK) ? ACC_USER_MASK : 0);
+		 ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
+		 ((gpte & VMX_EPT_USER_EXECUTABLE_MASK) ? ACC_USER_EXEC_MASK : 0) |
+		 ((gpte & VMX_EPT_READABLE_MASK) ? ACC_USER_MASK : 0);
 #else
 	BUILD_BUG_ON(ACC_EXEC_MASK != PT_PRESENT_MASK);
 	BUILD_BUG_ON(ACC_EXEC_MASK != 1);
@@ -510,7 +511,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		 * Note, pte_access holds the raw RWX bits from the EPTE, not
 		 * ACC_*_MASK flags!
 		 */
-		walker->fault.exit_qualification |= EPT_VIOLATION_RWX_TO_PROT(pte_access);
+		walker->fault.exit_qualification |=
+			EPT_VIOLATION_READ_TO_PROT(pte_access);
+		walker->fault.exit_qualification |=
+			EPT_VIOLATION_WRITE_TO_PROT(pte_access);
+		walker->fault.exit_qualification |=
+			EPT_VIOLATION_EXEC_TO_PROT(pte_access);
+		if (vcpu->arch.pt_guest_exec_control)
+			walker->fault.exit_qualification |=
+				EPT_VIOLATION_USER_EXEC_TO_PROT(pte_access);
 	}
 #endif
 	walker->fault.address = addr;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 116910159a3f..0aadfa924045 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5809,7 +5809,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
-	unsigned long exit_qualification;
+	unsigned long exit_qualification, rwx_mask;
 	gpa_t gpa;
 	u64 error_code;
 
@@ -5839,7 +5839,10 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
 		      ? PFERR_FETCH_MASK : 0;
 	/* ept page table entry is present? */
-	error_code |= (exit_qualification & EPT_VIOLATION_PROT_MASK)
+	rwx_mask = EPT_VIOLATION_PROT_MASK;
+	if (vcpu->arch.pt_guest_exec_control)
+		rwx_mask |= EPT_VIOLATION_PROT_USER_EXEC;
+	error_code |= (exit_qualification & rwx_mask)
 		      ? PFERR_PRESENT_MASK : 0;
 
 	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
-- 
2.43.0



* [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (10 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-23  3:06   ` Chao Gao
  2025-05-12 19:13   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC Jon Kohler
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler, Sergey Dyasli

Add shadow_ux_mask to spte, to keep track of user executable pages
used by Intel MBEC.

Since the masks are set up outside of vCPU creation, plumb the
system-level enablement from the VMX code into kvm_mmu_set_ept_masks().

Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Signed-off-by: Sergey Dyasli <sergey.dyasli@nutanix.com>

---
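Not part of the patch: a userspace model of the permission-related part
of kvm_mmu_set_ept_masks() after this change. The struct, the SK_*
constants and sk_set_ept_masks() are hypothetical; the bit positions
(R = bit 0, X = bit 2, user-exec = bit 10, suppress-#VE = bit 63) are
assumed from the definitions used elsewhere in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the VMX_EPT_* bits referenced by this patch. */
#define SK_EPT_READABLE     (1ull << 0)
#define SK_EPT_EXECUTABLE   (1ull << 2)
#define SK_EPT_USER_EXEC    (1ull << 10)
#define SK_EPT_SUPPRESS_VE  (1ull << 63)

struct sk_ept_masks {
	uint64_t present_mask;
	uint64_t x_mask;
	uint64_t ux_mask;
};

/* Hypothetical model of the mask setup, reduced to three masks. */
static struct sk_ept_masks sk_set_ept_masks(bool has_exec_only,
					    bool has_guest_exec_ctrl)
{
	struct sk_ept_masks m;

	m.present_mask = (has_exec_only ? 0ull : SK_EPT_READABLE) |
			 SK_EPT_SUPPRESS_VE;
	m.x_mask = SK_EPT_EXECUTABLE;
	/* With MBEC off the mask is zero, so OR-ing it in is a no-op. */
	m.ux_mask = has_guest_exec_ctrl ? SK_EPT_USER_EXEC : 0ull;
	return m;
}
```

Keeping the mask at zero when MBEC is off is what lets callers such as
make_nonleaf_spte() OR it in unconditionally.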
 arch/x86/kvm/mmu.h      |  3 ++-
 arch/x86/kvm/mmu/spte.c | 21 ++++++++++++++++++---
 arch/x86/kvm/mmu/spte.h |  1 +
 arch/x86/kvm/vmx/vmx.c  |  3 ++-
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9dc5dd43ae7f..d10c37db7653 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -79,7 +79,8 @@ u8 kvm_mmu_get_max_tdp_level(void);
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only,
+			   bool has_guest_exec_ctrl);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 8f7eb3ad88fc..6f4994b3e6d0 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -28,6 +28,7 @@ u64 __read_mostly shadow_host_writable_mask;
 u64 __read_mostly shadow_mmu_writable_mask;
 u64 __read_mostly shadow_nx_mask;
 u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
+u64 __read_mostly shadow_ux_mask;
 u64 __read_mostly shadow_user_mask;
 u64 __read_mostly shadow_accessed_mask;
 u64 __read_mostly shadow_dirty_mask;
@@ -313,8 +314,14 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		 * the page executable as the NX hugepage mitigation no longer
 		 * applies.
 		 */
-		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm))
+		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm)) {
 			child_spte = make_spte_executable(child_spte);
+			// TODO: For LKML: switch to vcpu->arch.pt_guest_exec_control?
+			// Open to suggestions on how best to toggle this.
+			if (enable_pt_guest_exec_control &&
+			    role.access & ACC_USER_EXEC_MASK)
+				child_spte |= shadow_ux_mask;
+		}
 	}
 
 	return child_spte;
@@ -326,7 +333,7 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 	u64 spte = SPTE_MMU_PRESENT_MASK;
 
 	spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
-		shadow_user_mask | shadow_x_mask | shadow_me_value;
+		shadow_user_mask | shadow_x_mask | shadow_ux_mask | shadow_me_value;
 
 	if (ad_disabled)
 		spte |= SPTE_TDP_AD_DISABLED;
@@ -420,7 +427,8 @@ void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);
 
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only,
+			   bool has_guest_exec_ctrl)
 {
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
 	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
@@ -428,8 +436,14 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
 	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+	// For LKML Review:
+	// Do we need to modify shadow_present_mask in the MBEC case?
 	shadow_present_mask	=
 		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
+
+	shadow_ux_mask		=
+		has_guest_exec_ctrl ? VMX_EPT_USER_EXECUTABLE_MASK : 0ull;
+
 	/*
 	 * EPT overrides the host MTRRs, and so KVM must program the desired
 	 * memtype directly into the SPTEs.  Note, this mask is just the mask
@@ -484,6 +498,7 @@ void kvm_mmu_reset_all_pte_masks(void)
 	shadow_dirty_mask	= PT_DIRTY_MASK;
 	shadow_nx_mask		= PT64_NX_MASK;
 	shadow_x_mask		= 0;
+	shadow_ux_mask		= 0;
 	shadow_present_mask	= PT_PRESENT_MASK;
 
 	/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d9e22133b6d0..dc2f0dc9c46e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -171,6 +171,7 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
+extern u64 __read_mostly shadow_ux_mask;
 extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0aadfa924045..d16e3f170258 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8544,7 +8544,8 @@ __init int vmx_hardware_setup(void)
 
 	if (enable_ept)
 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
+				      cpu_has_vmx_ept_execute_only(),
+				      enable_pt_guest_exec_control);
 
 	/*
 	 * Setup shadow_me_value/shadow_me_mask to include MKTME KeyID
-- 
2.43.0



* [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (11 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-23  5:37   ` Chao Gao
  2025-05-12 19:37   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte " Jon Kohler
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Adjust the SPTE_MMIO_ALLOWED_MASK and associated values to make these
masks aware of PTE Bit 10, to be used by Intel MBEC.

Intel SDM 30.3.3.1 EPT Misconfigurations states:
  An EPT misconfiguration occurs if translation of a guest-physical
  address encounters an EPT paging-structure entry that meets any of
  the following conditions:
   - Bit 0 of the entry is clear (indicating that data reads are not
     allowed) and any of the following hold:
     — Bit 1 is set (indicating that data writes are allowed).
     — The processor does not support execute-only translations and
       either of the following hold:
       - Bit 2 is set (indicating that instruction fetches are allowed)
         Note: If the “mode-based execute control for EPT” VM-execution
         control is 1, setting bit 2 indicates that instruction fetches
         are allowed from supervisor-mode linear addresses.
       - The “mode-based execute control for EPT” VM-execution control
         is 1 and bit 10 is set (indicating that instruction fetches
         are allowed from user-mode linear addresses).

For LKML Review:
SDM 30.3.3.1 also states that "Software should read the VMX capability
MSR IA32_VMX_EPT_VPID_CAP to determine whether execute-only
translations are supported (see Appendix A.10)." A.10 indicates that
this is specified by bit 0; if bit 0 is 1, then the processor supports
execute-only translations by EPT.

Searching around a bit, it looks like this bit is checked by
vmx/capabilities.h:cpu_has_vmx_ept_execute_only(), which is used only
in kvm/vmx/vmx.c:vmx_hardware_setup(), passed as the has_exec_only
argument to kvm_mmu_set_ept_masks(), which uses it to set
shadow_present_mask.

I'm not sure if this actually matters for this change(?), but thought
it was at least worth surfacing for others to consider.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
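Not part of the patch: the SDM conditions quoted in the changelog can
be written as a small predicate. This userspace sketch covers only the
permission-bit conditions quoted above (the SDM lists further
misconfiguration causes that are not modelled here). The SK_* names
and ept_perm_misconfig() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_EPT_R   (1ull << 0)
#define SK_EPT_W   (1ull << 1)
#define SK_EPT_X   (1ull << 2)
#define SK_EPT_UX  (1ull << 10)

/* True if the R/W/X/UX combination alone is an EPT misconfiguration. */
static bool ept_perm_misconfig(uint64_t epte, bool exec_only_supported,
			       bool mbec)
{
	if (epte & SK_EPT_R)
		return false;		/* readable: none of these apply */
	if (epte & SK_EPT_W)
		return true;		/* write without read */
	if (exec_only_supported)
		return false;		/* execute-only is legal here */
	return (epte & SK_EPT_X) || (mbec && (epte & SK_EPT_UX));
}
```

Under this model a value with W set and R clear is a misconfiguration
regardless of the X and UX bits, so adding bit 10 to
VMX_EPT_MISCONFIG_WX_VALUE should not change whether it triggers.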
 arch/x86/include/asm/vmx.h |  6 ++++--
 arch/x86/kvm/mmu/spte.h    | 13 +++++++------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 84c5be416f5c..961d37e108b5 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -541,7 +541,8 @@ enum vmcs_field {
 #define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
 #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
 						 VMX_EPT_WRITABLE_MASK |       \
-						 VMX_EPT_EXECUTABLE_MASK)
+						 VMX_EPT_EXECUTABLE_MASK |     \
+						 VMX_EPT_USER_EXECUTABLE_MASK)
 #define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
 
 static inline u8 vmx_eptp_page_walk_level(u64 eptp)
@@ -558,7 +559,8 @@ static inline u8 vmx_eptp_page_walk_level(u64 eptp)
 
 /* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
 #define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
-						 VMX_EPT_EXECUTABLE_MASK)
+						 VMX_EPT_EXECUTABLE_MASK |     \
+						 VMX_EPT_USER_EXECUTABLE_MASK)
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index dc2f0dc9c46e..1f7b388a56aa 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -98,11 +98,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
 /*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is an 18 bit subset of
  * the memslots generation and is derived as follows:
  *
- * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 0-6 of the MMIO generation are propagated to spte bits 3-9
+ * Bits 7-17 of the MMIO generation are propagated to spte bits 52-62
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -113,7 +113,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
  */
 
 #define MMIO_SPTE_GEN_LOW_START		3
-#define MMIO_SPTE_GEN_LOW_END		10
+#define MMIO_SPTE_GEN_LOW_END		9
 
 #define MMIO_SPTE_GEN_HIGH_START	52
 #define MMIO_SPTE_GEN_HIGH_END		62
@@ -135,7 +135,8 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
  * and so they're off-limits for generation; additional checks ensure the mask
  * doesn't overlap legal PA bits), and bit 63 (carved out for future usage).
  */
-#define SPTE_MMIO_ALLOWED_MASK (BIT_ULL(63) | GENMASK_ULL(51, 12) | GENMASK_ULL(2, 0))
+#define SPTE_MMIO_ALLOWED_MASK (BIT_ULL(63) | GENMASK_ULL(51, 12) | \
+				BIT_ULL(10) | GENMASK_ULL(2, 0))
 static_assert(!(SPTE_MMIO_ALLOWED_MASK &
 		(SPTE_MMU_PRESENT_MASK | MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));
 
@@ -143,7 +144,7 @@ static_assert(!(SPTE_MMIO_ALLOWED_MASK &
 #define MMIO_SPTE_GEN_HIGH_BITS		(MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
 
 /* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 7 && MMIO_SPTE_GEN_HIGH_BITS == 11);
 
 #define MMIO_SPTE_GEN_LOW_SHIFT		(MMIO_SPTE_GEN_LOW_START - 0)
 #define MMIO_SPTE_GEN_HIGH_SHIFT	(MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
-- 
2.43.0
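
Not part of the patch: a userspace sketch of the resulting 18-bit MMIO
generation layout, with 7 low bits in SPTE bits 3-9 and 11 high bits in
SPTE bits 52-62, leaving bit 10 free. The SK_* macros and the two
helpers are hypothetical and only mirror the START/END constants shown
in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define SK_GENMASK_ULL(h, l) (((~0ull) >> (63 - (h))) & ((~0ull) << (l)))

#define SK_GEN_LOW_START   3
#define SK_GEN_LOW_END     9
#define SK_GEN_HIGH_START  52
#define SK_GEN_HIGH_END    62

#define SK_GEN_LOW_BITS    (SK_GEN_LOW_END - SK_GEN_LOW_START + 1)
#define SK_GEN_LOW_MASK    SK_GENMASK_ULL(SK_GEN_LOW_END, SK_GEN_LOW_START)
#define SK_GEN_HIGH_MASK   SK_GENMASK_ULL(SK_GEN_HIGH_END, SK_GEN_HIGH_START)
#define SK_GEN_LOW_SHIFT   SK_GEN_LOW_START
#define SK_GEN_HIGH_SHIFT  (SK_GEN_HIGH_START - SK_GEN_LOW_BITS)

/* Scatter an 18-bit generation into the two SPTE bit ranges. */
static uint64_t sk_gen_to_spte(uint64_t gen)
{
	return ((gen << SK_GEN_LOW_SHIFT) & SK_GEN_LOW_MASK) |
	       ((gen << SK_GEN_HIGH_SHIFT) & SK_GEN_HIGH_MASK);
}

/* Gather the generation back out of an SPTE. */
static uint64_t sk_spte_to_gen(uint64_t spte)
{
	return ((spte & SK_GEN_LOW_MASK) >> SK_GEN_LOW_SHIFT) |
	       ((spte & SK_GEN_HIGH_MASK) >> SK_GEN_HIGH_SHIFT);
}
```

Generation bit 7 is the first one to land in the high range (SPTE bit
52), and no generation value sets SPTE bit 10.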



* [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte to understand MBEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (12 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-23  6:16   ` Chao Gao
  2025-05-12 21:16   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte " Jon Kohler
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Mickaël Salaün, Jon Kohler

From: Mickaël Salaün <mic@digikod.net>

Extend is_executable_pte() to distinguish user-executable from
kernel-executable pages, and plumb kvm_vcpu into the kvm_mmu_set_spte
tracepoint so that it reports the correct execute permissions.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>

---
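Not part of the patch: a userspace sketch of the check that
is_executable_pte() performs after this change, assuming EPT masks
(shadow_nx_mask == 0, X = bit 2, user-exec = bit 10). With MBEC off the
mode argument is ignored; with MBEC on, kernel mode tests bit 2 and
user mode tests bit 10. The SK_* names and sk_is_executable() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_EPT_X   (1ull << 2)   /* supervisor (or any-mode) execute */
#define SK_EPT_UX  (1ull << 10)  /* user-mode execute, MBEC only */

static bool sk_is_executable(uint64_t spte, bool for_kernel_mode, bool mbec)
{
	uint64_t x_mask = SK_EPT_X;

	/* With MBEC, pick exactly one execute bit based on the mode. */
	if (mbec)
		x_mask = for_kernel_mode ? SK_EPT_X : SK_EPT_UX;

	return (spte & x_mask) == x_mask;
}
```

The tracepoint calls the check twice, once per mode, which is where the
separate "X" and "x" columns in the trace output come from.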
 arch/x86/kvm/mmu/mmu.c      | 11 ++++++-----
 arch/x86/kvm/mmu/mmutrace.h | 15 +++++++++------
 arch/x86/kvm/mmu/spte.h     | 15 +++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c  |  2 +-
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 791413b93589..5127520f01d2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2951,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 		ret = RET_PF_SPURIOUS;
 	} else {
 		flush |= mmu_spte_update(sptep, spte);
-		trace_kvm_mmu_set_spte(level, gfn, sptep);
+		trace_kvm_mmu_set_spte(vcpu, level, gfn, sptep);
 	}
 
 	if (wrprot && write_fault)
@@ -3430,10 +3430,11 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
+static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte,
+			      struct kvm_vcpu *vcpu)
 {
 	if (fault->exec)
-		return is_executable_pte(spte);
+		return is_executable_pte(spte, !fault->user, vcpu);
 
 	if (fault->write)
 		return is_writable_pte(spte);
@@ -3514,7 +3515,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * Need not check the access of upper level table entries since
 		 * they are always ACC_ALL.
 		 */
-		if (is_access_allowed(fault, spte)) {
+		if (is_access_allowed(fault, spte, vcpu)) {
 			ret = RET_PF_SPURIOUS;
 			break;
 		}
@@ -3561,7 +3562,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		/* Verify that the fault can be handled in the fast path */
 		if (new_spte == spte ||
-		    !is_access_allowed(fault, new_spte))
+		    !is_access_allowed(fault, new_spte, vcpu))
 			break;
 
 		/*
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 2511fe64ca01..1067fb7ecd55 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -339,8 +339,8 @@ TRACE_EVENT(
 
 TRACE_EVENT(
 	kvm_mmu_set_spte,
-	TP_PROTO(int level, gfn_t gfn, u64 *sptep),
-	TP_ARGS(level, gfn, sptep),
+	TP_PROTO(struct kvm_vcpu *vcpu, int level, gfn_t gfn, u64 *sptep),
+	TP_ARGS(vcpu, level, gfn, sptep),
 
 	TP_STRUCT__entry(
 		__field(u64, gfn)
@@ -349,7 +349,8 @@ TRACE_EVENT(
 		__field(u8, level)
 		/* These depend on page entry type, so compute them now.  */
 		__field(bool, r)
-		__field(bool, x)
+		__field(bool, kx)
+		__field(bool, ux)
 		__field(signed char, u)
 	),
 
@@ -359,15 +360,17 @@ TRACE_EVENT(
 		__entry->sptep = virt_to_phys(sptep);
 		__entry->level = level;
 		__entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK);
-		__entry->x = is_executable_pte(__entry->spte);
+		__entry->kx = is_executable_pte(__entry->spte, true, vcpu);
+		__entry->ux = is_executable_pte(__entry->spte, false, vcpu);
 		__entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1;
 	),
 
-	TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx",
+	TP_printk("gfn %llx spte %llx (%s%s%s%s%s) level %d at %llx",
 		  __entry->gfn, __entry->spte,
 		  __entry->r ? "r" : "-",
 		  __entry->spte & PT_WRITABLE_MASK ? "w" : "-",
-		  __entry->x ? "x" : "-",
+		  __entry->kx ? "X" : "-",
+		  __entry->ux ? "x" : "-",
 		  __entry->u == -1 ? "" : (__entry->u ? "u" : "-"),
 		  __entry->level, __entry->sptep
 	)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1f7b388a56aa..fd7e29a0a567 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -346,9 +346,20 @@ static inline bool is_last_spte(u64 pte, int level)
 	return (level == PG_LEVEL_4K) || is_large_pte(pte);
 }
 
-static inline bool is_executable_pte(u64 spte)
+static inline bool is_executable_pte(u64 spte, bool for_kernel_mode,
+				     struct kvm_vcpu *vcpu)
 {
-	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
+	u64 x_mask = shadow_x_mask;
+
+	if (vcpu->arch.pt_guest_exec_control) {
+		x_mask |= shadow_ux_mask;
+		if (for_kernel_mode)
+			x_mask &= ~VMX_EPT_USER_EXECUTABLE_MASK;
+		else
+			x_mask &= ~VMX_EPT_EXECUTABLE_MASK;
+	}
+
+	return (spte & (x_mask | shadow_nx_mask)) == x_mask;
 }
 
 static inline kvm_pfn_t spte_to_pfn(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b996c1fdaab..6a799ab42687 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1056,7 +1056,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 				     new_spte);
 		ret = RET_PF_EMULATE;
 	} else {
-		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
+		trace_kvm_mmu_set_spte(vcpu, iter->level, iter->gfn,
 				       rcu_dereference(iter->sptep));
 	}
 
-- 
2.43.0



* [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte to understand MBEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (13 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte " Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 21:29   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls Jon Kohler
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler, Sergey Dyasli

Extend make_spte to mask in and out bits depending on MBEC enablement.

Note: For the RFC/v1 series, I've added several 'For Review' items that
may require closer inspection, as well as some long-winded
comments/annotations. These will be cleaned up for the next iteration
of the series after initial review.

Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Signed-off-by: Sergey Dyasli <sergey.dyasli@nutanix.com>

---
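Not part of the patch: the four-way branch in make_spte() can be
collapsed, as the review note in the diff suggests. A userspace sketch
of one possible compact form, assuming shadow_ux_mask is only
meaningful when MBEC is enabled. The SK_* names and sk_exec_bits() are
hypothetical, and nx_mask is passed in so both the EPT (0) and legacy
(bit 63) cases can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_ACC_EXEC       (1u << 0)
#define SK_ACC_USER_EXEC  (1u << 3)
#define SK_X_MASK         (1ull << 2)   /* stand-in for shadow_x_mask */
#define SK_UX_MASK        (1ull << 10)  /* stand-in for shadow_ux_mask */

/* Return the execute-related SPTE bits for a given access value. */
static uint64_t sk_exec_bits(unsigned int access, bool mbec, uint64_t nx_mask)
{
	uint64_t bits = 0;

	if (access & SK_ACC_EXEC)
		bits |= SK_X_MASK;			/* KMX = 1 */
	if (mbec && (access & SK_ACC_USER_EXEC))
		bits |= SK_UX_MASK;			/* UMX = 1 */

	/* Neither execute permission granted: fall back to NX. */
	return bits ? bits : nx_mask;
}
```

This yields the same result as the long-hand version for all four
KMX/UMX combinations and for the non-MBEC path.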
 arch/x86/kvm/mmu/paging_tmpl.h |  3 +++
 arch/x86/kvm/mmu/spte.c        | 30 ++++++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a3a5cacda614..7675239f2dd1 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -840,6 +840,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		 * then we should prevent the kernel from executing it
 		 * if SMEP is enabled.
 		 */
+		// FOR REVIEW:
+		// ACC_USER_EXEC_MASK seems not necessary to add here since
+		// SMEP is for kernel-only.
 		if (is_cr4_smep(vcpu->arch.mmu))
 			walker.pte_access &= ~ACC_EXEC_MASK;
 	}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 6f4994b3e6d0..89bdae3f9ada 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -178,6 +178,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	else if (kvm_mmu_page_ad_need_write_protect(sp))
 		spte |= SPTE_TDP_AD_WRPROT_ONLY;
 
+	// For LKML Review:
+	// In MBEC case, you can have exec only and also bit 10
+	// set for user exec only. Do we need to cater for that here?
 	spte |= shadow_present_mask;
 	if (!prefetch)
 		spte |= spte_shadow_accessed_mask(spte);
@@ -197,12 +200,31 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
 	    is_nx_huge_page_enabled(vcpu->kvm)) {
 		pte_access &= ~ACC_EXEC_MASK;
+		if (vcpu->arch.pt_guest_exec_control)
+			pte_access &= ~ACC_USER_EXEC_MASK;
 	}
 
-	if (pte_access & ACC_EXEC_MASK)
-		spte |= shadow_x_mask;
-	else
-		spte |= shadow_nx_mask;
+	// For LKML Review:
+	// We could probably optimize the logic here, but it is typed out
+	// longhand for now to make it clear how we're changing the control
+	// flow to support MBEC.
+	if (!vcpu->arch.pt_guest_exec_control) { // non-mbec logic
+		if (pte_access & ACC_EXEC_MASK)
+			spte |= shadow_x_mask;
+		else
+			spte |= shadow_nx_mask;
+	} else { // mbec logic
+		if (pte_access & ACC_EXEC_MASK) { /* mbec: kernel exec */
+			if (pte_access & ACC_USER_EXEC_MASK)
+				spte |= shadow_x_mask | shadow_ux_mask; // KMX = 1, UMX = 1
+			else
+				spte |= shadow_x_mask;  // KMX = 1, UMX = 0
+		} else if (pte_access & ACC_USER_EXEC_MASK) { /* mbec: user exec, no kernel exec */
+			spte |= shadow_ux_mask; // KMX = 0, UMX = 1
+		} else { /* mbec: nx */
+			spte |= shadow_nx_mask; // KMX = 0, UMX = 0
+		}
+	}
 
 	if (pte_access & ACC_USER_MASK)
 		spte |= shadow_user_mask;
-- 
2.43.0



* [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (14 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte " Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 21:32   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS Jon Kohler
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Set up Intel Mode-Based Execute Control (bit 22) for nested guests,
gated on module parameter enablement.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/vmx/nested.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 931a7361c30f..ce3a6d6dfce7 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7099,6 +7099,10 @@ static void nested_vmx_setup_secondary_ctls(u32 ept_caps,
 		 */
 		if (cpu_has_vmx_vmfunc())
 			msrs->vmfunc_controls = VMX_VMFUNC_EPTP_SWITCHING;
+
+		if (enable_pt_guest_exec_control)
+			msrs->secondary_ctls_high |=
+				SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
 	}
 
 	/*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (15 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-05-12 21:35   ` Sean Christopherson
  2025-03-13 20:36 ` [RFC PATCH 18/18] KVM: x86: Enable module parameter for MBEC Jon Kohler
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Vitaly Kuznetsov
  Cc: Jon Kohler

Extend EVMCS1_SUPPORTED_2NDEXEC to understand MBEC enablement,
otherwise presenting both EVMCS and MBEC at the same time will disable
MBEC presentation into the guest.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/vmx/hyperv.c       | 5 ++++-
 arch/x86/kvm/vmx/hyperv_evmcs.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
index fab6a1ad98dc..941a29c9e667 100644
--- a/arch/x86/kvm/vmx/hyperv.c
+++ b/arch/x86/kvm/vmx/hyperv.c
@@ -138,7 +138,10 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *
 		ctl_high &= evmcs_get_supported_ctls(EVMCS_EXEC_CTRL);
 		break;
 	case MSR_IA32_VMX_PROCBASED_CTLS2:
-		ctl_high &= evmcs_get_supported_ctls(EVMCS_2NDEXEC);
+		supported_ctrls = evmcs_get_supported_ctls(EVMCS_2NDEXEC);
+		if (!vcpu->arch.pt_guest_exec_control)
+			supported_ctrls &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+		ctl_high &= supported_ctrls;
 		break;
 	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
 	case MSR_IA32_VMX_PINBASED_CTLS:
diff --git a/arch/x86/kvm/vmx/hyperv_evmcs.h b/arch/x86/kvm/vmx/hyperv_evmcs.h
index a543fccfc574..930429f376f9 100644
--- a/arch/x86/kvm/vmx/hyperv_evmcs.h
+++ b/arch/x86/kvm/vmx/hyperv_evmcs.h
@@ -87,6 +87,7 @@
 	 SECONDARY_EXEC_PT_CONCEAL_VMX |				\
 	 SECONDARY_EXEC_BUS_LOCK_DETECTION |				\
 	 SECONDARY_EXEC_NOTIFY_VM_EXITING |				\
+	 SECONDARY_EXEC_MODE_BASED_EPT_EXEC |				\
 	 SECONDARY_EXEC_ENCLS_EXITING)
 
 #define EVMCS1_SUPPORTED_3RDEXEC (0ULL)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread
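
The filtering that the hyperv.c hunk above performs for MSR_IA32_VMX_PROCBASED_CTLS2 can be sketched in isolation as follows. This is a minimal illustration, assuming the MBEC control is secondary exec control bit 22 as stated in the cover letter; sketch_filter_2ndexec() and its parameters are hypothetical stand-ins for the kernel's internal state.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MBEC_CTRL (1u << 22) /* secondary exec control bit 22 */

/*
 * Mask the advertised secondary controls down to what eVMCS supports,
 * additionally hiding MBEC when it is not enabled for this vCPU.
 */
static uint32_t sketch_filter_2ndexec(uint32_t ctl_high,
				      uint32_t evmcs_supported,
				      bool vcpu_mbec)
{
	uint32_t supported = evmcs_supported;

	if (!vcpu_mbec)
		supported &= ~SKETCH_MBEC_CTRL;

	return ctl_high & supported;
}
```

With the bit present in the supported set, MBEC survives the filter only when the vCPU has it enabled; without the hyperv_evmcs.h change it would always be masked off.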

* [RFC PATCH 18/18] KVM: x86: Enable module parameter for MBEC
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (16 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS Jon Kohler
@ 2025-03-13 20:36 ` Jon Kohler
  2025-04-15  9:29 ` [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Mickaël Salaün
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-03-13 20:36 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel
  Cc: Jon Kohler

Enable 'enable_pt_guest_exec_control', which will allow user space to
control enablement of Intel MBEC by advertising secondary exec control
bit 22 to a given vCPU.

Signed-off-by: Jon Kohler <jon@nutanix.com>

---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b2fbb9088ea..607ed2142ce8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -197,7 +197,7 @@ module_param(eager_page_split, bool, 0644);
 static bool __read_mostly mitigate_smt_rsb;
 module_param(mitigate_smt_rsb, bool, 0444);
 
-bool __read_mostly enable_pt_guest_exec_control;
+bool __read_mostly enable_pt_guest_exec_control = true;
 EXPORT_SYMBOL_GPL(enable_pt_guest_exec_control);
 module_param(enable_pt_guest_exec_control, bool, 0444);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (17 preceding siblings ...)
  2025-03-13 20:36 ` [RFC PATCH 18/18] KVM: x86: Enable module parameter for MBEC Jon Kohler
@ 2025-04-15  9:29 ` Mickaël Salaün
  2025-04-15 14:43   ` Sean Christopherson
  2025-04-15 14:43   ` Jon Kohler
  2025-04-23 13:54 ` Adrian-Ken Rueegsegger
  2025-05-12 21:46 ` Sean Christopherson
  20 siblings, 2 replies; 62+ messages in thread
From: Mickaël Salaün @ 2025-04-15  9:29 UTC (permalink / raw)
  To: Jon Kohler, Sean Christopherson, Paolo Bonzini
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, kvm, linux-kernel,
	Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Tao Su, Xiaoyao Li, Zhao Liu

Hi,

This series looks good, just some inlined questions.

Sean, Paolo, what do you think?

Jon, what is the status of the QEMU patches?

Regards,
 Mickaël

On Thu, Mar 13, 2025 at 01:36:39PM -0700, Jon Kohler wrote:
> ## Summary
> This series introduces support for Intel Mode-Based Execute Control
> (MBEC) to KVM and nested VMX virtualization, aiming to significantly
> reduce VMexits and improve performance for Windows guests running with
> Hypervisor-Protected Code Integrity (HVCI).
> 
> ## What?
> Intel MBEC is a hardware feature, introduced in the Kabylake
> generation, that allows for more granular control over execution
> permissions. MBEC enables the separation and tracking of execution
> permissions for supervisor (kernel) and user-mode code. It is used as
> an accelerator for Microsoft's Memory Integrity [1] (also known as
> hypervisor-protected code integrity or HVCI).
> 
> ## Why?
> The primary reason for this feature is performance.
> 
> Without hardware-level MBEC, enabling Windows HVCI runs a 'software
> MBEC' known as Restricted User Mode, which imposes a runtime overhead
> due to increased state transitions between the guest's L2 root
> partition and the L2 secure partition for running kernel mode code
> integrity operations.
> 
> In practice, this results in a significant number of exits. For
> example, playing a YouTube video within the Edge Browser produces
> roughly 1.2 million VMexits/second across an 8 vCPU Windows 11 guest.
> 
> Most of these exits are VMREAD/VMWRITE operations, which can be
> emulated with Enlightened VMCS (eVMCS). However, even with eVMCS, this
> configuration still produces around 200,000 VMexits/second.
> 
> With MBEC exposed to the L1 Windows Hypervisor, the same scenario
> results in approximately 50,000 VMexits/second, a *24x* reduction from
> the baseline.
> 
> Not a typo, 24x reduction in VMexits.
> 
> ## How?
> This series implements core KVM support for exposing the MBEC bit in
> secondary execution controls (bit 22) to L1 and L2, based on
> configuration from user space and a module parameter
> 'enable_pt_guest_exec_control'. The inspiration for this series
> started with Mickaël's series for Heki [3], where we've extracted,
> refactored, and extended the MBEC-specific use case to be
> general-purpose.
> 
> MBEC, which appears in Linux /proc/cpuinfo as ept_mode_based_exec,
> splits the EPT exec bit (bit 2 in PTE) into two bits. When secondary
> execution control bit 22 is set, PTE bit 2 reflects supervisor mode
> executable, and PTE bit 10 reflects user mode executable.
> 
> The semantics for EPT violation qualifications also change when MBEC
> is enabled, with bit 5 reflecting supervisor/kernel mode execute
> permissions and bit 6 reflecting user mode execute permissions.
> This ultimately serves to expose this feature to the L1 hypervisor,
> which consumes MBEC and informs the L2 partitions not to use the
> software MBEC by removing bit 14 in 0x40000004 EAX [4].
> 
> ## Where?
> Enablement spans both VMX code and MMU code to teach the shadow MMU
> about the different execution modes, as well as user space VMM to pass
> secondary execution control bit 22. A patch for QEMU enablement is
> available [5].
> 
> ## Testing
> Initial testing has been done on 6.12-based code with:
>   Guests
>     - Windows 11 24H2 26100.2894
>     - Windows Server 2025 24H2 26100.2894
>     - Windows Server 2022 W1H2 20348.825
>   Processors:
>     - Intel Skylake 6154
>     - Intel Sapphire Rapids 6444Y
> 
> ## Acknowledgements
> Special thanks to all contributors and reviewers who have provided
> valuable feedback and support for this patch series.
> 
> [1] https://learn.microsoft.com/en-us/windows/security/hardware-security/enable-virtualization-based-protection-of-code-integrity
> [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization#enlightened-vmcs-intel
> [3] https://patchwork.kernel.org/project/kvm/patch/20231113022326.24388-6-mic@digikod.net/
> [4] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/feature-discovery#implementation-recommendations---0x40000004
> [5] https://github.com/JonKohler/qemu/tree/mbec-rfc-v1
> 
> Cc: Alexander Grest <Alexander.Grest@microsoft.com>
> Cc: Nicolas Saenz Julienne <nsaenz@amazon.es>
> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> Cc: Mickaël Salaün <mic@digikod.net>
> Cc: Tao Su <tao1.su@linux.intel.com>
> Cc: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Zhao Liu <zhao1.liu@intel.com>
> 
> Jon Kohler (11):
>   KVM: x86: Add module parameter for Intel MBEC
>   KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
>   KVM: VMX: Wire up Intel MBEC enable/disable logic
>   KVM: x86/mmu: Remove SPTE_PERM_MASK
>   KVM: VMX: Extend EPT Violation protection bits
>   KVM: x86/mmu: Introduce shadow_ux_mask
>   KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
>   KVM: x86/mmu: Extend make_spte to understand MBEC
>   KVM: nVMX: Setup Intel MBEC in nested secondary controls
>   KVM: VMX: Allow MBEC with EVMCS
>   KVM: x86: Enable module parameter for MBEC
> 
> Mickaël Salaün (5):
>   KVM: VMX: add cpu_has_vmx_mbec helper
>   KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK
>   KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
>   KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC
>   KVM: x86/mmu: Extend is_executable_pte to understand MBEC
> 
> Nikolay Borisov (1):
>   KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines
> 
> Sean Christopherson (1):
>   KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits
> 
>  arch/x86/include/asm/kvm_host.h | 13 +++++----
>  arch/x86/include/asm/vmx.h      | 45 ++++++++++++++++++++---------
>  arch/x86/kvm/mmu.h              |  3 +-
>  arch/x86/kvm/mmu/mmu.c          | 13 +++++----
>  arch/x86/kvm/mmu/mmutrace.h     | 23 ++++++++++-----
>  arch/x86/kvm/mmu/paging_tmpl.h  | 19 +++++++++---
>  arch/x86/kvm/mmu/spte.c         | 51 ++++++++++++++++++++++++++++-----
>  arch/x86/kvm/mmu/spte.h         | 36 +++++++++++++++--------
>  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>  arch/x86/kvm/vmx/capabilities.h |  6 ++++
>  arch/x86/kvm/vmx/hyperv.c       |  5 +++-
>  arch/x86/kvm/vmx/hyperv_evmcs.h |  1 +
>  arch/x86/kvm/vmx/nested.c       |  4 +++
>  arch/x86/kvm/vmx/vmx.c          | 21 ++++++++++++--
>  arch/x86/kvm/vmx/vmx.h          |  7 +++++
>  arch/x86/kvm/x86.c              |  4 +++
>  16 files changed, 192 insertions(+), 61 deletions(-)
> 
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-04-15  9:29 ` [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Mickaël Salaün
@ 2025-04-15 14:43   ` Sean Christopherson
  2025-05-12 15:26     ` Jon Kohler
  2025-04-15 14:43   ` Jon Kohler
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-04-15 14:43 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jon Kohler, Paolo Bonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm, linux-kernel, Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Tao Su, Xiaoyao Li, Zhao Liu

On Tue, Apr 15, 2025, Mickaël Salaün wrote:
> Hi,
> 
> This series looks good, just some inlined questions.
> 
> Sean, Paolo, what do you think?

It's high up on my todo, but I've been swamped with non-upstream stuff for the
last few weeks (and I'm not quite out of the woods), so I might not get to it
this week.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-04-15  9:29 ` [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Mickaël Salaün
  2025-04-15 14:43   ` Sean Christopherson
@ 2025-04-15 14:43   ` Jon Kohler
  2025-04-16 15:44     ` Mickaël Salaün
  1 sibling, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-04-15 14:43 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Sean Christopherson, Paolo Bonzini, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Alexander Grest,
	Nicolas Saenz Julienne, Madhavan T . Venkataraman, Tao Su,
	Xiaoyao Li, Zhao Liu



> On Apr 15, 2025, at 5:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
> 
> Hi,
> 
> This series looks good, just some inlined questions.

RE Inlined questions - Did you send those elsewhere? I didn’t
see any others in my inbox, nor on lore.

> Sean, Paolo, what do you think?
> 
> Jon, what is the status of the QEMU patches?

I was waiting for comments here before sending to mailing list, but
I did post a link to the tree in the cover letter. The actual commit itself
is wicked trivial, so knock on wood, I’d imagine that would be the easiest
part of this endeavor.

Would you suggest I send those to the QEMU mailing list now, while the kernel
side is still in RFC? Happy to do so if that makes sense.

https://github.com/JonKohler/qemu/commit/7a245414a0138b83cabcb809f5585ef8b5f78553

> Regards,
> Mickaël
> 
> On Thu, Mar 13, 2025 at 01:36:39PM -0700, Jon Kohler wrote:
>> ## Summary
>> This series introduces support for Intel Mode-Based Execute Control
>> (MBEC) to KVM and nested VMX virtualization, aiming to significantly
>> reduce VMexits and improve performance for Windows guests running with
>> Hypervisor-Protected Code Integrity (HVCI).
>> 
>> ## What?
>> Intel MBEC is a hardware feature, introduced in the Kabylake
>> generation, that allows for more granular control over execution
>> permissions. MBEC enables the separation and tracking of execution
>> permissions for supervisor (kernel) and user-mode code. It is used as
>> an accelerator for Microsoft's Memory Integrity [1] (also known as
>> hypervisor-protected code integrity or HVCI).
>> 
>> ## Why?
>> The primary reason for this feature is performance.
>> 
>> Without hardware-level MBEC, enabling Windows HVCI runs a 'software
>> MBEC' known as Restricted User Mode, which imposes a runtime overhead
>> due to increased state transitions between the guest's L2 root
>> partition and the L2 secure partition for running kernel mode code
>> integrity operations.
>> 
>> In practice, this results in a significant number of exits. For
>> example, playing a YouTube video within the Edge Browser produces
>> roughly 1.2 million VMexits/second across an 8 vCPU Windows 11 guest.
>> 
>> Most of these exits are VMREAD/VMWRITE operations, which can be
>> emulated with Enlightened VMCS (eVMCS). However, even with eVMCS, this
>> configuration still produces around 200,000 VMexits/second.
>> 
>> With MBEC exposed to the L1 Windows Hypervisor, the same scenario
>> results in approximately 50,000 VMexits/second, a *24x* reduction from
>> the baseline.
>> 
>> Not a typo, 24x reduction in VMexits.
>> 
>> ## How?
>> This series implements core KVM support for exposing the MBEC bit in
>> secondary execution controls (bit 22) to L1 and L2, based on
>> configuration from user space and a module parameter
>> 'enable_pt_guest_exec_control'. The inspiration for this series
>> started with Mickaël's series for Heki [3], where we've extracted,
>> refactored, and extended the MBEC-specific use case to be
>> general-purpose.
>> 
>> MBEC, which appears in Linux /proc/cpuinfo as ept_mode_based_exec,
>> splits the EPT exec bit (bit 2 in PTE) into two bits. When secondary
>> execution control bit 22 is set, PTE bit 2 reflects supervisor mode
>> executable, and PTE bit 10 reflects user mode executable.
>> 
>> The semantics for EPT violation qualifications also change when MBEC
>> is enabled, with bit 5 reflecting supervisor/kernel mode execute
>> permissions and bit 6 reflecting user mode execute permissions.
>> This ultimately serves to expose this feature to the L1 hypervisor,
>> which consumes MBEC and informs the L2 partitions not to use the
>> software MBEC by removing bit 14 in 0x40000004 EAX [4].
>> 
>> ## Where?
>> Enablement spans both VMX code and MMU code to teach the shadow MMU
>> about the different execution modes, as well as user space VMM to pass
>> secondary execution control bit 22. A patch for QEMU enablement is
>> available [5].
>> 
>> ## Testing
>> Initial testing has been done on 6.12-based code with:
>>  Guests
>>    - Windows 11 24H2 26100.2894
>>    - Windows Server 2025 24H2 26100.2894
>>    - Windows Server 2022 W1H2 20348.825
>>  Processors:
>>    - Intel Skylake 6154
>>    - Intel Sapphire Rapids 6444Y
>> 
>> ## Acknowledgements
>> Special thanks to all contributors and reviewers who have provided
>> valuable feedback and support for this patch series.
>> 
>> [1] https://learn.microsoft.com/en-us/windows/security/hardware-security/enable-virtualization-based-protection-of-code-integrity
>> [2] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization#enlightened-vmcs-intel
>> [3] https://patchwork.kernel.org/project/kvm/patch/20231113022326.24388-6-mic@digikod.net/
>> [4] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/feature-discovery#implementation-recommendations---0x40000004
>> [5] https://github.com/JonKohler/qemu/tree/mbec-rfc-v1
>> 
>> Cc: Alexander Grest <Alexander.Grest@microsoft.com>
>> Cc: Nicolas Saenz Julienne <nsaenz@amazon.es>
>> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
>> Cc: Mickaël Salaün <mic@digikod.net>
>> Cc: Tao Su <tao1.su@linux.intel.com>
>> Cc: Xiaoyao Li <xiaoyao.li@intel.com>
>> Cc: Zhao Liu <zhao1.liu@intel.com>
>> 
>> Jon Kohler (11):
>>  KVM: x86: Add module parameter for Intel MBEC
>>  KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
>>  KVM: VMX: Wire up Intel MBEC enable/disable logic
>>  KVM: x86/mmu: Remove SPTE_PERM_MASK
>>  KVM: VMX: Extend EPT Violation protection bits
>>  KVM: x86/mmu: Introduce shadow_ux_mask
>>  KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
>>  KVM: x86/mmu: Extend make_spte to understand MBEC
>>  KVM: nVMX: Setup Intel MBEC in nested secondary controls
>>  KVM: VMX: Allow MBEC with EVMCS
>>  KVM: x86: Enable module parameter for MBEC
>> 
>> Mickaël Salaün (5):
>>  KVM: VMX: add cpu_has_vmx_mbec helper
>>  KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK
>>  KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
>>  KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC
>>  KVM: x86/mmu: Extend is_executable_pte to understand MBEC
>> 
>> Nikolay Borisov (1):
>>  KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines
>> 
>> Sean Christopherson (1):
>>  KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits
>> 
>> arch/x86/include/asm/kvm_host.h | 13 +++++----
>> arch/x86/include/asm/vmx.h      | 45 ++++++++++++++++++++---------
>> arch/x86/kvm/mmu.h              |  3 +-
>> arch/x86/kvm/mmu/mmu.c          | 13 +++++----
>> arch/x86/kvm/mmu/mmutrace.h     | 23 ++++++++++-----
>> arch/x86/kvm/mmu/paging_tmpl.h  | 19 +++++++++---
>> arch/x86/kvm/mmu/spte.c         | 51 ++++++++++++++++++++++++++++-----
>> arch/x86/kvm/mmu/spte.h         | 36 +++++++++++++++--------
>> arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>> arch/x86/kvm/vmx/capabilities.h |  6 ++++
>> arch/x86/kvm/vmx/hyperv.c       |  5 +++-
>> arch/x86/kvm/vmx/hyperv_evmcs.h |  1 +
>> arch/x86/kvm/vmx/nested.c       |  4 +++
>> arch/x86/kvm/vmx/vmx.c          | 21 ++++++++++++--
>> arch/x86/kvm/vmx/vmx.h          |  7 +++++
>> arch/x86/kvm/x86.c              |  4 +++
>> 16 files changed, 192 insertions(+), 61 deletions(-)
>> 
>> -- 
>> 2.43.0



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-04-15 14:43   ` Jon Kohler
@ 2025-04-16 15:44     ` Mickaël Salaün
  0 siblings, 0 replies; 62+ messages in thread
From: Mickaël Salaün @ 2025-04-16 15:44 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Sean Christopherson, Paolo Bonzini, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Alexander Grest,
	Nicolas Saenz Julienne, Madhavan T . Venkataraman, Tao Su,
	Xiaoyao Li, Zhao Liu

On Tue, Apr 15, 2025 at 02:43:57PM +0000, Jon Kohler wrote:
> 
> 
> > On Apr 15, 2025, at 5:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
> > 
> > Hi,
> > 
> > This series looks good, just some inlined questions.
> 
> RE Inlined questions - Did you send those elsewhere? I didn’t
> see any others in my inbox, nor on lore.

No, I just wanted to highlight that you inserted questions in several
patches. :)

> 
> > Sean, Paolo, what do you think?
> > 
> > Jon, what is the status of the QEMU patches?
> 
> I was waiting for comments here before sending to mailing list, but
> I did post a link to the tree in the cover letter. The actual commit itself
> is wicked trivial, so knock on wood, I’d imagine that would be the easiest
> part of this endeavor.
> 
> Would you suggest I sent those to QEMU mailing list now, while kernel side
> is still in RFC? Happy to do so if that makes sense.
> 
> https://github.com/JonKohler/qemu/commit/7a245414a0138b83cabcb809f5585ef8b5f78553

You can wait until Sean gets a look at this series, but you don't need
to wait for it to be merged before starting a discussion with QEMU
developers.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
  2025-03-13 20:36 ` [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch Jon Kohler
@ 2025-04-22  6:27   ` Chao Gao
  2025-05-12 18:15   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Chao Gao @ 2025-04-22  6:27 UTC (permalink / raw)
  To: Jon Kohler
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025 at 01:36:44PM -0700, Jon Kohler wrote:
>Add bool for pt_guest_exec_control to kvm_vcpu_arch, to be used for
>runtime checks for Intel Mode Based Execution Control (MBEC) and
>AMD Guest Mode Execute Control (GMET).
>
>Signed-off-by: Jon Kohler <jon@nutanix.com>
>
>---
> arch/x86/include/asm/kvm_host.h | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index fd37dad38670..192233eb557a 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -856,6 +856,8 @@ struct kvm_vcpu_arch {
> 	struct kvm_hypervisor_cpuid kvm_cpuid;
> 	bool is_amd_compatible;
> 
>+	bool pt_guest_exec_control;

What is the purpose of this field? Does it indicate whether MBEC is enabled
for L1, L2, or VMCS12?

If it is intended to track whether MBEC is enabled in VMCS12, I think you
need to introduce a new bit in kvm_mmu_page_role rather than using a
per-vCPU variable. This way, the entire shadow EPT is reconstructed if the
L1 VMM toggles the MBEC control bit in VMCS12. Reconstruction is necessary
because toggling MBEC changes the meaning of bits 2 and 10 in the EPT page
tables, i.e., previous shadow MMU pages cannot be reused.

^ permalink raw reply	[flat|nested] 62+ messages in thread
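
The page-role suggestion above can be illustrated with a simplified model. This is only a sketch of the idea, assuming shadow pages are looked up by comparing a packed role word; union sketch_page_role, its fields, and the helper names are hypothetical and much smaller than KVM's real kvm_mmu_page_role.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for a page role; not KVM's layout. */
union sketch_page_role {
	uint32_t word;
	struct {
		unsigned int level:4;
		unsigned int access:4;
		unsigned int mbec:1;   /* the extra bit suggested in the review */
	} f;
};

static union sketch_page_role sketch_make_role(unsigned int level,
					       unsigned int access, bool mbec)
{
	union sketch_page_role role = { .word = 0 };

	role.f.level = level;
	role.f.access = access;
	role.f.mbec = mbec;
	return role;
}

/* A cached shadow page is reusable only if the whole role word matches. */
static bool sketch_can_reuse(union sketch_page_role cached,
			     union sketch_page_role wanted)
{
	return cached.word == wanted.word;
}
```

In this model, toggling MBEC changes the role word, so pages built under the old interpretation of bits 2 and 10 no longer match and would have to be rebuilt, which is the behaviour the review asks for.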

* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-03-13 20:36 ` [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic Jon Kohler
@ 2025-04-22  7:06   ` Chao Gao
  2025-05-12 18:23   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Chao Gao @ 2025-04-22  7:06 UTC (permalink / raw)
  To: Jon Kohler
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025 at 01:36:45PM -0700, Jon Kohler wrote:
>Add logic to enable / disable Intel Mode Based Execution Control (MBEC)
>based on specific conditions.
>
>MBEC depends on:
>- User space exposing secondary execution control bit 22

The code below doesn't check this.

>- Extended Page Tables (EPT)
>- The KVM module parameter `enable_pt_guest_exec_control`
>
>If any of these conditions are not met, MBEC will be disabled
>accordingly.
>
>Store runtime enablement within `kvm_vcpu_arch.pt_guest_exec_control`.
>
>Signed-off-by: Jon Kohler <jon@nutanix.com>
>
>---
> arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
> arch/x86/kvm/vmx/vmx.h |  7 +++++++
> 2 files changed, 18 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 7a98f03ef146..116910159a3f 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> 			return -EIO;
> 
> 		vmx_cap->ept = 0;
>+		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> 		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> 	}
> 	if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
>@@ -4641,11 +4642,15 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> 		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
> 	if (!enable_ept) {
> 		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
>+		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> 		exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> 		enable_unrestricted_guest = 0;
> 	}
> 	if (!enable_unrestricted_guest)
> 		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
>+	if (!enable_pt_guest_exec_control)
>+		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>+
> 	if (kvm_pause_in_guest(vmx->vcpu.kvm))
> 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> 	if (!kvm_vcpu_apicv_active(vcpu))
>@@ -4770,6 +4775,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> 		if (vmx->ve_info)
> 			vmcs_write64(VE_INFORMATION_ADDRESS,
> 				     __pa(vmx->ve_info));
>+
>+		vmx->vcpu.arch.pt_guest_exec_control =
>+			enable_pt_guest_exec_control && vmx_has_mbec(vmx);

Is it possible for vmx->vcpu.arch.pt_guest_exec_control and
enable_pt_guest_exec_control to differ?

To me, the answer is no. So, why not use enable_pt_guest_exec_control
directly?

> 	}
> 
> 	if (cpu_has_tertiary_exec_ctrls())
>@@ -8472,6 +8480,9 @@ __init int vmx_hardware_setup(void)
> 	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
> 		enable_unrestricted_guest = 0;
> 
>+	if (!cpu_has_vmx_mbec() || !enable_ept)
>+		enable_pt_guest_exec_control = false;
>+
> 	if (!cpu_has_vmx_flexpriority())
> 		flexpriority_enabled = 0;
> 
>diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>index d1e537bf50ea..9f4ae3139a90 100644
>--- a/arch/x86/kvm/vmx/vmx.h
>+++ b/arch/x86/kvm/vmx/vmx.h
>@@ -580,6 +580,7 @@ static inline u8 vmx_get_rvi(void)
> 	 SECONDARY_EXEC_ENABLE_VMFUNC |					\
> 	 SECONDARY_EXEC_BUS_LOCK_DETECTION |				\
> 	 SECONDARY_EXEC_NOTIFY_VM_EXITING |				\
>+	 SECONDARY_EXEC_MODE_BASED_EPT_EXEC |				\
> 	 SECONDARY_EXEC_ENCLS_EXITING |					\
> 	 SECONDARY_EXEC_EPT_VIOLATION_VE)
> 
>@@ -721,6 +722,12 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
> 		SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
> }
> 
>+static inline bool vmx_has_mbec(struct vcpu_vmx *vmx)
>+{
>+	return secondary_exec_controls_get(vmx) &
>+		SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>+}
>+
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> 	if (!enable_ept)
>-- 
>2.43.0
>

^ permalink raw reply	[flat|nested] 62+ messages in thread
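
The enablement conditions discussed in this review can be modelled as two small predicates. This is a hedged sketch that follows the vmx_hardware_setup() and init_vmcs() hunks quoted above; the function names and the plain bool/u32 parameters are hypothetical simplifications of the kernel's state.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MBEC_CTRL (1u << 22) /* secondary exec control bit 22 */

/* Models the clamp applied to the module parameter at hardware setup. */
static bool sketch_clamp_module_param(bool requested, bool cpu_has_mbec,
				      bool ept_enabled)
{
	return requested && cpu_has_mbec && ept_enabled;
}

/* Models the per-vCPU flag assignment made while initialising the VMCS. */
static bool sketch_vcpu_mbec(bool module_param,
			     uint32_t secondary_exec_controls)
{
	return module_param && (secondary_exec_controls & SKETCH_MBEC_CTRL);
}
```

In this model the per-vCPU flag can differ from the module parameter only when bit 22 is clear in that vCPU's secondary controls while the parameter is on. Whether the series can actually reach that state is the question raised in the review.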

* Re: [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask
  2025-03-13 20:36 ` [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask Jon Kohler
@ 2025-04-23  3:06   ` Chao Gao
  2025-05-12 19:13   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Chao Gao @ 2025-04-23  3:06 UTC (permalink / raw)
  To: Jon Kohler
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Sergey Dyasli

>@@ -28,6 +28,7 @@ u64 __read_mostly shadow_host_writable_mask;
> u64 __read_mostly shadow_mmu_writable_mask;
> u64 __read_mostly shadow_nx_mask;
> u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
>+u64 __read_mostly shadow_ux_mask;
> u64 __read_mostly shadow_user_mask;
> u64 __read_mostly shadow_accessed_mask;
> u64 __read_mostly shadow_dirty_mask;
>@@ -313,8 +314,14 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
> 		 * the page executable as the NX hugepage mitigation no longer
> 		 * applies.
> 		 */
>-		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm))
>+		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm)) {

Is it possible that role.access has ACC_USER_EXEC_MASK set but not ACC_EXEC_MASK?
If so, this condition should be:

		if ((role.access & (ACC_EXEC_MASK | ACC_USER_EXEC_MASK)) &&
		    is_nx_huge_page_enabled(kvm)) {

> 			child_spte = make_spte_executable(child_spte);
>+			// TODO: For LKML: switch to vcpu->arch.pt_guest_exec_control? up
>+			// for suggestions on how best to toggle this.
>+			if (enable_pt_guest_exec_control &&

If enable_pt_guest_exec_control is 0, then shadow_ux_mask will also be 0, i.e.,
this check isn't needed.

>+			    role.access & ACC_USER_EXEC_MASK)
>+				child_spte |= shadow_ux_mask;

MBEC can be tracked in the kvm_mmu_page_role, then you can do:

			if (role.mbec_enabled && role.access & ACC_USER_EXEC_MASK)
				child_spte |= shadow_ux_mask;

>+		}
> 	}
> 
> 	return child_spte;
>@@ -326,7 +333,7 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
> 	u64 spte = SPTE_MMU_PRESENT_MASK;
> 
> 	spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
>-		shadow_user_mask | shadow_x_mask | shadow_me_value;
>+		shadow_user_mask | shadow_x_mask | shadow_ux_mask | shadow_me_value;
> 
> 	if (ad_disabled)
> 		spte |= SPTE_TDP_AD_DISABLED;
>@@ -420,7 +427,8 @@ void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);
> 
>-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
>+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only,
>+			   bool has_guest_exec_ctrl)
> {
> 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
> 	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
>@@ -428,8 +436,14 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> 	shadow_nx_mask		= 0ull;
> 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
> 	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
>+	// For LKML Review:
>+	// Do we need to modify shadow_present_mask in the MBEC case?
> 	shadow_present_mask	=
> 		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;

No, the SDM requires that present EPT entries have VMX_EPT_READABLE_MASK
set if execute-only translations are not supported. (The VMX_EPT_SUPPRESS_VE_BIT
is for TDX.)

MBEC does not change this requirement.

>+
>+	shadow_ux_mask		=
>+		has_guest_exec_ctrl ? VMX_EPT_USER_EXECUTABLE_MASK : 0ull;
>+
> 	/*
> 	 * EPT overrides the host MTRRs, and so KVM must program the desired
> 	 * memtype directly into the SPTEs.  Note, this mask is just the mask
>@@ -484,6 +498,7 @@ void kvm_mmu_reset_all_pte_masks(void)
> 	shadow_dirty_mask	= PT_DIRTY_MASK;
> 	shadow_nx_mask		= PT64_NX_MASK;
> 	shadow_x_mask		= 0;
>+	shadow_ux_mask		= 0;
> 	shadow_present_mask	= PT_PRESENT_MASK;
> 
> 	/*
>diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
>index d9e22133b6d0..dc2f0dc9c46e 100644
>--- a/arch/x86/kvm/mmu/spte.h
>+++ b/arch/x86/kvm/mmu/spte.h
>@@ -171,6 +171,7 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
> extern u64 __read_mostly shadow_nx_mask;
> extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> extern u64 __read_mostly shadow_user_mask;
>+extern u64 __read_mostly shadow_ux_mask;
> extern u64 __read_mostly shadow_accessed_mask;
> extern u64 __read_mostly shadow_dirty_mask;
> extern u64 __read_mostly shadow_mmio_value;
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 0aadfa924045..d16e3f170258 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -8544,7 +8544,8 @@ __init int vmx_hardware_setup(void)
> 
> 	if (enable_ept)
> 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
>-				      cpu_has_vmx_ept_execute_only());
>+				      cpu_has_vmx_ept_execute_only(),
>+				      enable_pt_guest_exec_control);
> 
> 	/*
> 	 * Setup shadow_me_value/shadow_me_mask to include MKTME KeyID
>-- 
>2.43.0
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
  2025-03-13 20:36 ` [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC Jon Kohler
@ 2025-04-23  5:37   ` Chao Gao
  2025-05-12 19:37   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Chao Gao @ 2025-04-23  5:37 UTC (permalink / raw)
  To: Jon Kohler
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025 at 01:36:52PM -0700, Jon Kohler wrote:
>Adjust the SPTE_MMIO_ALLOWED_MASK and associated values to make these
>masks aware of PTE Bit 10, to be used by Intel MBEC.
>
>Intel SDM 30.3.3.1 EPT Misconfigurations states:
>  An EPT misconfiguration occurs if translation of a guest-physical
>  address encounters an EPT paging-structure entry that meets any of
>  the following conditions:
>   - Bit 0 of the entry is clear (indicating that data reads are not
>     allowed) and any of the following hold:
>     — Bit 1 is set (indicating that data writes are allowed).
>     — The processor does not support execute-only translations and
>       either of the following hold:
>       - Bit 2 is set (indicating that instruction fetches are allowed)
>         Note: If the “mode-based execute control for EPT” VM-execution
>         control is 1, setting bit 2 indicates that instruction fetches
>         are allowed from supervisor-mode linear addresses.
>       - The “mode-based execute control for EPT” VM-execution control
>         is 1 and bit 10 is set (indicating that instruction fetches
>         are allowed from user-mode linear addresses).
>
>For LKML Review:
>SDM 30.3.3.1 also states that "Software should read the VMX capability
>MSR IA32_VMX_EPT_VPID_CAP to determine whether execute-only
>translations are supported (see Appendix A.10)." A.10 indicates that
>this is specified by bit 0; if bit 0 is 1, then the processor supports
>execute-only transactions by EPT.
>
>Searching around a bit, it looks like this bit is checked by
>vmx/capabilities.h:cpu_has_vmx_ept_execute_only(), which is used only
>in kvm/vmx/vmx.c:vmx_hardware_setup(), passed as the has_exec_only
>argument to kvm_mmu_set_ept_masks(), which uses it to set
>shadow_present_mask.
>
>I'm not sure if this actually matters for this change(?), but thought
>it was at least worth surfacing for others to consider.

KVM needs to emulate the hardware behavior when walking guest EPT to report
EPT misconfigurations/violations accurately. IMO, the functions below should be
modified:

FNAME(is_present_gpte)
FNAME(is_bad_mt_xwr)
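
As a minimal sketch of the permission-bit rule quoted from the SDM above (and
not of KVM's actual FNAME(is_bad_mt_xwr), which also covers memory types), the
check could look like the following. The helper name and the EPT_* macro names
are made up for illustration; only the bit positions come from the quoted text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EPT entry permission bits, per the SDM text quoted in the changelog. */
#define EPT_R	(1ull << 0)	/* data reads allowed */
#define EPT_W	(1ull << 1)	/* data writes allowed */
#define EPT_X	(1ull << 2)	/* fetches allowed (supervisor-mode only with MBEC) */
#define EPT_UX	(1ull << 10)	/* user-mode fetches allowed, meaningful only with MBEC */

static_assert((EPT_R | EPT_W | EPT_X) == 0x7, "RWX must occupy bits 2:0");

/*
 * Hypothetical helper, not a KVM function: returns true if the permission
 * bits of an EPT entry form a misconfiguration under the quoted rule.
 * Memory-type and reserved-bit checks are deliberately left out.
 */
static bool ept_perm_bits_misconfigured(uint64_t epte, bool has_exec_only,
					bool mbec)
{
	if (epte & EPT_R)
		return false;		/* readable entries pass this rule */
	if (epte & EPT_W)
		return true;		/* writable but not readable */
	if (has_exec_only)
		return false;		/* execute-only is legal here */
	if (epte & EPT_X)
		return true;		/* exec without read, no exec-only support */
	return mbec && (epte & EPT_UX);	/* same, for the user-exec bit */
}
```

Note that with MBEC disabled, an entry with only bit 10 set is not flagged,
because the rule only considers bit 10 when the control is 1.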

>
>Signed-off-by: Jon Kohler <jon@nutanix.com>
>
>---
> arch/x86/include/asm/vmx.h |  6 ++++--
> arch/x86/kvm/mmu/spte.h    | 13 +++++++------
> 2 files changed, 11 insertions(+), 8 deletions(-)
>
>diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>index 84c5be416f5c..961d37e108b5 100644
>--- a/arch/x86/include/asm/vmx.h
>+++ b/arch/x86/include/asm/vmx.h
>@@ -541,7 +541,8 @@ enum vmcs_field {
> #define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
> #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
> 						 VMX_EPT_WRITABLE_MASK |       \
>-						 VMX_EPT_EXECUTABLE_MASK)
>+						 VMX_EPT_EXECUTABLE_MASK |     \
>+						 VMX_EPT_USER_EXECUTABLE_MASK)
> #define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
> 
> static inline u8 vmx_eptp_page_walk_level(u64 eptp)
>@@ -558,7 +559,8 @@ static inline u8 vmx_eptp_page_walk_level(u64 eptp)
> 
> /* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
> #define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
>-						 VMX_EPT_EXECUTABLE_MASK)
>+						 VMX_EPT_EXECUTABLE_MASK |     \
>+						 VMX_EPT_USER_EXECUTABLE_MASK)

This change is not needed. Whether MBEC is enabled doesn't make
VMX_EPT_WRITABLE_MASK | VMX_EPT_EXECUTABLE_MASK a valid entry for EPT.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte to understand MBEC
  2025-03-13 20:36 ` [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte " Jon Kohler
@ 2025-04-23  6:16   ` Chao Gao
  2025-05-12 21:16   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Chao Gao @ 2025-04-23  6:16 UTC (permalink / raw)
  To: Jon Kohler
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Mickaël Salaün

>-static inline bool is_executable_pte(u64 spte)
>+static inline bool is_executable_pte(u64 spte, bool for_kernel_mode,
>+				     struct kvm_vcpu *vcpu)
> {
>-	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
>+	u64 x_mask = shadow_x_mask;
>+
>+	if (vcpu->arch.pt_guest_exec_control) {
>+		x_mask |= shadow_ux_mask;
>+		if (for_kernel_mode)
>+			x_mask &= ~VMX_EPT_USER_EXECUTABLE_MASK;
>+		else
>+			x_mask &= ~VMX_EPT_EXECUTABLE_MASK;
>+	}

Using VMX_EPT_* directly here looks weird. How about:

	u64 x_mask = shadow_x_mask;

	if (/* mbec enabled */ && !for_kernel_mode)
		x_mask = shadow_ux_mask;

	return (spte & (x_mask | shadow_nx_mask)) == x_mask;
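
For reference, a self-contained sketch of that suggestion, assuming the EPT
values for the shadow masks (bit 2 for X, bit 10 for user-X, no NX bit) and
assuming the "mbec enabled" condition is passed in as a plain bool. Where that
bool should really come from is exactly what is still under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EPT_X_BIT	(1ull << 2)	/* VMX_EPT_EXECUTABLE_MASK */
#define EPT_UX_BIT	(1ull << 10)	/* VMX_EPT_USER_EXECUTABLE_MASK */

static_assert((EPT_X_BIT & EPT_UX_BIT) == 0,
	      "supervisor and user exec bits must be distinct");

/* Stand-ins for KVM's globals, fixed to their EPT values for this sketch. */
static const uint64_t shadow_x_mask  = EPT_X_BIT;
static const uint64_t shadow_ux_mask = EPT_UX_BIT;
static const uint64_t shadow_nx_mask = 0;	/* EPT has no NX bit */

/* mbec_enabled is an assumed parameter replacing the placeholder above. */
static bool is_executable_pte(uint64_t spte, bool for_kernel_mode,
			      bool mbec_enabled)
{
	uint64_t x_mask = shadow_x_mask;

	/* With MBEC, user-mode fetches are governed by the user-exec bit. */
	if (mbec_enabled && !for_kernel_mode)
		x_mask = shadow_ux_mask;

	return (spte & (x_mask | shadow_nx_mask)) == x_mask;
}
```

Without MBEC the function degenerates to the existing single-bit check, so
callers that do not care about the mode see no behavior change.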

>+
>+	return (spte & (x_mask | shadow_nx_mask)) == x_mask;
> }

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (18 preceding siblings ...)
  2025-04-15  9:29 ` [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Mickaël Salaün
@ 2025-04-23 13:54 ` Adrian-Ken Rueegsegger
  2025-05-12 15:26   ` Jon Kohler
  2025-05-12 21:46 ` Sean Christopherson
  20 siblings, 1 reply; 62+ messages in thread
From: Adrian-Ken Rueegsegger @ 2025-04-23 13:54 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Mickaël Salaün, Tao Su,
	Xiaoyao Li, Zhao Liu, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, seanjc, pbonzini

Hi,

On 3/13/25 21:36, Jon Kohler wrote:

[snip]

> The semantics for EPT violation qualifications also change when MBEC
> is enabled, with bit 5 reflecting supervisor/kernel mode execute
> permissions and bit 6 reflecting user mode execute permissions.
> This ultimately serves to expose this feature to the L1 hypervisor,
> which consumes MBEC and informs the L2 partitions not to use the
> software MBEC by removing bit 14 in 0x40000004 EAX [4].

Should this say bit 13 of 0x40000004.EAX? According to the referenced 
docs [4]:

Bit 13: "Recommend using INT for MBEC system calls."

Bit 14: "Recommend a nested hypervisor using the enlightened VMCS 
interface. Also indicates that additional nested enlightenments may be 
available (see leaf 0x4000000A)."
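
To make the difference concrete, here is a small sketch that clears bit 13
while leaving bit 14 alone. The macro and function names are invented for this
example; only the bit positions and their quoted descriptions come from [4].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID 0x40000004.EAX bits as quoted above; the names are hypothetical. */
#define HV_RECOMMEND_INT_FOR_MBEC_SYSCALLS	(1u << 13)
#define HV_RECOMMEND_ENLIGHTENED_VMCS		(1u << 14)

static_assert(HV_RECOMMEND_INT_FOR_MBEC_SYSCALLS !=
	      HV_RECOMMEND_ENLIGHTENED_VMCS,
	      "the two recommendations are separate bits");

/* True if the hypervisor recommends INT for MBEC system calls. */
static bool hv_recommends_int_for_mbec_syscalls(uint32_t eax)
{
	return eax & HV_RECOMMEND_INT_FOR_MBEC_SYSCALLS;
}

/* Drop only the MBEC recommendation; the eVMCS recommendation is untouched. */
static uint32_t hv_clear_mbec_recommendation(uint32_t eax)
{
	return eax & ~HV_RECOMMEND_INT_FOR_MBEC_SYSCALLS;
}
```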

Regards,
Adrian

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-04-15 14:43   ` Sean Christopherson
@ 2025-05-12 15:26     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-12 15:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mickaël Salaün, Paolo Bonzini, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Alexander Grest,
	Nicolas Saenz Julienne, Madhavan T . Venkataraman, Tao Su,
	Xiaoyao Li, Zhao Liu



> On Apr 15, 2025, at 10:43 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Apr 15, 2025, Mickaël Salaün wrote:
>> Hi,
>> 
>> This series looks good, just some inlined questions.
>> 
>> Sean, Paolo, what do you think?
> 
> It's high up on my todo, but I've been swamped with non-upstream stuff for the
> last few weeks (and I'm not quite out of the woods), so I might not get to it
> this week.

Gentle ping on this series. I know you’ve been swamped; any
line of sight on getting out of the woods? I just got back from travel
myself, so I’m catching up on todos.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-04-23 13:54 ` Adrian-Ken Rueegsegger
@ 2025-05-12 15:26   ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-12 15:26 UTC (permalink / raw)
  To: Adrian-Ken Rueegsegger
  Cc: Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Mickaël Salaün, Tao Su,
	Xiaoyao Li, Zhao Liu, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com



> On Apr 23, 2025, at 9:54 AM, Adrian-Ken Rueegsegger <ken@codelabs.ch> wrote:
> 
> Hi,
> 
> On 3/13/25 21:36, Jon Kohler wrote:
> 
> [snip]
> 
>> The semantics for EPT violation qualifications also change when MBEC
>> is enabled, with bit 5 reflecting supervisor/kernel mode execute
>> permissions and bit 6 reflecting user mode execute permissions.
>> This ultimately serves to expose this feature to the L1 hypervisor,
>> which consumes MBEC and informs the L2 partitions not to use the
>> software MBEC by removing bit 14 in 0x40000004 EAX [4].
> 
> Should this say bit 13 of 0x40000004.EAX? According to the referenced docs [4]:
> 
> Bit 13: "Recommend using INT for MBEC system calls."
> 
> Bit 14: "Recommend a nested hypervisor using the enlightened VMCS interface. Also indicates that additional nested enlightenments may be available (see leaf 0x4000000A)."
> 
> Regards,
> Adrian

Yes, you are correct. I’ll fix this on the next go-around; thanks for
pointing that out.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC
  2025-03-13 20:36 ` [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC Jon Kohler
@ 2025-05-12 18:08   ` Sean Christopherson
  2025-05-13  2:18     ` Jon Kohler
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:08 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Add 'enable_pt_guest_exec_control' module parameter to x86 code, with
> default value false.

...

> +bool __read_mostly enable_pt_guest_exec_control;
> +EXPORT_SYMBOL_GPL(enable_pt_guest_exec_control);
> +module_param(enable_pt_guest_exec_control, bool, 0444);

The default value of a parameter doesn't prevent userspace from enabling the param.
I.e. the instant this patch lands, userspace can enable enable_pt_guest_exec_control,
which means MBEC needs to be 100% functional before this can be exposed to userspace.

The right way to do this is to simply omit the module param until KVM is ready to
let userspace enable the feature.

All that said, I don't see any reason to add a module param for this.  *KVM* isn't
using MBEC, the guest is using MBEC.  And unless host userspace is being extremely
careless with VMX MSRs, exposing MBEC to the guest will require additional VMM
enabling and/or user opt-in.

KVM provides module params to control features that KVM is using, generally when
there is no sane alternative to tell KVM not to use a particular feature, i.e.
when there is no other way for the user to disable a feature for testing/debug purposes.

Furthermore, how this series keys off the module param throughout KVM is completely
wrong.  The *only* input that ultimately matters is the control bit in vmcs12.
Whether or not KVM allows that bit to be set could be controlled by a module param,
but KVM shouldn't be looking at the module param outside of that particular check.

TL;DR: advertising and enabling MBEC should come along when KVM allows the bit to
       be set in vmcs12.
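
A rough sketch of that split, under the assumption that a single "may L1 set
the bit" flag exists somewhere (module param, VMX MSR configuration, or
otherwise). The struct and both helpers are hypothetical stand-ins and are far
simpler than KVM's real vmcs12 handling; bit 22 is the secondary execution
control named in patch 04.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Secondary execution control bit 22 (mode-based execute control for EPT). */
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	(1u << 22)

static_assert(SECONDARY_EXEC_MODE_BASED_EPT_EXEC == 0x400000u, "bit 22");

/* Hypothetical, heavily trimmed stand-in for KVM's struct vmcs12. */
struct vmcs12_sketch {
	uint32_t secondary_vm_exec_control;
};

/*
 * The one place a knob would be consulted: may L1 set the control at all?
 * mbec_exposed_to_l1 is an assumed flag, not an existing KVM variable.
 */
static bool nested_mbec_control_allowed(const struct vmcs12_sketch *vmcs12,
					bool mbec_exposed_to_l1)
{
	if (!(vmcs12->secondary_vm_exec_control &
	      SECONDARY_EXEC_MODE_BASED_EPT_EXEC))
		return true;

	return mbec_exposed_to_l1;
}

/* Everything else keys off vmcs12 only, never off the knob. */
static bool nested_mbec_enabled(const struct vmcs12_sketch *vmcs12)
{
	return vmcs12->secondary_vm_exec_control &
	       SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
}
```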

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper
  2025-03-13 20:36 ` [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper Jon Kohler
@ 2025-05-12 18:14   ` Sean Christopherson
  2025-05-13  2:17     ` Jon Kohler
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:14 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Mickaël Salaün

On Thu, Mar 13, 2025, Jon Kohler wrote:
> From: Mickaël Salaün <mic@digikod.net>
> 
> Add 'cpu_has_vmx_mbec' helper to determine whether the cpu based VMCS
> from hardware has Intel Mode Based Execution Control exposed, which is
> secondary execution control bit 22.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>

LOL, really?  There's a joke in here about how many SWEs it takes...

> ---
>  arch/x86/kvm/vmx/capabilities.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index cb6588238f46..f83592272920 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -253,6 +253,12 @@ static inline bool cpu_has_vmx_xsaves(void)
>  		SECONDARY_EXEC_ENABLE_XSAVES;
>  }
>  
> +static inline bool cpu_has_vmx_mbec(void)
> +{
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> +}

This absolutely doesn't warrant its own patch.  Introduce it whenever it's first
used/needed.

> +
>  static inline bool cpu_has_vmx_waitpkg(void)
>  {
>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch
  2025-03-13 20:36 ` [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch Jon Kohler
  2025-04-22  6:27   ` Chao Gao
@ 2025-05-12 18:15   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:15 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Add bool for pt_guest_exec_control to kvm_vcpu_arch, to be used for
> runtime checks for Intel Mode Based Execution Control (MBEC) and
> AMD Guest Mode Execute Control (GMET).
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/include/asm/kvm_host.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fd37dad38670..192233eb557a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -856,6 +856,8 @@ struct kvm_vcpu_arch {
>  	struct kvm_hypervisor_cpuid kvm_cpuid;
>  	bool is_amd_compatible;
>  
> +	bool pt_guest_exec_control;

Again, aside from the fact that putting this in kvm_vcpu_arch is wrong, this is not
worthy of a separate patch.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-03-13 20:36 ` [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic Jon Kohler
  2025-04-22  7:06   ` Chao Gao
@ 2025-05-12 18:23   ` Sean Christopherson
  2025-05-13  2:16     ` Jon Kohler
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:23 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Add logic to enable / disable Intel Mode Based Execution Control (MBEC)
> based on specific conditions.
> 
> MBEC depends on:
> - User space exposing secondary execution control bit 22
> - Extended Page Tables (EPT)
> - The KVM module parameter `enable_pt_guest_exec_control`
> 
> If any of these conditions are not met, MBEC will be disabled
> accordingly.

Why?  I know why, but I know why despite the changelog, not because of the
changelog.

> Store runtime enablement within `kvm_vcpu_arch.pt_guest_exec_control`.

Again, why?  If you actually tried to explain this, I think/hope you would realize
why it's wrong.

> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
>  arch/x86/kvm/vmx/vmx.h |  7 +++++++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7a98f03ef146..116910159a3f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  			return -EIO;
>  
>  		vmx_cap->ept = 0;
> +		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>  		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
>  	}
>  	if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
> @@ -4641,11 +4642,15 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
>  	if (!enable_ept) {
>  		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
> +		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>  		exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
>  		enable_unrestricted_guest = 0;
>  	}
>  	if (!enable_unrestricted_guest)
>  		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
> +	if (!enable_pt_guest_exec_control)
> +		exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;

This is wrong and unnecessary.  As mentioned earlier, the input that matters is
vmcs12.  This flag should *never* be set for vmcs01.

>  	if (kvm_pause_in_guest(vmx->vcpu.kvm))
>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>  	if (!kvm_vcpu_apicv_active(vcpu))
> @@ -4770,6 +4775,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  		if (vmx->ve_info)
>  			vmcs_write64(VE_INFORMATION_ADDRESS,
>  				     __pa(vmx->ve_info));
> +
> +		vmx->vcpu.arch.pt_guest_exec_control =
> +			enable_pt_guest_exec_control && vmx_has_mbec(vmx);

This should effectively be dead code, because vmx_has_mbec() should never be
true at vCPU creation.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
  2025-03-13 20:36 ` [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role Jon Kohler
@ 2025-05-12 18:32   ` Sean Christopherson
  2025-05-13  2:14     ` Jon Kohler
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:32 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Mickaël Salaün, Sergey Dyasli

On Thu, Mar 13, 2025, Jon Kohler wrote:
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 71d6fe28fafc..d9e22133b6d0 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -45,7 +45,9 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>  #define ACC_EXEC_MASK    1
>  #define ACC_WRITE_MASK   PT_WRITABLE_MASK
>  #define ACC_USER_MASK    PT_USER_MASK
> -#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> +#define ACC_USER_EXEC_MASK (1ULL << 3)
> +#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK | \
> +			  ACC_USER_EXEC_MASK)

This is very subtly a massive change, and I'm not convinced it's one we want to
make.  All usage in the non-nested TDP flows is arguably wrong, because KVM should
never enable MBEC when using non-nested TDP.

And the use in kvm_calc_shadow_ept_root_page_role() is wrong, because the root
page role shouldn't include ACC_USER_EXEC_MASK if the associated VMCS doesn't
have MBEC.  Ditto for the use in kvm_calc_cpu_role().

So I'm pretty sure the only bit of this change that is desirable/correct is the
usage in kvm_mmu_page_get_access().  (And I guess maybe trace_mark_mmio_spte()?)

Off the cuff, I don't know what the best approach is.  One thought would be to
prep for adding ACC_USER_EXEC_MASK to ACC_ALL by introducing ACC_RWX and using
that where KVM really just wants to set RWX permissions.  That would free up
ACC_ALL for the few cases where KVM really truly wants to capture all access bits.
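
One possible shape for that split, purely as a sketch: ACC_RWX is assumed here
to be the three bits that ACC_ALL covers today, and the selection helper is
hypothetical. The bit values mirror the defines in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

#define ACC_EXEC_MASK		1u
#define ACC_WRITE_MASK		(1u << 1)	/* PT_WRITABLE_MASK */
#define ACC_USER_MASK		(1u << 2)	/* PT_USER_MASK */
#define ACC_USER_EXEC_MASK	(1u << 3)	/* added by this patch */

/* Assumed definition: what existing "full access" users actually want. */
#define ACC_RWX	(ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
/* Every access bit, for the few callers that need to capture all of them. */
#define ACC_ALL	(ACC_RWX | ACC_USER_EXEC_MASK)

static_assert(ACC_RWX == 0x7, "ACC_RWX keeps the pre-MBEC value of ACC_ALL");
static_assert(ACC_ALL == 0xf, "ACC_ALL additionally covers user-exec");

/* Hypothetical helper: full access for an MMU with or without MBEC. */
static unsigned int acc_full_access(bool mmu_has_mbec)
{
	return mmu_has_mbec ? ACC_ALL : ACC_RWX;
}
```

With this layout, non-nested TDP flows would keep using the 0x7 value and
never see ACC_USER_EXEC_MASK.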

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits
  2025-03-13 20:36 ` [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits Jon Kohler
@ 2025-05-12 18:37   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:37 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Define macros for READ, WRITE, EXEC protection bits, to be used by
> MBEC-enabled systems.
> 
> No functional change intended.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/include/asm/vmx.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index d7ab0ad63be6..ffc90d672b5d 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -593,8 +593,17 @@ enum vm_entry_failure_code {
>  #define EPT_VIOLATION_GVA_IS_VALID	BIT(7)
>  #define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)
>  
> +#define EPT_VIOLATION_READ_TO_PROT(__epte) (((__epte) & VMX_EPT_READABLE_MASK) << 3)
> +#define EPT_VIOLATION_WRITE_TO_PROT(__epte) (((__epte) & VMX_EPT_WRITABLE_MASK) << 3)
> +#define EPT_VIOLATION_EXEC_TO_PROT(__epte) (((__epte) & VMX_EPT_EXECUTABLE_MASK) << 3)
>  #define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)
>  
> +static_assert(EPT_VIOLATION_READ_TO_PROT(VMX_EPT_READABLE_MASK) ==
> +	      (EPT_VIOLATION_PROT_READ));
> +static_assert(EPT_VIOLATION_WRITE_TO_PROT(VMX_EPT_WRITABLE_MASK) ==
> +	      (EPT_VIOLATION_PROT_WRITE));
> +static_assert(EPT_VIOLATION_EXEC_TO_PROT(VMX_EPT_EXECUTABLE_MASK) ==
> +	      (EPT_VIOLATION_PROT_EXEC));

Again, as a general rule, introduce macros and helper functions when they are
first used, not as tiny prep patches.  There are exceptions to that rule, e.g. to
avoid cyclical dependencies or to isolate arch/vendor changes, but none of those
exceptions apply in this series.

Patches like this are effectively impossible to review from a design/intent
perspective, because without peeking at the usage that comes along later, there's
no way to determine whether or not it makes sense to add these macros.

And looking ahead, I don't see any reason to slice n' dice the RWX=>prot macro.

TL;DR: drop this patch.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC
  2025-03-13 20:36 ` [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC Jon Kohler
@ 2025-05-12 18:54   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 18:54 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Mickaël Salaün

On Thu, Mar 13, 2025, Jon Kohler wrote:
> From: Mickaël Salaün <mic@digikod.net>
> 
> Add EPT_VIOLATION_PROT_USER_EXEC (6) to reflect the user executable
> permissions of a given address when Intel MBEC is enabled.
> 
> Refactor usage of EPT_VIOLATION_RWX_TO_PROT to understand all of the
> specific bits that are now possible with MBEC.
> 
> Intel SDM 'Exit Qualification for EPT Violations' states the following
> for Bit 6.
>   If the “mode-based execute control” VM-execution control is 0, the
>   value of this bit is undefined. If that control is 1, this bit is
>   the logical-AND of bit 10 in the EPT paging-structure entries used
>   to translate the guest-physical address of the access causing the
>   EPT violation. In this case, it indicates whether the guest-physical
>   address was executable for user-mode linear addresses.
> 
>   Bit 6 is cleared to 0 if (1) the “mode-based execute control”
>   VM-execution control is 1; and (2) either (a) any of EPT
>   paging-structure entries used to translate the guest-physical address
>   of the access causing the EPT violation is not present; or
>   (b) 4-level EPT is in use and the guest-physical address sets any
>   bits in the range 51:48.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/include/asm/vmx.h     |  7 ++++---
>  arch/x86/kvm/mmu/paging_tmpl.h | 15 ++++++++++++---
>  arch/x86/kvm/vmx/vmx.c         |  7 +++++--
>  3 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index ffc90d672b5d..84c5be416f5c 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -587,6 +587,7 @@ enum vm_entry_failure_code {
>  #define EPT_VIOLATION_PROT_READ		BIT(3)
>  #define EPT_VIOLATION_PROT_WRITE	BIT(4)
>  #define EPT_VIOLATION_PROT_EXEC		BIT(5)
> +#define EPT_VIOLATION_PROT_USER_EXEC	BIT(6)

Ugh, TDX added this as EPT_VIOLATION_EXEC_FOR_RING3_LIN (apparently the TDX module
enables MBEC?).  I like your name a lot better.

>  #define EPT_VIOLATION_PROT_MASK		(EPT_VIOLATION_PROT_READ  | \
>  					 EPT_VIOLATION_PROT_WRITE | \
>  					 EPT_VIOLATION_PROT_EXEC)

Hmm, so I think EPT_VIOLATION_PROT_MASK should include EPT_VIOLATION_PROT_USER_EXEC.
The existing TDX change does not, because unfortunately the bit is undefined if
MBEC is unsupported, but that's easy to solve by unconditionally clearing the bit
in handle_ept_violation().  And then when nested-EPT MBEC support comes along,
handle_ept_violation() can be modified to conditionally clear the bit based on
whether or not the current MMU supports MBEC.

I'll post a patch to include the bit in EPT_VIOLATION_PROT_MASK, and opportunistically
change the name.
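
A sketch of the end state described above, i.e. the later "conditionally
clear" variant. The helper name and the mmu_has_mbec parameter are assumptions
made for this example; the bit definitions match the quoted vmx.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n)	(1ul << (n))

#define EPT_VIOLATION_PROT_READ		BIT(3)
#define EPT_VIOLATION_PROT_WRITE	BIT(4)
#define EPT_VIOLATION_PROT_EXEC		BIT(5)
#define EPT_VIOLATION_PROT_USER_EXEC	BIT(6)
/* Proposed: the mask includes the user-exec bit. */
#define EPT_VIOLATION_PROT_MASK		(EPT_VIOLATION_PROT_READ  | \
					 EPT_VIOLATION_PROT_WRITE | \
					 EPT_VIOLATION_PROT_EXEC  | \
					 EPT_VIOLATION_PROT_USER_EXEC)

static_assert(EPT_VIOLATION_PROT_MASK == 0x78ul, "bits 6:3");

/* Hypothetical helper: was the faulting EPT entry present? */
static bool ept_violation_entry_present(unsigned long exit_qualification,
					bool mmu_has_mbec)
{
	/* Bit 6 is undefined when MBEC is off, so drop it before use. */
	if (!mmu_has_mbec)
		exit_qualification &= ~EPT_VIOLATION_PROT_USER_EXEC;

	return exit_qualification & EPT_VIOLATION_PROT_MASK;
}
```

Until nested-EPT MBEC support exists, the same helper with mmu_has_mbec
hardwired to false gives the "unconditionally clear" behavior.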

> @@ -596,7 +597,7 @@ enum vm_entry_failure_code {
>  #define EPT_VIOLATION_READ_TO_PROT(__epte) (((__epte) & VMX_EPT_READABLE_MASK) << 3)
>  #define EPT_VIOLATION_WRITE_TO_PROT(__epte) (((__epte) & VMX_EPT_WRITABLE_MASK) << 3)
>  #define EPT_VIOLATION_EXEC_TO_PROT(__epte) (((__epte) & VMX_EPT_EXECUTABLE_MASK) << 3)
> -#define EPT_VIOLATION_RWX_TO_PROT(__epte) (((__epte) & VMX_EPT_RWX_MASK) << 3)

Why?  There's no escaping the fact that EXEC, a.k.a. X, is doing double duty as
"exec for all" and "kernel exec".  And KVM has nearly two decades of history
using EXEC/X to refer to "exec for all".  I see no reason to throw all of that
away and discard the intuitive and pervasive RWX logic.

> @@ -510,7 +511,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		 * Note, pte_access holds the raw RWX bits from the EPTE, not
>  		 * ACC_*_MASK flags!
>  		 */
> -		walker->fault.exit_qualification |= EPT_VIOLATION_RWX_TO_PROT(pte_access);
> +		walker->fault.exit_qualification |=
> +			EPT_VIOLATION_READ_TO_PROT(pte_access);
> +		walker->fault.exit_qualification |=
> +			EPT_VIOLATION_WRITE_TO_PROT(pte_access);
> +		walker->fault.exit_qualification |=
> +			EPT_VIOLATION_EXEC_TO_PROT(pte_access);

IMO, this is a big net negative.  I much prefer the existing code, as it highlights
that USER_EXEC is the oddball.
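
As an illustration of keeping the RWX conversion intact and treating USER_EXEC
as the one extra case, consider the sketch below. The ">> 4" shift (EPTE bit 10
to exit-qualification bit 6) is derived from the bit numbers in this patch and
is an assumption about how EPT_VIOLATION_USER_EXEC_TO_PROT is defined; the
helper and its mmu_has_mbec parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMX_EPT_READABLE_MASK		0x1ull
#define VMX_EPT_WRITABLE_MASK		0x2ull
#define VMX_EPT_EXECUTABLE_MASK		0x4ull
#define VMX_EPT_USER_EXECUTABLE_MASK	(1ull << 10)
/* Unchanged RWX mask: bit 10 is intentionally not part of it here. */
#define VMX_EPT_RWX_MASK		(VMX_EPT_READABLE_MASK | \
					 VMX_EPT_WRITABLE_MASK | \
					 VMX_EPT_EXECUTABLE_MASK)

/* EPTE bits 2:0 map to exit-qualification bits 5:3. */
#define EPT_VIOLATION_RWX_TO_PROT(__epte) \
	(((__epte) & VMX_EPT_RWX_MASK) << 3)
/* Assumed definition: EPTE bit 10 maps to exit-qualification bit 6. */
#define EPT_VIOLATION_USER_EXEC_TO_PROT(__epte) \
	(((__epte) & VMX_EPT_USER_EXECUTABLE_MASK) >> 4)

static_assert(EPT_VIOLATION_RWX_TO_PROT(VMX_EPT_RWX_MASK) == 0x38ull,
	      "RWX lands in bits 5:3");
static_assert(EPT_VIOLATION_USER_EXEC_TO_PROT(VMX_EPT_USER_EXECUTABLE_MASK) ==
	      (1ull << 6), "user-exec lands in bit 6");

/* Hypothetical helper: protection bits for a synthesized exit qualification. */
static uint64_t ept_violation_prot_bits(uint64_t pte_access, bool mmu_has_mbec)
{
	uint64_t prot = EPT_VIOLATION_RWX_TO_PROT(pte_access);

	/* The oddball: only reported when the MMU actually has MBEC. */
	if (mmu_has_mbec)
		prot |= EPT_VIOLATION_USER_EXEC_TO_PROT(pte_access);

	return prot;
}
```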

> +		if (vcpu->arch.pt_guest_exec_control)

This is wrong on multiple fronts.  As mentioned earlier in the series, this is a
property of the MMU (more specifically, the root role), not of the vCPU.

And consulting MBEC support *only* when synthesizing the exit qualification is
wrong, because it means pte_access contains bogus data when consumed by
FNAME(gpte_access).  At a glance, FNAME(gpte_access) probably needs to be modified
to take in the page role, e.g. like FNAME(sync_spte) and FNAME(prefetch_gpte)
already adjust the access based on the owning shadow page's access mask.

> +			walker->fault.exit_qualification |=
> +				EPT_VIOLATION_USER_EXEC_TO_PROT(pte_access);
>  	}
>  #endif
>  	walker->fault.address = addr;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 116910159a3f..0aadfa924045 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5809,7 +5809,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
>  
>  static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  {
> -	unsigned long exit_qualification;
> +	unsigned long exit_qualification, rwx_mask;
>  	gpa_t gpa;
>  	u64 error_code;
>  
> @@ -5839,7 +5839,10 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
>  		      ? PFERR_FETCH_MASK : 0;
>  	/* ept page table entry is present? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_PROT_MASK)
> +	rwx_mask = EPT_VIOLATION_PROT_MASK;
> +	if (vcpu->arch.pt_guest_exec_control)
> +		rwx_mask |= EPT_VIOLATION_PROT_USER_EXEC;
> +	error_code |= (exit_qualification & rwx_mask)
>  		      ? PFERR_PRESENT_MASK : 0;

As mentioned above, if KVM clears EPT_VIOLATION_PROT_USER_EXEC when it's
undefined, then this can simply use EPT_VIOLATION_PROT_MASK unchanged.
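
A minimal user-space sketch of that idea, under stated assumptions: the bit
positions follow the SDM's EPT-violation exit qualification layout (bits 5:3 for
R/W/X, bit 6 for user-execute), the MODEL_* names and both helpers are
hypothetical, and the "present" mask is assumed to be defined to cover the
user-execute bit as well.  The point is only that sanitizing once, up front,
removes the need for per-call mask adjustments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exit-qualification permission bits; MODEL_* names are illustrative. */
#define MODEL_PROT_READ		(UINT64_C(1) << 3)
#define MODEL_PROT_WRITE	(UINT64_C(1) << 4)
#define MODEL_PROT_EXEC		(UINT64_C(1) << 5)
#define MODEL_PROT_USER_EXEC	(UINT64_C(1) << 6)	/* undefined without MBEC */

/* Assumption: the "present" mask covers all four permission bits. */
#define MODEL_PROT_MASK	(MODEL_PROT_READ | MODEL_PROT_WRITE | \
			 MODEL_PROT_EXEC | MODEL_PROT_USER_EXEC)

/* Clear the user-exec bit once when hardware leaves it undefined. */
static uint64_t model_sanitize_qual(uint64_t qual, bool mbec_enabled)
{
	return mbec_enabled ? qual : (qual & ~MODEL_PROT_USER_EXEC);
}

/* With a sanitized qualification, the mask needs no runtime tweaking. */
static bool model_epte_present(uint64_t sanitized_qual)
{
	return (sanitized_qual & MODEL_PROT_MASK) != 0;
}
```

In this model a stray bit 6 reported without MBEC is never mistaken for a
present EPTE, while a user-execute-only mapping with MBEC enabled still is.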

>  
>  	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask
  2025-03-13 20:36 ` [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask Jon Kohler
  2025-04-23  3:06   ` Chao Gao
@ 2025-05-12 19:13   ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 19:13 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Sergey Dyasli

On Thu, Mar 13, 2025, Jon Kohler wrote:
> @@ -28,6 +28,7 @@ u64 __read_mostly shadow_host_writable_mask;
>  u64 __read_mostly shadow_mmu_writable_mask;
>  u64 __read_mostly shadow_nx_mask;
>  u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> +u64 __read_mostly shadow_ux_mask;
>  u64 __read_mostly shadow_user_mask;
>  u64 __read_mostly shadow_accessed_mask;
>  u64 __read_mostly shadow_dirty_mask;
> @@ -313,8 +314,14 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
>  		 * the page executable as the NX hugepage mitigation no longer
>  		 * applies.
>  		 */
> -		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm))
> +		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm)) {

This is wrong, and probably so is every other chunk of KVM that looks at
ACC_EXEC_MASK.  E.g. if a guest hugepage is executable for user but not supervisor,
KVM will fail to make the small child user-executable.

The bug in make_spte() is even worse, because KVM would let an MBEC-aware guest
trigger the iTLB multi-hit #MC.
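
To make the failure mode concrete, here is a small stand-alone sketch (not code
from the series).  The MODEL_ACC_* values and both helper names are
hypothetical; only the shape of the check matters.  A hugepage that is
executable for user but not supervisor carries only the user-exec access bit,
so a test of the supervisor bit alone never sees it.

```c
#include <stdbool.h>

#define MODEL_ACC_EXEC		0x1u	/* supervisor (or only) execute */
#define MODEL_ACC_USER_EXEC	0x8u	/* hypothetical value for the new bit */

/* The check as posted: looks only at the supervisor-execute bit. */
static bool model_wants_exec_posted(unsigned int access)
{
	return (access & MODEL_ACC_EXEC) != 0;
}

/* A check that treats either execute permission as "executable". */
static bool model_wants_exec_either(unsigned int access)
{
	return (access & (MODEL_ACC_EXEC | MODEL_ACC_USER_EXEC)) != 0;
}
```

For access == MODEL_ACC_USER_EXEC the first helper reports "not executable",
which is the case where the NX hugepage handling would be skipped.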

>  			child_spte = make_spte_executable(child_spte);
> +			// TODO: For LKML: switch to vcpu->arch.pt_guest_exec_control? up
> +			// for suggestions on how best to toggle this.

No, it belongs in the role.

> +			if (enable_pt_guest_exec_control &&
> +			    role.access & ACC_USER_EXEC_MASK)
> +				child_spte |= shadow_ux_mask;
> +		}
>  	}
>  
>  	return child_spte;
> @@ -326,7 +333,7 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
>  
>  	spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
> -		shadow_user_mask | shadow_x_mask | shadow_me_value;
> +		shadow_user_mask | shadow_x_mask | shadow_ux_mask | shadow_me_value;
>  
>  	if (ad_disabled)
>  		spte |= SPTE_TDP_AD_DISABLED;
> @@ -420,7 +427,8 @@ void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);
>  
> -void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> +void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only,
> +			   bool has_guest_exec_ctrl)
>  {
>  	shadow_user_mask	= VMX_EPT_READABLE_MASK;
>  	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
> @@ -428,8 +436,14 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
>  	shadow_nx_mask		= 0ull;
>  	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
>  	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
> +	// For LKML Review:
> +	// Do we need to modify shadow_present_mask in the MBEC case?

No, because MBEC bifurcates X, it doesn't change whether or not an EPTE can be
X without being R.  From the SDM:

  1. If the “mode-based execute control for EPT” VM-execution control is 1,
     setting bit 0 indicates also that software may also configure EPT
     paging-structure entries in which bits 1:0 are both clear and in which bit 10
     is set (indicating a translation that can be used to fetch instructions from a
     supervisor-mode linear address or a user-mode linear address).

>  	shadow_present_mask	=
>  		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
> +
> +	shadow_ux_mask		=
> +		has_guest_exec_ctrl ? VMX_EPT_USER_EXECUTABLE_MASK : 0ull;

This is EPT specific code, just call this what it is:

	shadow_ux_mask		= has_mbec ? VMX_EPT_USER_EXECUTABLE_MASK : 0ull;
> +
>  	/*
>  	 * EPT overrides the host MTRRs, and so KVM must program the desired
>  	 * memtype directly into the SPTEs.  Note, this mask is just the mask
> @@ -484,6 +498,7 @@ void kvm_mmu_reset_all_pte_masks(void)
>  	shadow_dirty_mask	= PT_DIRTY_MASK;
>  	shadow_nx_mask		= PT64_NX_MASK;
>  	shadow_x_mask		= 0;
> +	shadow_ux_mask		= 0;
>  	shadow_present_mask	= PT_PRESENT_MASK;
>  
>  	/*
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index d9e22133b6d0..dc2f0dc9c46e 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -171,6 +171,7 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
>  extern u64 __read_mostly shadow_nx_mask;
>  extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
>  extern u64 __read_mostly shadow_user_mask;
> +extern u64 __read_mostly shadow_ux_mask;
>  extern u64 __read_mostly shadow_accessed_mask;
>  extern u64 __read_mostly shadow_dirty_mask;
>  extern u64 __read_mostly shadow_mmio_value;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0aadfa924045..d16e3f170258 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8544,7 +8544,8 @@ __init int vmx_hardware_setup(void)
>  
>  	if (enable_ept)
>  		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> -				      cpu_has_vmx_ept_execute_only());
> +				      cpu_has_vmx_ept_execute_only(),
> +				      enable_pt_guest_exec_control);

Without the module param, just cpu_has_vmx_mbec().

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
  2025-03-13 20:36 ` [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC Jon Kohler
  2025-04-23  5:37   ` Chao Gao
@ 2025-05-12 19:37   ` Sean Christopherson
  2025-05-13  2:11     ` Jon Kohler
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 19:37 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

Please be more precise with the shortlogs.  "Understand MBEC" is extremely vague.

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Adjust the SPTE_MMIO_ALLOWED_MASK and associated values to make these
> masks aware of PTE Bit 10, to be used by Intel MBEC.

Same thing here.  "aware of PTE bit 10" doesn't describe the change in a way that
allows for quick review of the patch.  E.g. 

  KVM: x86/mmu: Exclude EPT MBEC's user-executable bit from the MMIO generation

The changelogs also need to explain *why*.  If you actually tried to write out the
justification for why KVM can't use bit 10 for the MMIO generation, then unless
you start making stuff up (or Chao and I are missing something), you'll come to the
same conclusion that Chao and I came to: this patch is unnecessary.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte to understand MBEC
  2025-03-13 20:36 ` [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte " Jon Kohler
  2025-04-23  6:16   ` Chao Gao
@ 2025-05-12 21:16   ` Sean Christopherson
  2025-05-13  2:09     ` Jon Kohler
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 21:16 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Mickaël Salaün

On Thu, Mar 13, 2025, Jon Kohler wrote:
> @@ -359,15 +360,17 @@ TRACE_EVENT(
>  		__entry->sptep = virt_to_phys(sptep);
>  		__entry->level = level;
>  		__entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK);
> -		__entry->x = is_executable_pte(__entry->spte);
> +		__entry->kx = is_executable_pte(__entry->spte, true, vcpu);
> +		__entry->ux = is_executable_pte(__entry->spte, false, vcpu);
>  		__entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1;
>  	),
>  
> -	TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx",
> +	TP_printk("gfn %llx spte %llx (%s%s%s%s%s) level %d at %llx",
>  		  __entry->gfn, __entry->spte,
>  		  __entry->r ? "r" : "-",
>  		  __entry->spte & PT_WRITABLE_MASK ? "w" : "-",
> -		  __entry->x ? "x" : "-",
> +		  __entry->kx ? "X" : "-",
> +		  __entry->ux ? "x" : "-",

I don't have a better idea, but I do worry that X vs. x will lead to confusion.
But as I said, I don't have a better idea...

>  		  __entry->u == -1 ? "" : (__entry->u ? "u" : "-"),
>  		  __entry->level, __entry->sptep
>  	)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1f7b388a56aa..fd7e29a0a567 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -346,9 +346,20 @@ static inline bool is_last_spte(u64 pte, int level)
>  	return (level == PG_LEVEL_4K) || is_large_pte(pte);
>  }
>  
> -static inline bool is_executable_pte(u64 spte)
> +static inline bool is_executable_pte(u64 spte, bool for_kernel_mode,

s/for_kernel_mode/is_user_access and invert.  A handful of KVM comments describe
supervisor as "kernel mode", but those are quite old and IMO unnecessarily imprecise.

> +				     struct kvm_vcpu *vcpu)

This needs to be an mmu (or maybe a root role?).  Hmm, thinking about the page
role, I don't think one new bit will suffice.  Simply adding ACC_USER_EXEC_MASK
won't let KVM differentiate between shadow pages created with ACC_EXEC_MASK for
an MMU without MBEC, and a page created explicitly without ACC_USER_EXEC_MASK
for an MMU *with* MBEC.

What I'm not sure about is if MBEC/GMET support needs to be captured in the base
page role, or if shoving it in kvm_mmu_extended_role will suffice.  I'll think
more on this and report back, need to refresh all the shadow paging stuff, again...
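
A toy illustration of the ambiguity, with everything in it hypothetical (the
struct, the has_mbec field, and the MODEL_ACC_* values are stand-ins, not KVM's
actual role layout): two pages with identical access bits are indistinguishable
unless the role also records whether the MMU was built with MBEC.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ACC_EXEC		0x1u
#define MODEL_ACC_USER_EXEC	0x8u	/* hypothetical value */

struct model_role {
	uint8_t access;		/* ACC_*-style permission bits */
	bool	has_mbec;	/* hypothetical: MMU uses mode-based exec */
};

/* Comparing access bits alone cannot tell the two MMU flavors apart. */
static bool model_role_equal_access_only(struct model_role a,
					 struct model_role b)
{
	return a.access == b.access;
}

/* Including the mode bit keeps the two kinds of shadow page distinct. */
static bool model_role_equal(struct model_role a, struct model_role b)
{
	return a.access == b.access && a.has_mbec == b.has_mbec;
}
```

Whether such a bit would live in the base role or the extended role is exactly
the open question above; the sketch takes no position on that.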

>  {
> -	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
> +	u64 x_mask = shadow_x_mask;
> +
> +	if (vcpu->arch.pt_guest_exec_control) {
> +		x_mask |= shadow_ux_mask;
> +		if (for_kernel_mode)
> +			x_mask &= ~VMX_EPT_USER_EXECUTABLE_MASK;
> +		else
> +			x_mask &= ~VMX_EPT_EXECUTABLE_MASK;
> +	}

This is going to get messy when GMET support comes along, because the U/S bit
would need to be inverted to do the right thing for supervisor fetches.  Rather
than trying to shoehorn support into the existing code, I think we should prep
for GMET and make the code a wee bit easier to follow in the process.  We can
even implement the actual GMET semantics, but guarded with a WARN (emulating
GMET isn't a terrible fallback in the event of a KVM bug).

	if (spte & shadow_nx_mask)
		return false;

	if (!role.has_mode_based_exec)
		return (spte & shadow_x_mask) == shadow_x_mask;

	if (WARN_ON_ONCE(!shadow_x_mask))
		return is_user_access || !(spte & shadow_user_mask);

	return spte & (is_user_access ? shadow_ux_mask : shadow_x_mask);
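
The same logic as a stand-alone user-space model, for anyone who wants to poke
at the truth table.  Assumptions: the masks are passed in a hypothetical struct
instead of KVM's globals, has_mbec stands in for the role bit, the EPT bit
positions (bit 2 for X, bit 10 for user-X) are taken from the SDM, and the
GMET/WARN fallback is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EPT_X	(UINT64_C(1) << 2)	/* supervisor (or only) execute */
#define MODEL_EPT_UX	(UINT64_C(1) << 10)	/* user execute, MBEC only */

/* Stand-in for shadow_nx_mask, shadow_x_mask and shadow_ux_mask. */
struct model_masks {
	uint64_t nx, x, ux;
};

static bool model_is_executable(const struct model_masks *m, uint64_t spte,
				bool has_mbec, bool is_user_access)
{
	/* NX (legacy paging) always wins. */
	if (spte & m->nx)
		return false;

	/* Without mode-based exec there is a single X permission. */
	if (!has_mbec)
		return (spte & m->x) == m->x;

	/* With MBEC, pick the bit that matches the access mode. */
	return (spte & (is_user_access ? m->ux : m->x)) != 0;
}
```

With EPT-style masks, an SPTE carrying only bit 10 is executable for a user
access and not for a supervisor access, and vice versa for bit 2.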

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte to understand MBEC
  2025-03-13 20:36 ` [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte " Jon Kohler
@ 2025-05-12 21:29   ` Sean Christopherson
  2025-05-13  2:04     ` Jon Kohler
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 21:29 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Sergey Dyasli

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Extend make_spte to mask in and out bits depending on MBEC enablement.

Same complaints about the shortlog and changelog not saying anything useful.

> 
> Note: For the RFC/v1 series, I've added several 'For Review' items that
> may require a bit deeper inspection, as well as some long winded
> comments/annotations. These will be cleaned up for the next iteration
> of the series after initial review.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Co-developed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
> Signed-off-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
> 
> ---
>  arch/x86/kvm/mmu/paging_tmpl.h |  3 +++
>  arch/x86/kvm/mmu/spte.c        | 30 ++++++++++++++++++++++++++----
>  2 files changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index a3a5cacda614..7675239f2dd1 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -840,6 +840,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		 * then we should prevent the kernel from executing it
>  		 * if SMEP is enabled.
>  		 */
> +		// FOR REVIEW:
> +		// ACC_USER_EXEC_MASK seems not necessary to add here since
> +		// SMEP is for kernel-only.
>  		if (is_cr4_smep(vcpu->arch.mmu))
>  			walker.pte_access &= ~ACC_EXEC_MASK;

I would straight up WARN, because it should be impossible to reach this code with
ACC_USER_EXEC_MASK set.  In fact, this entire blob of code should be #ifdef'd
out for PTTYPE_EPT.  AFAICT, the only reason it doesn't break nEPT is because
it's impossible to have a WRITE EPT violation without READ (a.k.a. USER) being
set.

>  	}
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 6f4994b3e6d0..89bdae3f9ada 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -178,6 +178,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	else if (kvm_mmu_page_ad_need_write_protect(sp))
>  		spte |= SPTE_TDP_AD_WRPROT_ONLY;
>  
> +	// For LKML Review:
> +	// In MBEC case, you can have exec only and also bit 10
> +	// set for user exec only. Do we need to cater for that here?
>  	spte |= shadow_present_mask;
>  	if (!prefetch)
>  		spte |= spte_shadow_accessed_mask(spte);
> @@ -197,12 +200,31 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&

Needs to check ACC_USER_EXEC_MASK.

>  	    is_nx_huge_page_enabled(vcpu->kvm)) {
>  		pte_access &= ~ACC_EXEC_MASK;
> +		if (vcpu->arch.pt_guest_exec_control)
> +			pte_access &= ~ACC_USER_EXEC_MASK;
>  	}
>  
> -	if (pte_access & ACC_EXEC_MASK)
> -		spte |= shadow_x_mask;
> -	else
> -		spte |= shadow_nx_mask;
> +	// For LKML Review:
> +	// We could probably optimize the logic here, but typing it out
> +	// long hand for now to make it clear how we're changing the control
> +	// flow to support MBEC.

I appreciate the effort, but this did far more harm than good.  Reviewing code
that has zero chance of being the end product is a waste of time.  And unless I'm
overlooking a subtlety, you're making this way harder than it needs to be:

	if (pte_access & (ACC_EXEC_MASK | ACC_USER_EXEC_MASK)) {
		if (pte_access & ACC_EXEC_MASK)
			spte |= shadow_x_mask;

		if (pte_access & ACC_USER_EXEC_MASK)
			spte |= shadow_ux_mask;
	} else {
		spte |= shadow_nx_mask;
	}

KVM needs to ensure ACC_USER_EXEC_MASK isn't spuriously set, but KVM should be
doing that anyways.
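
For reference, a user-space model of the snippet above that can be compiled and
checked against the four KMX/UMX combinations.  The MODEL_ACC_* values, the
masks struct, and the function name are hypothetical; the mask values in the
usage note assume EPT (X = bit 2, user-X = bit 10, no NX bit) and legacy paging
(NX = bit 63, no X bits).

```c
#include <stdint.h>

#define MODEL_ACC_EXEC		0x1u
#define MODEL_ACC_USER_EXEC	0x8u	/* hypothetical value */

/* Stand-in for shadow_x_mask, shadow_ux_mask and shadow_nx_mask. */
struct model_masks {
	uint64_t x, ux, nx;
};

/* Returns only the execute-related SPTE bits for the given access. */
static uint64_t model_exec_bits(const struct model_masks *m,
				unsigned int pte_access)
{
	uint64_t spte = 0;

	if (pte_access & (MODEL_ACC_EXEC | MODEL_ACC_USER_EXEC)) {
		if (pte_access & MODEL_ACC_EXEC)
			spte |= m->x;

		if (pte_access & MODEL_ACC_USER_EXEC)
			spte |= m->ux;
	} else {
		spte |= m->nx;
	}

	return spte;
}
```

Note what happens with legacy masks if the user-exec access bit is set on its
own: neither X nor NX ends up in the SPTE, which is why the access bit must
never be spuriously set for an MMU without MBEC.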

> +	if (!vcpu->arch.pt_guest_exec_control) { // non-mbec logic
> +		if (pte_access & ACC_EXEC_MASK)
> +			spte |= shadow_x_mask;
> +		else
> +			spte |= shadow_nx_mask;
> +	} else { // mbec logic
> +		if (pte_access & ACC_EXEC_MASK) { /* mbec: kernel exec */
> +			if (pte_access & ACC_USER_EXEC_MASK)
> +				spte |= shadow_x_mask | shadow_ux_mask; // KMX = 1, UMX = 1
> +			else
> +				spte |= shadow_x_mask;  // KMX = 1, UMX = 0
> +		} else if (pte_access & ACC_USER_EXEC_MASK) { /* mbec: user exec, no kernel exec */
> +			spte |= shadow_ux_mask; // KMX = 0, UMX = 1
> +		} else { /* mbec: nx */
> +			spte |= shadow_nx_mask; // KMX = 0, UMX = 0
> +		}
> +	}
>  
>  	if (pte_access & ACC_USER_MASK)
>  		spte |= shadow_user_mask;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls
  2025-03-13 20:36 ` [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls Jon Kohler
@ 2025-05-12 21:32   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 21:32 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Setup Intel Mode Based Execution Control (bit 22) for nested
> guest, gated on module parameter enablement.

*This* is the enablement patch.  And it's not doing "Setup", it's advertising
SECONDARY_EXEC_MODE_BASED_EPT_EXEC to userspace and allowing userspace to expose
and advertise the feature to the guest.

> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/kvm/vmx/nested.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 931a7361c30f..ce3a6d6dfce7 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -7099,6 +7099,10 @@ static void nested_vmx_setup_secondary_ctls(u32 ept_caps,
>  		 */
>  		if (cpu_has_vmx_vmfunc())
>  			msrs->vmfunc_controls = VMX_VMFUNC_EPTP_SWITCHING;
> +
> +		if (enable_pt_guest_exec_control)
> +			msrs->secondary_ctls_high |=
> +				SECONDARY_EXEC_MODE_BASED_EPT_EXEC;

Land this above the VMFUNC stuff so that more of the secondary_ctls_high code is
clumped together.

>  	}
>  
>  	/*
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS
  2025-03-13 20:36 ` [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS Jon Kohler
@ 2025-05-12 21:35   ` Sean Christopherson
  2025-05-13  2:01     ` Jon Kohler
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 21:35 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Vitaly Kuznetsov

On Thu, Mar 13, 2025, Jon Kohler wrote:
> Extend EVMCS1_SUPPORTED_2NDEXEC to understand MBEC enablement,
> otherwise presenting both EVMCS and MBEC at the same time will disable
> MBEC presentation into the guest.

A brief rundown on any relevant history of eVMCS support for MBEC would be
appreciated, if there is any.
 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> ---
>  arch/x86/kvm/vmx/hyperv.c       | 5 ++++-
>  arch/x86/kvm/vmx/hyperv_evmcs.h | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
> index fab6a1ad98dc..941a29c9e667 100644
> --- a/arch/x86/kvm/vmx/hyperv.c
> +++ b/arch/x86/kvm/vmx/hyperv.c
> @@ -138,7 +138,10 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *
>  		ctl_high &= evmcs_get_supported_ctls(EVMCS_EXEC_CTRL);
>  		break;
>  	case MSR_IA32_VMX_PROCBASED_CTLS2:
> -		ctl_high &= evmcs_get_supported_ctls(EVMCS_2NDEXEC);
> +		supported_ctrls = evmcs_get_supported_ctls(EVMCS_2NDEXEC);
> +		if (!vcpu->arch.pt_guest_exec_control)
> +			supported_ctrls &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;

No idea what you're trying to do, but I don't see how this is necessary in any
capacity.

> +		ctl_high &= supported_ctrls;
>  		break;
>  	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
>  	case MSR_IA32_VMX_PINBASED_CTLS:
> diff --git a/arch/x86/kvm/vmx/hyperv_evmcs.h b/arch/x86/kvm/vmx/hyperv_evmcs.h
> index a543fccfc574..930429f376f9 100644
> --- a/arch/x86/kvm/vmx/hyperv_evmcs.h
> +++ b/arch/x86/kvm/vmx/hyperv_evmcs.h
> @@ -87,6 +87,7 @@
>  	 SECONDARY_EXEC_PT_CONCEAL_VMX |				\
>  	 SECONDARY_EXEC_BUS_LOCK_DETECTION |				\
>  	 SECONDARY_EXEC_NOTIFY_VM_EXITING |				\
> +	 SECONDARY_EXEC_MODE_BASED_EPT_EXEC |				\
>  	 SECONDARY_EXEC_ENCLS_EXITING)
>  
>  #define EVMCS1_SUPPORTED_3RDEXEC (0ULL)
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
                   ` (19 preceding siblings ...)
  2025-04-23 13:54 ` Adrian-Ken Rueegsegger
@ 2025-05-12 21:46 ` Sean Christopherson
  2025-05-13  1:59   ` Jon Kohler
  20 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-12 21:46 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Mickaël Salaün, Tao Su,
	Xiaoyao Li, Zhao Liu

On Thu, Mar 13, 2025, Jon Kohler wrote:
> ## Summary
> This series introduces support for Intel Mode-Based Execute Control
> (MBEC) to KVM and nested VMX virtualization, aiming to significantly
> reduce VMexits and improve performance for Windows guests running with
> Hypervisor-Protected Code Integrity (HVCI).

...

> ## Testing
> Initial testing has been done on 6.12-based code with:
>   Guests
>     - Windows 11 24H2 26100.2894
>     - Windows Server 2025 24H2 26100.2894
>     - Windows Server 2022 W1H2 20348.825
>   Processors:
>     - Intel Skylake 6154
>     - Intel Sapphire Rapids 6444Y

This series needs testcases, and lots of 'em.  A short list off the top of my head:

 - New KVM-Unit-Test (KUT) ept_access_xxx testcases to verify KVM does the right
   thing with respect to user and supervisor code fetches when MBEC is:

     1. Supported and Enabled
     2. Supported but Disabled
     3. Unsupported

 - KUT testcases to verify VMLAUNCH/VMRESUME consistency checks.

 - KUT testcases to verify KVM treats WRITABLE+USER_EXEC as an illegal combination,
   i.e. that MBEC doesn't affect the W=1,R=0 behavior.

The access tests in particular absolutely need to be provided along with the next
version.  Unless I'm missing something, this RFC implementation is buggy throughout
due to tracking MBEC on a per-vCPU basis, and all of those bugs should be exposed
by even relatively basic testcases.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC)
  2025-05-12 21:46 ` Sean Christopherson
@ 2025-05-13  1:59   ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  1:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Alexander Grest, Nicolas Saenz Julienne,
	Madhavan T . Venkataraman, Mickaël Salaün, Tao Su,
	Xiaoyao Li, Zhao Liu



> On May 12, 2025, at 5:46 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> ## Summary
>> This series introduces support for Intel Mode-Based Execute Control
>> (MBEC) to KVM and nested VMX virtualization, aiming to significantly
>> reduce VMexits and improve performance for Windows guests running with
>> Hypervisor-Protected Code Integrity (HVCI).
> 
> ...
> 
>> ## Testing
>> Initial testing has been done on 6.12-based code with:
>>  Guests
>>    - Windows 11 24H2 26100.2894
>>    - Windows Server 2025 24H2 26100.2894
>>    - Windows Server 2022 W1H2 20348.825
>>  Processors:
>>    - Intel Skylake 6154
>>    - Intel Sapphire Rapids 6444Y
> 
> This series needs testcases, and lots of 'em.  A short list off the top of my head:
> 
> - New KVM-Unit-Test (KUT) ept_access_xxx testcases to verify KVM does the right
>   thing with respect to user and supervisor code fetches when MBEC is:
> 
>     1. Supported and Enabled
>     2. Supported but Disabled
>     3. Unsupported
> 
> - KUT testcases to verify VMLAUNCH/VMRESUME consistency checks.
> 
> - KUT testcases to verify KVM treats WRITABLE+USER_EXEC as an illegal combination,
>   i.e. that MBEC doesn't affect the W=1,R=0 behavior.
> 
> The access tests in particular absolutely need to be provided along with the next
> version.  Unless I'm missing something, this RFC implementation is buggy throughout
> due to tracking MBEC on a per-vCPU basis, and all of those bugs should be exposed
> by even relatively basic testcases.

Thanks for the review, Sean. I’ll work on rebasing my patches from 6.12 to latest
and incorporating the feedback across the board.

On the KUT side, good news is I already have most of that done-ish, so I’ll tune
them up when I get the next rev of the series, and send them both out together.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS
  2025-05-12 21:35   ` Sean Christopherson
@ 2025-05-13  2:01     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Vitaly Kuznetsov



> On May 12, 2025, at 5:35 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> Extend EVMCS1_SUPPORTED_2NDEXEC to understand MBEC enablement,
>> otherwise presenting both EVMCS and MBEC at the same time will disable
>> MBEC presentation into the guest.
> 
> A brief rundown on any relevant history of eVMCS support for MBEC would be
> appreciated, if there is any.

There isn’t any, but the broader theme of “make the commit/short log better” will
tidy this up. I spent quite a lot of time in this eVMCS area trying to wrap my
head around it, so I’ll codify that knowledge in the commit log.

> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> 
>> ---
>> arch/x86/kvm/vmx/hyperv.c       | 5 ++++-
>> arch/x86/kvm/vmx/hyperv_evmcs.h | 1 +
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
>> index fab6a1ad98dc..941a29c9e667 100644
>> --- a/arch/x86/kvm/vmx/hyperv.c
>> +++ b/arch/x86/kvm/vmx/hyperv.c
>> @@ -138,7 +138,10 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *
>> ctl_high &= evmcs_get_supported_ctls(EVMCS_EXEC_CTRL);
>> break;
>> case MSR_IA32_VMX_PROCBASED_CTLS2:
>> - ctl_high &= evmcs_get_supported_ctls(EVMCS_2NDEXEC);
>> + supported_ctrls = evmcs_get_supported_ctls(EVMCS_2NDEXEC);
>> + if (!vcpu->arch.pt_guest_exec_control)
>> + supported_ctrls &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> 
> No idea what you're trying to do, but I don't see how this is necessary in any
> capacity.

The eVMCS code has this logic to be able to “peel back” changes based
on runtime level enablement. I think with the broader changes to the series
suggested (moving control out of vcpu structure here), then this goes away.

I’ll seek to simplify this.

> 
>> + ctl_high &= supported_ctrls;
>> break;
>> case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
>> case MSR_IA32_VMX_PINBASED_CTLS:
>> diff --git a/arch/x86/kvm/vmx/hyperv_evmcs.h b/arch/x86/kvm/vmx/hyperv_evmcs.h
>> index a543fccfc574..930429f376f9 100644
>> --- a/arch/x86/kvm/vmx/hyperv_evmcs.h
>> +++ b/arch/x86/kvm/vmx/hyperv_evmcs.h
>> @@ -87,6 +87,7 @@
>> SECONDARY_EXEC_PT_CONCEAL_VMX | \
>> SECONDARY_EXEC_BUS_LOCK_DETECTION | \
>> SECONDARY_EXEC_NOTIFY_VM_EXITING | \
>> + SECONDARY_EXEC_MODE_BASED_EPT_EXEC | \
>> SECONDARY_EXEC_ENCLS_EXITING)
>> 
>> #define EVMCS1_SUPPORTED_3RDEXEC (0ULL)
>> -- 
>> 2.43.0
>> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte to understand MBEC
  2025-05-12 21:29   ` Sean Christopherson
@ 2025-05-13  2:04     ` Jon Kohler
  2025-05-13 17:54       ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sergey Dyasli



> On May 12, 2025, at 5:29 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> Extend make_spte to mask in and out bits depending on MBEC enablement.
> 
> Same complaints about the shortlog and changelog not saying anything useful.

ack

> 
>> 
>> Note: For the RFC/v1 series, I've added several 'For Review' items that
>> may require a bit deeper inspection, as well as some long winded
>> comments/annotations. These will be cleaned up for the next iteration
>> of the series after initial review.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> Co-developed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
>> Signed-off-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
>> 
>> ---
>> arch/x86/kvm/mmu/paging_tmpl.h |  3 +++
>> arch/x86/kvm/mmu/spte.c        | 30 ++++++++++++++++++++++++++----
>> 2 files changed, 29 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index a3a5cacda614..7675239f2dd1 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -840,6 +840,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>  * then we should prevent the kernel from executing it
>>  * if SMEP is enabled.
>>  */
>> + // FOR REVIEW:
>> + // ACC_USER_EXEC_MASK seems not necessary to add here since
>> + // SMEP is for kernel-only.
>> if (is_cr4_smep(vcpu->arch.mmu))
>> walker.pte_access &= ~ACC_EXEC_MASK;
> 
> I would straight up WARN, because it should be impossible to reach this code with
> ACC_USER_EXEC_MASK set.  In fact, this entire blob of code should be #ifdef'd
> out for PTTYPE_EPT.  AFAICT, the only reason it doesn't break nEPT is because
> it's impossible to have a WRITE EPT violation without READ (a.k.a. USER) being
> set.

Would you like me to send a separate patch out for that to clean up as
I go? Or make such ifdef’ery as part of this series?

> 
>> }
>> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
>> index 6f4994b3e6d0..89bdae3f9ada 100644
>> --- a/arch/x86/kvm/mmu/spte.c
>> +++ b/arch/x86/kvm/mmu/spte.c
>> @@ -178,6 +178,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> else if (kvm_mmu_page_ad_need_write_protect(sp))
>> spte |= SPTE_TDP_AD_WRPROT_ONLY;
>> 
>> + // For LKML Review:
>> + // In MBEC case, you can have exec only and also bit 10
>> + // set for user exec only. Do we need to cater for that here?
>> spte |= shadow_present_mask;
>> if (!prefetch)
>> spte |= spte_shadow_accessed_mask(spte);
>> @@ -197,12 +200,31 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
> 
> Needs to check ACC_USER_EXEC_MASK.
> 
>>     is_nx_huge_page_enabled(vcpu->kvm)) {
>> pte_access &= ~ACC_EXEC_MASK;
>> + if (vcpu->arch.pt_guest_exec_control)
>> + pte_access &= ~ACC_USER_EXEC_MASK;
>> }
>> 
>> - if (pte_access & ACC_EXEC_MASK)
>> - spte |= shadow_x_mask;
>> - else
>> - spte |= shadow_nx_mask;
>> + // For LKML Review:
>> + // We could probably optimize the logic here, but typing it out
>> + // long hand for now to make it clear how we're changing the control
>> + // flow to support MBEC.
> 
> I appreciate the effort, but this did far more harm than good.  Reviewing code
> that has zero chance of being the end product is a waste of time.  And unless I'm
> overlooking a subtlety, you're making this way harder than it needs to be:
> 
> 	if (pte_access & (ACC_EXEC_MASK | ACC_USER_EXEC_MASK)) {
> 		if (pte_access & ACC_EXEC_MASK)
> 			spte |= shadow_x_mask;
> 
> 		if (pte_access & ACC_USER_EXEC_MASK)
> 			spte |= shadow_ux_mask;
> 	} else {
> 		spte |= shadow_nx_mask;
> 	}

Ack, my apologies, wasn’t trying to make things harder, but I appreciate the
candid feedback. Thanks for the suggested code, I’ll incorporate that on the next
go. 
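
For reference, a standalone sketch of why the condensed form suggested above
covers the same cases as the long-hand MBEC branch in the patch. This is a
user-space model, not KVM code: the mask values are illustrative EPT-style
assumptions (bit 2 for execute, bit 10 for user execute, no NX bit),
ACC_USER_EXEC_MASK uses the value proposed in patch 09, and the function names
exec_bits_condensed() and exec_bits_longhand() are hypothetical.

```c
#include <stdint.h>

#define ACC_EXEC_MASK      (1u << 0)
#define ACC_USER_EXEC_MASK (1u << 3)	/* value proposed in patch 09 */

/* Illustrative EPT-style masks (assumed for this sketch only). */
static const uint64_t shadow_x_mask  = 1ull << 2;	/* supervisor execute */
static const uint64_t shadow_ux_mask = 1ull << 10;	/* user execute */
static const uint64_t shadow_nx_mask = 0;		/* EPT has no NX bit */

/* Condensed form from the review: one branch per execute permission. */
uint64_t exec_bits_condensed(unsigned int pte_access)
{
	uint64_t spte = 0;

	if (pte_access & (ACC_EXEC_MASK | ACC_USER_EXEC_MASK)) {
		if (pte_access & ACC_EXEC_MASK)
			spte |= shadow_x_mask;
		if (pte_access & ACC_USER_EXEC_MASK)
			spte |= shadow_ux_mask;
	} else {
		spte |= shadow_nx_mask;
	}
	return spte;
}

/* Long-hand MBEC branch from the RFC patch, kept for comparison. */
uint64_t exec_bits_longhand(unsigned int pte_access)
{
	uint64_t spte = 0;

	if (pte_access & ACC_EXEC_MASK) {
		if (pte_access & ACC_USER_EXEC_MASK)
			spte |= shadow_x_mask | shadow_ux_mask;
		else
			spte |= shadow_x_mask;
	} else if (pte_access & ACC_USER_EXEC_MASK) {
		spte |= shadow_ux_mask;
	} else {
		spte |= shadow_nx_mask;
	}
	return spte;
}
```

Both helpers produce the same bits for every pte_access value, so the nested
if/else tree is only needed if ACC_USER_EXEC_MASK can be set for a non-MBEC
MMU, which the review says KVM should prevent anyway.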

> 
> KVM needs to ensure ACC_USER_EXEC_MASK isn't spuriously set, but KVM should be
> doing that anyways.
> 
>> + if (!vcpu->arch.pt_guest_exec_control) { // non-mbec logic
>> + if (pte_access & ACC_EXEC_MASK)
>> + spte |= shadow_x_mask;
>> + else
>> + spte |= shadow_nx_mask;
>> + } else { // mbec logic
>> + if (pte_access & ACC_EXEC_MASK) { /* mbec: kernel exec */
>> + if (pte_access & ACC_USER_EXEC_MASK)
>> + spte |= shadow_x_mask | shadow_ux_mask; // KMX = 1, UMX = 1
>> + else
>> + spte |= shadow_x_mask;  // KMX = 1, UMX = 0
>> + } else if (pte_access & ACC_USER_EXEC_MASK) { /* mbec: user exec, no kernel exec */
>> + spte |= shadow_ux_mask; // KMX = 0, UMX = 1
>> + } else { /* mbec: nx */
>> + spte |= shadow_nx_mask; // KMX = 0, UMX = 0
>> + }
>> + }
>> 
>> if (pte_access & ACC_USER_MASK)
>> spte |= shadow_user_mask;
>> -- 
>> 2.43.0




* Re: [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte to understand MBEC
  2025-05-12 21:16   ` Sean Christopherson
@ 2025-05-13  2:09     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Mickaël Salaün, amit.shah@amd.com



> On May 12, 2025, at 5:16 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> @@ -359,15 +360,17 @@ TRACE_EVENT(
>> __entry->sptep = virt_to_phys(sptep);
>> __entry->level = level;
>> __entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK);
>> - __entry->x = is_executable_pte(__entry->spte);
>> + __entry->kx = is_executable_pte(__entry->spte, true, vcpu);
>> + __entry->ux = is_executable_pte(__entry->spte, false, vcpu);
>> __entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1;
>> ),
>> 
>> - TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx",
>> + TP_printk("gfn %llx spte %llx (%s%s%s%s%s) level %d at %llx",
>>  __entry->gfn, __entry->spte,
>>  __entry->r ? "r" : "-",
>>  __entry->spte & PT_WRITABLE_MASK ? "w" : "-",
>> -  __entry->x ? "x" : "-",
>> +  __entry->kx ? "X" : "-",
>> +  __entry->ux ? "x" : "-",
> 
> I don't have a better idea, but I do worry that X vs. x will lead to confusion.
> But as I said, I don't have a better idea...

There was rampant confusion about this in our internal review as well, but it was the
best we could come up with on the first go-around (short of adding more rigor to code
comments, etc.), and comments certainly don’t help at run/trace time.

> 
>>  __entry->u == -1 ? "" : (__entry->u ? "u" : "-"),
>>  __entry->level, __entry->sptep
>> )
>> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
>> index 1f7b388a56aa..fd7e29a0a567 100644
>> --- a/arch/x86/kvm/mmu/spte.h
>> +++ b/arch/x86/kvm/mmu/spte.h
>> @@ -346,9 +346,20 @@ static inline bool is_last_spte(u64 pte, int level)
>> return (level == PG_LEVEL_4K) || is_large_pte(pte);
>> }
>> 
>> -static inline bool is_executable_pte(u64 spte)
>> +static inline bool is_executable_pte(u64 spte, bool for_kernel_mode,
> 
> s/for_kernel_mode/is_user_access and invert.  A handful of KVM comments describe
> supervisor as "kernel mode", but those are quite old and IMO unnecessarily imprecise.
> 
>> +     struct kvm_vcpu *vcpu)
> 
> This needs to be an mmu (or maybe a root role?).  Hmm, thinking about the page
> role, I don't think one new bit will suffice.  Simply adding ACC_USER_EXEC_MASK
> won't let KVM differentiate between shadow pages created with ACC_EXEC_MASK for
> an MMU without MBEC, and a page created explicitly without ACC_USER_EXEC_MASK
> for an MMU *with* MBEC.
> 
> What I'm not sure about is if MBEC/GMET support needs to be captured in the base
> page role, or if shoving it in kvm_mmu_extended_role will suffice.  I'll think
> more on this and report back, need to refresh all the shadowing paging stuff, again...
> 
> 
>> {
>> - return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
>> + u64 x_mask = shadow_x_mask;
>> +
>> + if (vcpu->arch.pt_guest_exec_control) {
>> + x_mask |= shadow_ux_mask;
>> + if (for_kernel_mode)
>> + x_mask &= ~VMX_EPT_USER_EXECUTABLE_MASK;
>> + else
>> + x_mask &= ~VMX_EPT_EXECUTABLE_MASK;
>> + }
> 
> This is going to get messy when GMET support comes along, because the U/S bit
> would need to be inverted to do the right thing for supervisor fetches.  Rather
> than trying to shoehorn support into the existing code, I think we should prep
> for GMET and make the code a wee bit easier to follow in the process.  We can
> even implement the actual GMET semantics, but guarded with a WARN (emulating
> GMET isn't a terrible fallback in the event of a KVM bug).

+Amit

We’re on the same page there. In fact, Amit and I have been talking off list about
GMET with (notionally) this same goal in mind, of trying to make sure we do this in
such a way where we don’t need to rework the whole thing for GMET.

> 
> if (spte & shadow_nx_mask)
> return false;
> 
> if (!role.has_mode_based_exec)
> return (spte & shadow_x_mask) == shadow_x_mask;
> 
> if (WARN_ON_ONCE(!shadow_x_mask))
> return is_user_access || !(spte & shadow_user_mask);
> 
> return spte & (is_user_access ? shadow_ux_mask : shadow_x_mask);

Ack, I’ll chew on this.
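
A standalone model of the check suggested above may help when reasoning about
the MBEC and GMET cases side by side. This is a sketch under assumptions, not
KVM code: the masks are passed in a hypothetical struct spte_masks instead of
the shadow_* globals, has_mode_based_exec stands in for the proposed role bit,
and the WARN on the GMET path is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the shadow_* masks used by the check. */
struct spte_masks {
	uint64_t x;	/* execute (supervisor execute with MBEC) */
	uint64_t ux;	/* user execute (MBEC only) */
	uint64_t nx;	/* no-execute, zero for EPT */
	uint64_t user;	/* U/S bit, used by the GMET-style fallback */
};

bool spte_is_executable(const struct spte_masks *m, bool has_mode_based_exec,
			uint64_t spte, bool is_user_access)
{
	/* An NX bit, when the format has one, always wins. */
	if (spte & m->nx)
		return false;

	/* Without mode-based execute control there is a single X bit. */
	if (!has_mode_based_exec)
		return (spte & m->x) == m->x;

	/*
	 * No X bit at all (NPT-style format): approximate GMET, where a
	 * supervisor fetch from a user page is refused.
	 */
	if (!m->x)
		return is_user_access || !(spte & m->user);

	/* MBEC: pick the bit that matches the access mode. */
	return (spte & (is_user_access ? m->ux : m->x)) != 0;
}
```

With EPT-style masks (x = bit 2, ux = bit 10, nx = 0), an SPTE that has only
bit 10 set is executable for a user access and not for a supervisor access.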



* Re: [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC
  2025-05-12 19:37   ` Sean Christopherson
@ 2025-05-13  2:11     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org



> On May 12, 2025, at 3:37 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> Please be more precise with the shortlogs.  "Understand MBEC" is extremely vague.

Ack, thanks for the feedback. I’ll tune it up across the board

> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> Adjust the SPTE_MMIO_ALLOWED_MASK and associated values to make these
>> masks aware of PTE Bit 10, to be used by Intel MBEC.
> 
> Same thing here.  "aware of PTE bit 10" doesn't describe the change in a way that
> allows for quick review of the patch.  E.g. 
> 
>  KVM: x86/mmu: Exclude EPT MBEC's user-executable bit from the MMIO generation
> 
> The changelogs also need to explain *why*.  If you actually tried to write out
> justification for why KVM can't use bit 10 for the MMIO generation, then unless
> you start making stuff up (or Chao and I are missing something), you'll come to
> the same conclusion that Chao and I came to: this patch is unnecessary.

I’ll take another swing at it. IIRC I couldn’t get it working without this, but I’ll page
that back in and figure it out.


* Re: [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role
  2025-05-12 18:32   ` Sean Christopherson
@ 2025-05-13  2:14     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Mickaël Salaün, Sergey Dyasli



> On May 12, 2025, at 2:32 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
>> index 71d6fe28fafc..d9e22133b6d0 100644
>> --- a/arch/x86/kvm/mmu/spte.h
>> +++ b/arch/x86/kvm/mmu/spte.h
>> @@ -45,7 +45,9 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>> #define ACC_EXEC_MASK    1
>> #define ACC_WRITE_MASK   PT_WRITABLE_MASK
>> #define ACC_USER_MASK    PT_USER_MASK
>> -#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
>> +#define ACC_USER_EXEC_MASK (1ULL << 3)
>> +#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK | \
>> +  ACC_USER_EXEC_MASK)
> 
> This is very subtly a massive change, and I'm not convinced it's one we want to
> make.  All usage in the non-nested TDP flows is arguably wrong, because KVM should
> never enable MBEC when using non-nested TDP.
> 
> And the use in kvm_calc_shadow_ept_root_page_role() is wrong, because the root
> page role shouldn't include ACC_USER_EXEC_MASK if the associated VMCS doesn't
> have MBEC.  Ditto for the use in kvm_calc_cpu_role().
> 
> So I'm pretty sure the only bit of this change that is desirable/correct is the
> usage in kvm_mmu_page_get_access().  (And I guess maybe trace_mark_mmio_spte()?)
> 
> Off the cuff, I don't know what the best approach is.  One thought would be to
> prep for adding ACC_USER_EXEC_MASK to ACC_ALL by introducing ACC_RWX and using
> that where KVM really just wants to set RWX permissions.  That would free up
> ACC_ALL for the few cases where KVM really truly wants to capture all access bits.

At first blush, I like this ACC_RWX idea. I’ll chew on that and see what
trouble I can get in.
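
A minimal sketch of what the ACC_RWX split could look like, assuming the bit
positions quoted above (ACC_WRITE_MASK and ACC_USER_MASK are written out as
bits 1 and 2, matching PT_WRITABLE_MASK and PT_USER_MASK). The final names and
the set of call sites converted to ACC_RWX are open questions, so treat this as
an illustration only.

```c
#define ACC_EXEC_MASK      (1u << 0)
#define ACC_WRITE_MASK     (1u << 1)	/* PT_WRITABLE_MASK */
#define ACC_USER_MASK      (1u << 2)	/* PT_USER_MASK */
#define ACC_USER_EXEC_MASK (1u << 3)	/* new bit proposed in this patch */

/* For callers that only want to grant plain read/write/execute. */
#define ACC_RWX	(ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/* For the few callers that must capture every access bit. */
#define ACC_ALL	(ACC_RWX | ACC_USER_EXEC_MASK)

/* The new bit must not alias any of the existing RWX bits. */
_Static_assert((ACC_RWX & ACC_USER_EXEC_MASK) == 0,
	       "ACC_USER_EXEC_MASK overlaps ACC_RWX");
```

With this split, existing users that mean "full RWX" keep a value of 0x7, and
only code that opts in to ACC_ALL sees the user-execute bit.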



* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-05-12 18:23   ` Sean Christopherson
@ 2025-05-13  2:16     ` Jon Kohler
  2025-05-13 13:28       ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org



> On May 12, 2025, at 2:23 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> Add logic to enable / disable Intel Mode Based Execution Control (MBEC)
>> based on specific conditions.
>> 
>> MBEC depends on:
>> - User space exposing secondary execution control bit 22
>> - Extended Page Tables (EPT)
>> - The KVM module parameter `enable_pt_guest_exec_control`
>> 
>> If any of these conditions are not met, MBEC will be disabled
>> accordingly.
> 
> Why?  I know why, but I know why despite the changelog, not because of the
> changelog.
> 
>> Store runtime enablement within `kvm_vcpu_arch.pt_guest_exec_control`.
> 
> Again, why?  If you actually tried to explain this, I think/hope you would realize
> why it's wrong.
> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> 
>> ---
>> arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
>> arch/x86/kvm/vmx/vmx.h |  7 +++++++
>> 2 files changed, 18 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 7a98f03ef146..116910159a3f 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>> return -EIO;
>> 
>> vmx_cap->ept = 0;
>> + _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>> _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
>> }
>> if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
>> @@ -4641,11 +4642,15 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>> exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
>> if (!enable_ept) {
>> exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
>> + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>> exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
>> enable_unrestricted_guest = 0;
>> }
>> if (!enable_unrestricted_guest)
>> exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
>> + if (!enable_pt_guest_exec_control)
>> + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> 
> This is wrong and unnecessary.  As mentioned earlier, the input that matters is
> vmcs12.  This flag should *never* be set for vmcs01.

I’ll page this back in, but I’m like 75% sure it didn’t work when I did it that way.

Either way, thanks for the feedback, I’ll chase that to ground.

> 
>> if (kvm_pause_in_guest(vmx->vcpu.kvm))
>> exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>> if (!kvm_vcpu_apicv_active(vcpu))
>> @@ -4770,6 +4775,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>> if (vmx->ve_info)
>> vmcs_write64(VE_INFORMATION_ADDRESS,
>>     __pa(vmx->ve_info));
>> +
>> + vmx->vcpu.arch.pt_guest_exec_control =
>> + enable_pt_guest_exec_control && vmx_has_mbec(vmx);
> 
> This should effectively be dead code, because vmx_has_mbec() should never be
> true at vCPU creation.

Ack, will fix



* Re: [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper
  2025-05-12 18:14   ` Sean Christopherson
@ 2025-05-13  2:17     ` Jon Kohler
  0 siblings, 0 replies; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Mickaël Salaün



> On May 12, 2025, at 2:14 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> From: Mickaël Salaün <mic@digikod.net>
>> 
>> Add 'cpu_has_vmx_mbec' helper to determine whether the cpu based VMCS
>> from hardware has Intel Mode Based Execution Control exposed, which is
>> secondary execution control bit 22.
>> 
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Co-developed-by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> LOL, really?  There's a joke in here about how many SWEs it takes...

42, I think.

> 
>> ---
>> arch/x86/kvm/vmx/capabilities.h | 6 ++++++
>> 1 file changed, 6 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>> index cb6588238f46..f83592272920 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -253,6 +253,12 @@ static inline bool cpu_has_vmx_xsaves(void)
>> SECONDARY_EXEC_ENABLE_XSAVES;
>> }
>> 
>> +static inline bool cpu_has_vmx_mbec(void)
>> +{
>> + return vmcs_config.cpu_based_2nd_exec_ctrl &
>> + SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
>> +}
> 
> This absolutely doesn't warrant its own patch.  Introduce it whenever it's first
> used/needed.

Yep, will do

> 
>> +
>> static inline bool cpu_has_vmx_waitpkg(void)
>> {
>> return vmcs_config.cpu_based_2nd_exec_ctrl &
>> -- 
>> 2.43.0




* Re: [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC
  2025-05-12 18:08   ` Sean Christopherson
@ 2025-05-13  2:18     ` Jon Kohler
  2025-05-13  7:57       ` Shah, Amit
  0 siblings, 1 reply; 62+ messages in thread
From: Jon Kohler @ 2025-05-13  2:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org



> On May 12, 2025, at 2:08 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Mar 13, 2025, Jon Kohler wrote:
>> Add 'enable_pt_guest_exec_control' module parameter to x86 code, with
>> default value false.
> 
> ...
> 
>> +bool __read_mostly enable_pt_guest_exec_control;
>> +EXPORT_SYMBOL_GPL(enable_pt_guest_exec_control);
>> +module_param(enable_pt_guest_exec_control, bool, 0444);
> 
> The default value of a parameter doesn't prevent userspace from enabling the param.
> I.e. the instant this patch lands, userspace can enable enable_pt_guest_exec_control,
> which means MBEC needs to be 100% functional before this can be exposed to userspace.
> 
> The right way to do this is to simply omit the module param until KVM is ready to
> let userspace enable the feature.
> 
> All that said, I don't see any reason to add a module param for this.  *KVM* isn't
> using MBEC, the guest is using MBEC.  And unless host userspace is being extremely
> careless with VMX MSRs, exposing MBEC to the guest will require additional VMM
> enabling and/or user opt-in.
> 
> KVM provides module params to control features that KVM is using, generally when
> there is no sane alternative to tell KVM not to use a particular feature, i.e.
> when there is no other way for the user to disable a feature for testing/debug purposes.
> 
> Furthermore, how this series keys off the module param throughout KVM is completely
> wrong.  The *only* input that ultimately matters is the control bit in vmcs12.
> Whether or not KVM allows that bit to be set could be controlled by a module param,
> but KVM shouldn't be looking at the module param outside of that particular check.
> 
> TL;DR: advertising and enabling MBEC should come along when KVM allows the bit to
>       be set in vmcs12.

Gotcha, and I think this fact alone will drive a nice bit of cleanup thru
the entire series. Will mop it up
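
One way to picture the TL;DR above is a single gate that decides which
secondary controls L1 may set in vmcs12, with everything else in KVM keying
off vmcs12 instead of a global. The sketch below is a user-space model under
assumptions: nested_allowed_secondary_ctls() is a hypothetical helper, not an
existing KVM function, and only the two relevant control bits are modeled
(bit 1 for EPT, bit 22 for MBEC as stated in the series).

```c
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_EXEC_ENABLE_EPT          (1u << 1)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC (1u << 22)

/*
 * Hypothetical helper: compute the secondary controls that L1 is allowed
 * to set in vmcs12, given what hardware supports and whether EPT is on.
 */
uint32_t nested_allowed_secondary_ctls(uint32_t hw_supported, bool enable_ept)
{
	uint32_t allowed = 0;

	if (enable_ept && (hw_supported & SECONDARY_EXEC_ENABLE_EPT)) {
		allowed |= SECONDARY_EXEC_ENABLE_EPT;
		/* MBEC is only meaningful on top of EPT. */
		allowed |= hw_supported & SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
	}
	return allowed;
}
```

If a module param were ever wanted, this would be the only place to consult
it, which matches the point that KVM shouldn't look at the param outside of
that particular check.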


* Re: [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC
  2025-05-13  2:18     ` Jon Kohler
@ 2025-05-13  7:57       ` Shah, Amit
  0 siblings, 0 replies; 62+ messages in thread
From: Shah, Amit @ 2025-05-13  7:57 UTC (permalink / raw)
  To: jon@nutanix.com, seanjc@google.com
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Tue, 2025-05-13 at 02:18 +0000, Jon Kohler wrote:
> 
> 
> > On May 12, 2025, at 2:08 PM, Sean Christopherson
> > <seanjc@google.com> wrote:
> > 
> > 
> > On Thu, Mar 13, 2025, Jon Kohler wrote:
> > > Add 'enable_pt_guest_exec_control' module parameter to x86 code,
> > > with
> > > default value false.
> > 
> > ...
> > 
> > > +bool __read_mostly enable_pt_guest_exec_control;
> > > +EXPORT_SYMBOL_GPL(enable_pt_guest_exec_control);
> > > +module_param(enable_pt_guest_exec_control, bool, 0444);
> > 
> > The default value of a parameter doesn't prevent userspace from
> > enabling the param.
> > I.e. the instant this patch lands, userspace can enable
> > enable_pt_guest_exec_control,
> > which means MBEC needs to be 100% functional before this can be
> > exposed to userspace.
> > 
> > The right way to do this is to simply omit the module param until
> > KVM is ready to
> > let userspace enable the feature.
> > 
> > All that said, I don't see any reason to add a module param for
> > this.  *KVM* isn't
> > using MBEC, the guest is using MBEC.  And unless host userspace is
> > being extremely
> > careless with VMX MSRs, exposing MBEC to the guest will require
> > additional VMM
> > enabling and/or user opt-in.
> > 
> > KVM provides module params to control features that KVM is using,
> > generally when
> > there is no sane alternative to tell KVM not to use a particular
> > feature, i.e.
> > when there is no other way for the user to disable a feature for
> > testing/debug purposes.
> > 
> > Furthermore, how this series keys off the module param throughout
> > KVM is completely
> > wrong.  The *only* input that ultimately matters is the control bit
> > in vmcs12.
> > Whether or not KVM allows that bit to be set could be controlled by
> > a module param,
> > but KVM shouldn't be looking at the module param outside of that
> > particular check.
> > 
> > TL;DR: advertising and enabling MBEC should come along when KVM
> > allows the bit to
> >       be set in vmcs12.
> 
> Gotcha, and I think this fact alone will drive a nice bit of cleanup
> thru
> the entire series. Will mop it up

Yea - I think (at least for AMD GMET) if the VMM adds the GMET CPUID
bit to the guest CPUID, it should be taken as 'enabled' by KVM.  No
need for a module param there.

		Amit



* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-05-13  2:16     ` Jon Kohler
@ 2025-05-13 13:28       ` Sean Christopherson
  2025-05-14 11:14         ` Shah, Amit
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-13 13:28 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, May 13, 2025, Jon Kohler wrote:
> > On May 12, 2025, at 2:23 PM, Sean Christopherson <seanjc@google.com> wrote:
> > > On Thu, Mar 13, 2025, Jon Kohler wrote:
> >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >> index 7a98f03ef146..116910159a3f 100644
> >> --- a/arch/x86/kvm/vmx/vmx.c
> >> +++ b/arch/x86/kvm/vmx/vmx.c
> >> @@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> >> return -EIO;
> >> 
> >> vmx_cap->ept = 0;
> >> + _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> >> _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> >> }
> >> if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
> >> @@ -4641,11 +4642,15 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> >> exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
> >> if (!enable_ept) {
> >> exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
> >> + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> >> exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> >> enable_unrestricted_guest = 0;
> >> }
> >> if (!enable_unrestricted_guest)
> >> exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
> >> + if (!enable_pt_guest_exec_control)
> >> + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> > 
> > This is wrong and unnecessary.  As mentioned earlier, the input that matters is
> > vmcs12.  This flag should *never* be set for vmcs01.
> 
> I’ll page this back in, but I’m like 75% sure it didn’t work when I did it that way.

Then you had other bugs.  The control is per-VMCS and thus needs to be emulated
as such.  Definitely holler if you get stuck, there's no need to develop this in
complete isolation.
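
To make "per-VMCS" concrete, here is a sketch of a check that reads MBEC
enablement from vmcs12 alone. It is a user-space model under assumptions:
struct vmcs12_model carries only the two control fields involved, and
nested_cpu_has_mbec() is a hypothetical name patterned on KVM's existing
style of nested control checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS (1u << 31)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC    (1u << 22)

/* Reduced model of vmcs12: only the fields this check needs. */
struct vmcs12_model {
	uint32_t cpu_based_vm_exec_control;
	uint32_t secondary_vm_exec_control;
};

/*
 * MBEC is active for L2 only if L1 enabled secondary controls and set the
 * mode-based execute control bit in this particular vmcs12.
 */
bool nested_cpu_has_mbec(const struct vmcs12_model *vmcs12)
{
	return (vmcs12->cpu_based_vm_exec_control &
		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
	       (vmcs12->secondary_vm_exec_control &
		SECONDARY_EXEC_MODE_BASED_EPT_EXEC);
}
```

Nothing in this check depends on vmcs01 or on a module parameter, so two
vmcs12 instances belonging to the same L1 can legitimately give different
answers.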


* Re: [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte to understand MBEC
  2025-05-13  2:04     ` Jon Kohler
@ 2025-05-13 17:54       ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-05-13 17:54 UTC (permalink / raw)
  To: Jon Kohler
  Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sergey Dyasli

On Tue, May 13, 2025, Jon Kohler wrote:
> > I would straight up WARN, because it should be impossible to reach this code with
> > ACC_USER_EXEC_MASK set.  In fact, this entire blob of code should be #ifdef'd
> > out for PTTYPE_EPT.  AFAICT, the only reason it doesn't break nEPT is because
> > it's impossible to have a WRITE EPT violation without READ (a.k.a. USER) being
> > set.
> 
> Would you like me to send a separate patch out for that to clean up as
> I go? Or make such ifdef’ery as part of this series?

I'll send a patch.  It's not at all urgent and not a hard dependency for MBEC, the
comment(s) need to be rewritten, and I want to do an audit of paging_tmpl.h to see
if there is more code that'd be worth #ifdef'ing out for nEPT.


* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-05-13 13:28       ` Sean Christopherson
@ 2025-05-14 11:14         ` Shah, Amit
  2025-05-14 12:55           ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Shah, Amit @ 2025-05-14 11:14 UTC (permalink / raw)
  To: jon@nutanix.com, seanjc@google.com
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Tue, 2025-05-13 at 06:28 -0700, Sean Christopherson wrote:
> On Tue, May 13, 2025, Jon Kohler wrote:
> > > On May 12, 2025, at 2:23 PM, Sean Christopherson
> > > <seanjc@google.com> wrote:
> > > > On Thu, Mar 13, 2025, Jon Kohler wrote:
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 7a98f03ef146..116910159a3f 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -2694,6 +2694,7 @@ static int setup_vmcs_config(struct
> > > > vmcs_config *vmcs_conf,
> > > > return -EIO;
> > > > 
> > > > vmx_cap->ept = 0;
> > > > + _cpu_based_2nd_exec_control &=
> > > > ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> > > > _cpu_based_2nd_exec_control &=
> > > > ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> > > > }
> > > > if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID)
> > > > &&
> > > > @@ -4641,11 +4642,15 @@ static u32
> > > > vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> > > > exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
> > > > if (!enable_ept) {
> > > > exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
> > > > + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> > > > exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> > > > enable_unrestricted_guest = 0;
> > > > }
> > > > if (!enable_unrestricted_guest)
> > > > exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
> > > > + if (!enable_pt_guest_exec_control)
> > > > + exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
> > > 
> > > This is wrong and unnecessary.  As mentioned earlier, the input
> > > that matters is
> > > vmcs12.  This flag should *never* be set for vmcs01.
> > 
> > I’ll page this back in, but I’m like 75% sure it didn’t work when I
> > did it that way.
> 
> Then you had other bugs.  The control is per-VMCS and thus needs to
> be emulated
> as such.  Definitely holler if you get stuck, there's no need to
> develop this in
> complete isolation.

Looking at this from the AMD GMET POV, here's how I think support for
this feature for a Windows guest would be implemented:

* Do not enable the GMET feature in vmcb01.  Only the Windows guest (L1
guest) sets this bit for its own guest (L2 guest).  KVM (L0) should see
the bit set in vmcb02 (and vmcb12).  OTOH, pass on the CPUID bit to the
L1 guest.

* KVM needs to propagate the #NPF to Windows (instead of handling
anything itself -- ie no shadow page table adjustments or walks
needed).  Windows spawns an L2 guest that causes the #NPF, and Windows
is the one that needs to consume that fault.

* KVM needs to differentiate an #NPF exit due to GMET or non-GMET
condition -- check the CPL and U/S bits from the exit, and the NX bit
from the PTE that faulted.  If due to GMET, propagate it to the guest.
If not, continue handling it

(btw KVM MMU API question -- from the #NPF, I have the GPA of the L2
guest.  How to go from that guest GPA to look up the NX bit for that
page?  I skimmed and there doesn't seem to be an existing API for it -
so is walking the tables the only solution?)

		Amit


* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-05-14 11:14         ` Shah, Amit
@ 2025-05-14 12:55           ` Sean Christopherson
  2025-06-16  9:27             ` Shah, Amit
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-05-14 12:55 UTC (permalink / raw)
  To: Amit Shah
  Cc: jon@nutanix.com, x86@kernel.org, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Wed, May 14, 2025, Amit Shah wrote:
> On Tue, 2025-05-13 at 06:28 -0700, Sean Christopherson wrote:
> > On Tue, May 13, 2025, Jon Kohler wrote:
> > > > On May 12, 2025, at 2:23 PM, Sean Christopherson
> > > > This is wrong and unnecessary.  As mentioned earlier, the input that
> > > > matters is vmcs12.  This flag should *never* be set for vmcs01.
> > > 
> > > I’ll page this back in, but I’m like 75% sure it didn’t work when I
> > > did it that way.
> > 
> > Then you had other bugs.  The control is per-VMCS and thus needs to
> > be emulated
> > as such.  Definitely holler if you get stuck, there's no need to
> > develop this in
> > complete isolation.
> 
> Looking at this from the AMD GMET POV, here's how I think support for
> this feature for a Windows guest would be implemented:
> 
> * Do not enable the GMET feature in vmcb01.  Only the Windows guest (L1
> guest) sets this bit for its own guest (L2 guest).  KVM (L0) should see
> the bit set in vmcb02 (and vmcb12).  OTOH, pass on the CPUID bit to the
> L1 guest.
> 
> * KVM needs to propagate the #NPF to Windows (instead of handling
> anything itself -- ie no shadow page table adjustments or walks
> needed).  Windows spawns an L2 guest that causes the #NPF, and Windows
> is the one that needs to consume that fault.
> 
> * KVM needs to differentiate an #NPF exit due to GMET or non-GMET
> condition -- check the CPL and U/S bits from the exit, and the NX bit
> from the PTE that faulted.  If due to GMET, propagate it to the guest.
> If not, continue handling it

Yes, but no.  KVM shouldn't need to do anything special here other than teaching
update_permission_bitmask() to understand the GMET fault case.  Ditto for MBEC.
I'd type something up, but I would quickly encounter -ENOCOFFEE :-)

With the correct mmu->permissions[], permission_fault() will naturally detect
that a #NPF (or EPT Violation) from L2 due to a GMET/MBEC violation is a fault
in the nNPT/nEPT domain and route the exit to L1.

> (btw KVM MMU API question -- from the #NPF, I have the GPA of the L2
> guest.  How to go from that guest GPA to look up the NX bit for that
> page?  I skimmed and there doesn't seem to be an existing API for it -
> so is walking the tables the only solution?)

As above, KVM doesn't manually look up individual bits while handling faults.
The walk of the guest page tables (L1's NPT/EPT for this scenario) performed by
FNAME(walk_addr_generic) will gather the effective permissions in walker->pte_access,
and check for a permission_fault() after the walk is completed.
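
As a rough illustration of the "precompute, then look up" idea described
above, the sketch below builds one fault bitmap per fetch type and indexes it
by the gathered pte_access. It is a simplified user-space model, not the
layout of mmu->permissions[] or the real update_permission_bitmask(): struct
exec_perms and both function names are hypothetical, only instruction fetches
are modeled, and ACC_USER_EXEC_MASK uses the value proposed in the RFC.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACC_EXEC_MASK      (1u << 0)
#define ACC_WRITE_MASK     (1u << 1)
#define ACC_USER_MASK      (1u << 2)
#define ACC_USER_EXEC_MASK (1u << 3)	/* value proposed in the RFC */

/* One bitmap per fetch type: index 0 = supervisor fetch, 1 = user fetch. */
struct exec_perms {
	uint16_t fault[2];	/* bit N set: pte_access value N faults */
};

/* Precompute which pte_access combinations fault for each fetch type. */
void build_exec_perms(struct exec_perms *p, bool mbec)
{
	for (int user = 0; user < 2; user++) {
		/* With MBEC, user fetches are governed by the user-exec bit. */
		unsigned int need = (mbec && user) ? ACC_USER_EXEC_MASK
						   : ACC_EXEC_MASK;

		p->fault[user] = 0;
		for (unsigned int acc = 0; acc < 16; acc++) {
			if (!(acc & need))
				p->fault[user] |= (uint16_t)(1u << acc);
		}
	}
}

/* Fault check after the walk: a single table lookup, no per-bit logic. */
bool exec_permission_fault(const struct exec_perms *p, unsigned int pte_access,
			   bool user_fetch)
{
	return (p->fault[user_fetch] >> (pte_access & 15)) & 1;
}
```

In this model, teaching the fault path about MBEC only changes how the table
is built; the lookup performed after the page table walk stays the same.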


* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-05-14 12:55           ` Sean Christopherson
@ 2025-06-16  9:27             ` Shah, Amit
  2025-06-17 14:13               ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Shah, Amit @ 2025-06-16  9:27 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com, jon@nutanix.com,
	linux-kernel@vger.kernel.org

On Wed, 2025-05-14 at 05:55 -0700, Sean Christopherson wrote:
> On Wed, May 14, 2025, Amit Shah wrote:
> > On Tue, 2025-05-13 at 06:28 -0700, Sean Christopherson wrote:
> > > On Tue, May 13, 2025, Jon Kohler wrote:
> > > > > On May 12, 2025, at 2:23 PM, Sean Christopherson
> > > > > This is wrong and unnecessary.  As mentioned early, the input
> > > > > that
> > > > > matters is vmcs12.  This flag should *never* be set for
> > > > > vmcs01.
> > > > 
> > > > I’ll page this back in, but I’m like 75% sure it didn’t work
> > > > when I
> > > > did it that way.
> > > 
> > > Then you had other bugs.  The control is per-VMCS and thus needs
> > > to
> > > be emulated
> > > as such.  Definitely holler if you get stuck, there's no need to
> > > develop this in
> > > complete isolation.
> > 
> > Looking at this from the AMD GMET POV, here's how I think support
> > for
> > this feature for a Windows guest would be implemented:
> > 
> > * Do not enable the GMET feature in vmcb01.  Only the Windows guest
> > (L1
> > guest) sets this bit for its own guest (L2 guest).  KVM (L0) should
> > see
> > the bit set in vmcb02 (and vmcb12).  OTOH, pass on the CPUID bit to
> > the
> > L1 guest.
> > 
> > * KVM needs to propagate the #NPF to Windows (instead of handling
> > anything itself -- ie no shadow page table adjustments or walks
> > needed).  Windows spawns an L2 guest that causes the #NPF, and
> > Windows
> > is the one that needs to consume that fault.
> > 
> > * KVM needs to differentiate an #NPF exit due to GMET or non-GMET
> > condition -- check the CPL and U/S bits from the exit, and the NX
> > bit
> > from the PTE that faulted.  If due to GMET, propagate it to the
> > guest.
> > If not, continue handling it
> 
> Yes, but no.  KVM shouldn't need to do anything special here other
> than teaching
> update_permission_bitmask() to understand the GMET fault case.  Ditto
> for MBEC.
> I'd type something up, but I would quickly encounter -ENOCOFFE :-)
> 
> With the correct mmu->permissions[], permission_fault() will
> naturally detect
> that a #NPF (or EPT Violation) from L2 due to a GMET/MBEC violation
> is a fault
> in the nNPT/nEPT domain and route the exit to L1.
>
> > (btw KVM MMU API question -- from the #NPF, I have the GPA of the
> > L2
> > guest.  How to go from that guest GPA to look up the NX bit for
> > that
> > page?  I skimmed and there doesn't seem to be an existing API for
> > it -
> > so is walking the tables the only solution?)
> 
> As above, KVM doesn't manually look up individual bits while handling
> faults.
> The walk of the guest page tables (L1's NPT/EPT for this scenario)
> performed by
> FNAME(walk_addr_generic) will gather the effective permissions in
> walker->pte_access,
> and check for a permission_fault() after the walk is completed.

Hm, despite the discussions in the PUCK calls since this email, I have
this doubt, which may be fairly basic.  To determine whether the exit
was due to GMET, we have to check the effective U/S and NX bit for the
address that faulted.  That means we have to walk the L2's page tables
to get those bits from the L2's PTEs, and then from the error code in
exitinfo1, confirm why the #NPF happened.  (And even with Paolo's neat
SMEP hack, the exit reason due to GMET can only be confirmed by looking
at the guest's U/S and NX bits.)

And from what I see, currently page table walks only happen on L1's
page tables, and not on L2's page tables, is that right?

I'm sure I'm missing something here, though..


		Amit


* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-06-16  9:27             ` Shah, Amit
@ 2025-06-17 14:13               ` Sean Christopherson
  2025-07-09 13:40                 ` Shah, Amit
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-06-17 14:13 UTC (permalink / raw)
  To: Amit Shah
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com, jon@nutanix.com,
	linux-kernel@vger.kernel.org

On Mon, Jun 16, 2025, Amit Shah wrote:
> On Wed, 2025-05-14 at 05:55 -0700, Sean Christopherson wrote:
> > On Wed, May 14, 2025, Amit Shah wrote:
> > > (btw KVM MMU API question -- from the #NPF, I have the GPA of the L2
> > > guest.  How to go from that guest GPA to look up the NX bit for that
> > > page?  I skimmed and there doesn't seem to be an existing API for it - so
> > > is walking the tables the only solution?)
> > 
> > As above, KVM doesn't manually look up individual bits while handling
> > faults.  The walk of the guest page tables (L1's NPT/EPT for this scenario)
> > performed by FNAME(walk_addr_generic) will gather the effective permissions
> > in walker->pte_access, and check for a permission_fault() after the walk is
> > completed.
> 
> Hm, despite the discussions in the PUCK calls since this email, I have
> this doubt, which may be fairly basic.  To determine whether the exit
> was due to GMET, we have to check the effective U/S and NX bit for the
> address that faulted.  That means we have to walk the L2's page tables
> to get those bits from the L2's PTEs, and then from the error code in
> exitinfo1, confirm why the #NPF happened.  (And even with Paolo's neat
> SMEP hack, the exit reason due to GMET can only be confirmed by looking
> at the guest's U/S and NX bits.)
> 
> And from what I see, currently page table walks only happen on L1's
> page tables, and not on L2's page tables, is that right?

Nit, they aren't _L2's_ page tables, in that (barring crazy paravirt behavior)
L2 does not control the page tables.  In most conversations, that distinction
wouldn't matter, but when talking about which pages KVM walks when running an L2
while L1 is using NPT (or EPT), it's worth being very precise, because KVM may
also need to walk L2's non-nested page tables, i.e. the page tables that map L2
GVAs to L2 GPAs.

The least awful terminology we've come up with when referring to nested TDP is
to follow KVM's VMCS/VMCB terminology when doing nested virtualization:

  npt12: The NPT page tables controlled by L1 to manage L2 GPAs.  These are
         never referenced by hardware.
  npt02: KVM controlled page tables that shadow npt12, and are consumed by hardware.

> I'm sure I'm missing something here, though..

Heh, yep.  Part of that's my fault for using ambiguous terminology.  When I said
"L1's NPT/EPT" above, what I really meant was npt12.  I.e. this code

  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  {
	struct guest_walker walker;
	int r;

	WARN_ON_ONCE(fault->is_tdp);

	/*
	 * Look up the guest pte for the faulting address.
	 * If PFEC.RSVD is set, this is a shadow page fault.
	 * The bit needs to be cleared before walking guest page tables.
	 */
	r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
			     fault->error_code & ~PFERR_RSVD_MASK);

	/*
	 * The page is not mapped by the guest.  Let the guest handle it.
	 */
	if (!r) {
		if (!fault->prefetch)
			kvm_inject_emulated_page_fault(vcpu, &walker.fault);  <===== GMET #NPF

		return RET_PF_RETRY;
	}

which leads to the aforementioned FNAME(walk_addr_generic) and walker->pte_access
behavior, is walking npt12.  Because the #NPF will have occurred while running
L2, and by virtue of it being an #NPF (as opposed to a "legacy" #PF), KVM knows
the fault is in the context of npt02.

Before doing anything with respect to npt02, KVM needs to do walk_addr() on _npt12_
to determine whether the access is allowed by npt12.  E.g. the simplest scenario
to grok is if L2 accesses a (L2) GPA that isn't mapped by npt12, in which case KVM
needs to inject a #NPF into L1.

Same thing here.  On a PRESENT+FETCH+USER fault, if the effective protections
in npt12 have U/S=1 and GMET is enabled, then KVM needs to inject a #NPF into
L1.  
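
As a rough sketch of that condition, under stated assumptions: the bit
positions are the standard x86 page fault error code bits, but the helper name
and its parameters are hypothetical, and the condition itself is copied from
the paragraph above rather than verified against the APM.  It is not KVM code;
in KVM the check would fall out of update_permission_bitmask() and
permission_fault() as described earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 page fault error code bits (hypothetical SK_ names). */
#define SK_PFERR_PRESENT (1ull << 0)
#define SK_PFERR_USER    (1ull << 2)
#define SK_PFERR_FETCH   (1ull << 4)

/*
 * Return true if a #NPF taken while running L2 should be reflected to L1
 * as a GMET violation, per the condition stated above:
 * PRESENT+FETCH+USER in the error code, effective U/S=1 in npt12, and
 * GMET enabled by L1.
 *
 * npt12_user is assumed to be the effective U/S permission gathered by
 * walking npt12.
 */
static bool sk_gmet_fault_for_l1(uint64_t error_code, bool npt12_user,
				 bool gmet_enabled)
{
	const uint64_t mask = SK_PFERR_PRESENT | SK_PFERR_FETCH | SK_PFERR_USER;

	if (!gmet_enabled)
		return false;

	if ((error_code & mask) != mask)
		return false;

	return npt12_user;
}
```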

Side topic, someone should check with the AMD architects as to whether or not
GMET depends on EFER.NXE=1.  The APM says that all NPT mappings are executable
if EFER.NXE=0 in the host (where the "host" is L1 when dealing with nested NPT).
To me, that implies GMET is effectively ignored if EFER.NXE=0.

  Similarly, if the EFER.NXE bit is cleared for the host, all nested page table
  mappings are executable at the underlying nested level.


* Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic
  2025-06-17 14:13               ` Sean Christopherson
@ 2025-07-09 13:40                 ` Shah, Amit
  0 siblings, 0 replies; 62+ messages in thread
From: Shah, Amit @ 2025-07-09 13:40 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: x86@kernel.org, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, tglx@linutronix.de, bp@alien8.de,
	kvm@vger.kernel.org, pbonzini@redhat.com, jon@nutanix.com,
	linux-kernel@vger.kernel.org

On Tue, 2025-06-17 at 07:13 -0700, Sean Christopherson wrote:


> [snipped nested page walk overview]

Thanks a lot for this!

> 
> Side topic, someone should check with the AMD architects as to
> whether or not
> GMET depends on EFER.NXE=1.  The APM says that all NPT mappings are
> executable
> if EFER.NXE=0 in the host (where the "host" is L1 when dealing with
> nested NPT).
> To me, that implies GMET is effectively ignored if EFER.NXE=0.
> 
>   Similarly, if the EFER.NXE bit is cleared for the host, all nested
> page table
>   mappings are executable at the underlying nested level.

The "effective NX" computation includes EFER.NXE.  If that's 0, GMET is
still active and depends on the U/S bit if enabled, as mentioned in the
APM.

Cheers,

		Amit


end of thread, other threads:[~2025-07-09 13:40 UTC | newest]

Thread overview: 62+ messages
2025-03-13 20:36 [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 01/18] KVM: VMX: Remove EPT_VIOLATIONS_ACC_*_BIT defines Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 02/18] KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 03/18] KVM: x86: Add module parameter for Intel MBEC Jon Kohler
2025-05-12 18:08   ` Sean Christopherson
2025-05-13  2:18     ` Jon Kohler
2025-05-13  7:57       ` Shah, Amit
2025-03-13 20:36 ` [RFC PATCH 04/18] KVM: VMX: add cpu_has_vmx_mbec helper Jon Kohler
2025-05-12 18:14   ` Sean Christopherson
2025-05-13  2:17     ` Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 05/18] KVM: x86: Add pt_guest_exec_control to kvm_vcpu_arch Jon Kohler
2025-04-22  6:27   ` Chao Gao
2025-05-12 18:15   ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic Jon Kohler
2025-04-22  7:06   ` Chao Gao
2025-05-12 18:23   ` Sean Christopherson
2025-05-13  2:16     ` Jon Kohler
2025-05-13 13:28       ` Sean Christopherson
2025-05-14 11:14         ` Shah, Amit
2025-05-14 12:55           ` Sean Christopherson
2025-06-16  9:27             ` Shah, Amit
2025-06-17 14:13               ` Sean Christopherson
2025-07-09 13:40                 ` Shah, Amit
2025-03-13 20:36 ` [RFC PATCH 07/18] KVM: VMX: Define VMX_EPT_USER_EXECUTABLE_MASK Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 08/18] KVM: x86/mmu: Remove SPTE_PERM_MASK Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 09/18] KVM: x86/mmu: Extend access bitfield in kvm_mmu_page_role Jon Kohler
2025-05-12 18:32   ` Sean Christopherson
2025-05-13  2:14     ` Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 10/18] KVM: VMX: Extend EPT Violation protection bits Jon Kohler
2025-05-12 18:37   ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 11/18] KVM: VMX: Enhance EPT violation handler for PROT_USER_EXEC Jon Kohler
2025-05-12 18:54   ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 12/18] KVM: x86/mmu: Introduce shadow_ux_mask Jon Kohler
2025-04-23  3:06   ` Chao Gao
2025-05-12 19:13   ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 13/18] KVM: x86/mmu: Adjust SPTE_MMIO_ALLOWED_MASK to understand MBEC Jon Kohler
2025-04-23  5:37   ` Chao Gao
2025-05-12 19:37   ` Sean Christopherson
2025-05-13  2:11     ` Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 14/18] KVM: x86/mmu: Extend is_executable_pte " Jon Kohler
2025-04-23  6:16   ` Chao Gao
2025-05-12 21:16   ` Sean Christopherson
2025-05-13  2:09     ` Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 15/18] KVM: x86/mmu: Extend make_spte " Jon Kohler
2025-05-12 21:29   ` Sean Christopherson
2025-05-13  2:04     ` Jon Kohler
2025-05-13 17:54       ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 16/18] KVM: nVMX: Setup Intel MBEC in nested secondary controls Jon Kohler
2025-05-12 21:32   ` Sean Christopherson
2025-03-13 20:36 ` [RFC PATCH 17/18] KVM: VMX: Allow MBEC with EVMCS Jon Kohler
2025-05-12 21:35   ` Sean Christopherson
2025-05-13  2:01     ` Jon Kohler
2025-03-13 20:36 ` [RFC PATCH 18/18] KVM: x86: Enable module parameter for MBEC Jon Kohler
2025-04-15  9:29 ` [RFC PATCH 00/18] KVM: VMX: Introduce Intel Mode-Based Execute Control (MBEC) Mickaël Salaün
2025-04-15 14:43   ` Sean Christopherson
2025-05-12 15:26     ` Jon Kohler
2025-04-15 14:43   ` Jon Kohler
2025-04-16 15:44     ` Mickaël Salaün
2025-04-23 13:54 ` Adrian-Ken Rueegsegger
2025-05-12 15:26   ` Jon Kohler
2025-05-12 21:46 ` Sean Christopherson
2025-05-13  1:59   ` Jon Kohler
