[PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
@ 2025-02-01  0:50 Sean Christopherson
  2025-02-01  0:50 ` [PATCH 1/2] x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state() Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-02-01  0:50 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky

Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
memory from TOLUD => 4GiB, as unconditionally writeback.

TDX in particular has created an impossible situation with MTRRs.  Because
TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
too simple).

Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
as WB, which in turn prevents device drivers from mapping device memory as
WC/UC-.

For the record, I hate this hack.  But it's the safest approach I can come
up with.  E.g. forcing ioremap() to always use WB scares me because it's
possible, however unlikely, that the kernel could try to map non-emulated
memory (that is presented as MMIO to the guest) as WC/UC-, and silently
forcing those mappings to WB could do weird things.

My initial thought was to effectively revert the offending commit and
skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Sean Christopherson (2):
  x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
  x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
    WB

 arch/x86/include/asm/mtrr.h        |  5 +++--
 arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
 arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 8 deletions(-)

base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
-- 
2.48.1.362.g079036d154-goog

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
  2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
@ 2025-02-01  0:50 ` Sean Christopherson
  2025-02-01  0:50 ` [PATCH 2/2] x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced WB Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-02-01  0:50 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky

When *potentially* forcing MTRRs to a single memory type, return whether
or not MTRRs were indeed overridden so that the caller can take additional
action when necessary.  E.g. KVM-as-a-guest will use the information to
also force the PAT memtype for legacy devices to be WB.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/mtrr.h        |  5 +++--
 arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index c69e269937c5..598753189f19 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -58,7 +58,7 @@ struct mtrr_state_type {
  */
 # ifdef CONFIG_MTRR
 void mtrr_bp_init(void);
-void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
+bool guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
 			    mtrr_type def_type);
 extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
 extern void mtrr_save_fixed_ranges(void *);
@@ -75,10 +75,11 @@ void mtrr_disable(void);
 void mtrr_enable(void);
 void mtrr_generic_set_state(void);
 #  else
-static inline void guest_force_mtrr_state(struct mtrr_var_range *var,
+static inline bool guest_force_mtrr_state(struct mtrr_var_range *var,
 					  unsigned int num_var,
 					  mtrr_type def_type)
 {
+	return false;
 }
 
 static inline u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 2fdfda2b60e4..4fd704907dbc 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -435,19 +435,21 @@ void __init mtrr_copy_map(void)
  * @var: MTRR variable range array to use
  * @num_var: length of the @var array
  * @def_type: default caching type
+ *
+ * Returns %true if MTRRs were overridden, %false if they were not.
  */
-void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
+bool guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
 			    mtrr_type def_type)
 {
 	unsigned int i;
 
 	/* Only allowed to be called once before mtrr_bp_init(). */
 	if (WARN_ON_ONCE(mtrr_state_set))
-		return;
+		return false;
 
 	/* Only allowed when running virtualized. */
 	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
-		return;
+		return false;
 
 	/*
 	 * Only allowed for special virtualization cases:
@@ -460,7 +462,7 @@ void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
 	    !hv_is_isolation_supported() &&
 	    !cpu_feature_enabled(X86_FEATURE_XENPV) &&
 	    !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
-		return;
+		return false;
 
 	/* Disable MTRR in order to disable MTRR modifications. */
 	setup_clear_cpu_cap(X86_FEATURE_MTRR);
@@ -480,6 +482,7 @@ void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
 	mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
 
 	mtrr_state_set = 1;
+	return true;
 }
 
 static u8 type_merge(u8 type, u8 new_type, u8 *uniform)
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced WB
  2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
  2025-02-01  0:50 ` [PATCH 1/2] x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state() Sean Christopherson
@ 2025-02-01  0:50 ` Sean Christopherson
  2025-02-01 14:25 ` [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Dionna Amalie Glaze
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-02-01  0:50 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky

When running as an SNP or TDX guest under KVM, treat the legacy PCI hole,
i.e. memory between Top of Lower Usable DRAM and 4GiB, as an untracked PAT
range to workaround issues with mapping legacy devices when MTRRs are
forced to WB.

In most KVM-based setups, legacy devices such as the HPET and TPM are
enumerated via ACPI.  For unknown reasons, ACPI auto-maps such devices as
WB, whereas the dedicated device drivers map memory as WC or UC.  In normal
setups, the entire mess "works" as firmware configures the PCI hole (and
other device memory) to be UC in the MTRRs.  As a result, the ACPI mappings
end up UC, which is compatible with the drivers' requested WC/UC-.

With WB MTRRs, the ACPI mappings get their requested WB.  If acpi_init()
runs before the corresponding device driver is probed, ACPI's WB mapping
will "win", and result in the driver's ioremap() failing because the
existing WB mapping isn't compatible with the requested WC/UC-.

E.g. when a TPM is emulated by the hypervisor (ignoring the security
implications of relying on what is allegedly an untrusted entity to store
measurements), the TPM driver will request UC and fail:

  [  1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
  [  1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).

The above trace is from a Google-VMM based VM, but the same behavior
happens with a QEMU based VM.  E.g. tracing mapping requests for HPET under
QEMU yields:

   Mapping HPET, req_type = 0
   WARNING: CPU: 5 PID: 1 at arch/x86/mm/pat/memtype.c:528 memtype_reserve+0x22f/0x3f0
   Call Trace:
    <TASK>
    __ioremap_caller.constprop.0+0xd6/0x330
    acpi_os_map_iomem+0x195/0x1b0
    acpi_ex_system_memory_space_handler+0x11c/0x2f0
    acpi_ev_address_space_dispatch+0x168/0x3b0
    acpi_ex_access_region+0xd7/0x280
    acpi_ex_field_datum_io+0x73/0x210
    acpi_ex_extract_from_field+0x267/0x2a0
    acpi_ex_read_data_from_field+0x8e/0x220
    acpi_ex_resolve_node_to_value+0xe2/0x2b0
    acpi_ds_evaluate_name_path+0xa9/0x100
    acpi_ds_exec_end_op+0x21f/0x4c0
    acpi_ps_parse_loop+0xf4/0x670
    acpi_ps_parse_aml+0x17b/0x3d0
    acpi_ps_execute_method+0x137/0x260
    acpi_ns_evaluate+0x1f0/0x2d0
    acpi_evaluate_object+0x13d/0x2e0
    acpi_evaluate_integer+0x50/0xe0
    acpi_bus_get_status+0x7b/0x140
    acpi_add_single_object+0x3f8/0x750
    acpi_bus_check_add+0xb2/0x340
    acpi_ns_walk_namespace+0xfe/0x200
    acpi_walk_namespace+0xbb/0xe0
    acpi_bus_scan+0x1b5/0x1d0
    acpi_scan_init+0xd5/0x290
    acpi_init+0x1fc/0x520
    do_one_initcall+0x41/0x1d0
    kernel_init_freeable+0x164/0x260
    kernel_init+0x16/0x1a0
    ret_from_fork+0x2d/0x50
    ret_from_fork_asm+0x11/0x20
   ---[ end trace 0000000000000000 ]---

The only reason this doesn't cause problems for HPET is because HPET gets
special treatment via x86_init.timers.timer_init(), and so gets a chance
to create its UC- mapping before acpi_init() clobbers things.  Disabling
the early call to hpet_time_init() yields the same behavior for HPET:

  [  0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0

Hack around the mess by forcing such mappings to WB, as the memory type is
irrevelant.  Even in a theoretical setup where such devices are passed
through by the host, i.e. point at real MMIO memory, it is KVM's (as the
hypervisor) responsibility to force the memory to be WC/UC, e.g. via EPT
memtype under TDX or real hardware MTRRs under SNP.  Not doing so cannot
work, and the hypervisor is highly motivated to do the right thing as
letting the guest access hardware MMIO with WB would likely result in a
variety of fatal #MCs.

Limit the hack to the legacy PCI hole on the off chance that there are
use cases that want to map virtual devices with WC/UC.  E.g. in theory, it
would be possible to expose hardware GPU buffers to an SNP or TDX guest.
Extending the hack, e.g. if there are use cases for memory above 4GiB that
are affected by ACPI, is far easier than debugging memory corruption if a
driver requests WC/UC and silently gets WB.

Double down on forcing everything to WB, e.g. instead of fixing the CR0.CD
issue and reverting to a "normal" model, as OVMF has also been taught to
ignore MTRRs when running as a TDX guest:

  3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE in TD-Guest")
  071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest")

And running with firmware that doesn't program MTRRs would likely put the
kernel back into the conundrum of ACPI mapping devices WB, with drivers
wanting WC/UC-.

Fixes: 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and TDX")
Cc: stable@vger.kernel.org
Cc: Dionna Glaze <dionnaglaze@google.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: Jürgen Groß <jgross@suse.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Binbin Wu <binbin.wu@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/kvm.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..7ae294fe99c3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -931,6 +931,23 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
 			   KVM_MAP_GPA_RANGE_ENC_STAT(enc) | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
 }

+static u64 kvm_tolud __ro_after_init;
+
+static bool kvm_is_forced_wb_range(u64 start, u64 end)
+{
+	/*
+	 * In addition to the standard ISA override, force all low memory above
+	 * TOLUD to WB so that legacy devices are mapped with WB when running
+	 * as an SNP or TDX guest.  The memtype itself is completely irrevelant
+	 * as the devices are emulated, the override^Whack is needed purely to
+	 * avoid failures due to ACPI mapping device memory as WB in advance of
+	 * device drivers requesting WC or UC.  In a system with MTRRs, ACPI's
+	 * mappings get forced to UC via MTRRs (programmed sanely by firmware).
+	 */
+	return is_ISA_range(start, end) ||
+	       (start >= kvm_tolud && end <= SZ_4G);
+}
+
 static void __init kvm_init_platform(void)
 {
 	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
@@ -982,8 +999,18 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;

-	/* Set WB as the default cache mode for SEV-SNP and TDX */
-	guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
+	/*
+	 * Set WB as the default cache mode for SEV-SNP and TDX.  MTRRs may be
+	 * enumerated as supported, but neither the TDX-Module (Secure EPT) nor
+	 * KVM (normal EPT for TDX, virtual MTRRs for NPT) actually virtualizes
+	 * MTRR memory types.  If MTRRs are forced to writeback, register KVM's
+	 * range-based WB override to handle cases where device drivers try to
+	 * map an emulated device's memory as WC, and fail because it's all WB.
+	 */
+	if (guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK)) {
+		kvm_tolud = (e820__end_of_low_ram_pfn() << PAGE_SHIFT);
+		x86_platform.is_untracked_pat_range = kvm_is_forced_wb_range;
+	}
 }

 #if defined(CONFIG_AMD_MEM_ENCRYPT)
-- 
2.48.1.362.g079036d154-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
  2025-02-01  0:50 ` [PATCH 1/2] x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state() Sean Christopherson
  2025-02-01  0:50 ` [PATCH 2/2] x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced WB Sean Christopherson
@ 2025-02-01 14:25 ` Dionna Amalie Glaze
  2025-02-03 18:14 ` Edgecombe, Rick P
  2025-07-08 14:24 ` Nikolay Borisov
  4 siblings, 0 replies; 12+ messages in thread
From: Dionna Amalie Glaze @ 2025-02-01 14:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini, linux-kernel, kvm, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky, Yamahata, Isaku,
	Xiang, Qinglan, Xu, Min M

On Fri, Jan 31, 2025 at 4:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
>
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
>
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
>
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
>
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
>

EDK2 has a bug tracker. Maybe this is still fixable on Intel's end.
Adding Qinglan, Isaku, and Min to comment.

> Sean Christopherson (2):
>   x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>   x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>     WB
>
>  arch/x86/include/asm/mtrr.h        |  5 +++--
>  arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>  arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>  3 files changed, 39 insertions(+), 8 deletions(-)
>
>
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
> --
> 2.48.1.362.g079036d154-goog
>


-- 
-Dionna Glaze, PhD, CISSP, CCSP (she/her)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-02-01 14:25 ` [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Dionna Amalie Glaze
@ 2025-02-03 18:14 ` Edgecombe, Rick P
  2025-02-03 20:33   ` Sean Christopherson
  2025-07-08 14:24 ` Nikolay Borisov
  4 siblings, 1 reply; 12+ messages in thread
From: Edgecombe, Rick P @ 2025-02-03 18:14 UTC (permalink / raw)
  To: tglx@linutronix.de, x86@kernel.org, mingo@redhat.com,
	pbonzini@redhat.com, seanjc@google.com, bp@alien8.de,
	dave.hansen@linux.intel.com
  Cc: pgonda@google.com, jgross@suse.com, Wu, Binbin,
	vkuznets@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, dionnaglaze@google.com,
	kvm@vger.kernel.org, thomas.lendacky@amd.com

On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
on TDX private memory (in the old way KVM used to do it) and there was no
upstream support yet, there looked like an opportunity to avoid strange "happens
to work" support that normal VMs ended up with. Instead KVM could just not
support TDX KVM MTRRs from the beginning. So part of the thinking was that we
could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).

Since there is no upstream KVM TDX support yet, why isn't it an option to still
revert the EDKII commit too? It was a relatively recent change.

To me it seems that the normal KVM MTRR support is not ideal, because it is
still lying about what it is doing. For example, in the past there was an
attempt to use UC to prevent speculative execution accesses to sensitive data.
The KVM MTRR support only happens to work with existing guests, but not all
possible MTRR usages.

Since diverging from the architecture creates loose ends like that, we could
instead define some other way for EDKII to communicate the ranges to the kernel.
Like some simple KVM PV MSRs that are for communication only, and not
overlapping with architecture that expects to cause memory behavior. EDKII and
the kernel could be changed to use them.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 18:14 ` Edgecombe, Rick P
@ 2025-02-03 20:33   ` Sean Christopherson
  2025-02-03 23:01     ` Edgecombe, Rick P
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-02-03 20:33 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: tglx@linutronix.de, x86@kernel.org, mingo@redhat.com,
	pbonzini@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	pgonda@google.com, jgross@suse.com, Binbin Wu,
	vkuznets@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, dionnaglaze@google.com,
	kvm@vger.kernel.org, thomas.lendacky@amd.com

On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> > Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> > x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> > memory from TOLUD => 4GiB, as unconditionally writeback.
> > 
> > TDX in particular has created an impossible situation with MTRRs.  Because
> > TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> > was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> > too simple).
> > 
> > Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> > make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> > as WB, which in turn prevents device drivers from mapping device memory as
> > WC/UC-.
> > 
> > For the record, I hate this hack.  But it's the safest approach I can come
> > up with.  E.g. forcing ioremap() to always use WB scares me because it's
> > possible, however unlikely, that the kernel could try to map non-emulated
> > memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> > forcing those mappings to WB could do weird things.
> > 
> > My initial thought was to effectively revert the offending commit and
> > skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> > but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
> mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
> on TDX private memory (in the old way KVM used to do it) and there was no
> upstream support yet, there looked like an opportunity to avoid strange "happens
> to work" support that normal VMs ended up with. Instead KVM could just not
> support TDX KVM MTRRs from the beginning. So part of the thinking was that we
> could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).
> 
> Since there is no upstream KVM TDX support yet, why isn't it an option to still
> revert the EDKII commit too? It was a relatively recent change.

I'm fine with that route too, but it too is a band-aid.  Relying on the *untrusted*
hypervisor to essentially communicate memory maps is not a winning strategy. 

> To me it seems that the normal KVM MTRR support is not ideal, because it is
> still lying about what it is doing. For example, in the past there was an
> attempt to use UC to prevent speculative execution accesses to sensitive data.
> The KVM MTRR support only happens to work with existing guests, but not all
> possible MTRR usages.
> 
> Since diverging from the architecture creates loose ends like that, we could
> instead define some other way for EDKII to communicate the ranges to the kernel.
> Like some simple KVM PV MSRs that are for communication only, and not

Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
through the hypervisor to communicate information within the guest is asinine,
especially for CoCo VMs.

> overlapping with architecture that expects to cause memory behavior. EDKII and
> the kernel could be changed to use them.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 20:33   ` Sean Christopherson
@ 2025-02-03 23:01     ` Edgecombe, Rick P
  2025-02-04  0:27       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Edgecombe, Rick P @ 2025-02-03 23:01 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Wu, Binbin,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > still
> > revert the EDKII commit too? It was a relatively recent change.
> 
> I'm fine with that route too, but it too is a band-aid.  Relying on the
> *untrusted*
> hypervisor to essentially communicate memory maps is not a winning strategy. 
> 
> > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > still lying about what it is doing. For example, in the past there was an
> > attempt to use UC to prevent speculative execution accesses to sensitive
> > data.
> > The KVM MTRR support only happens to work with existing guests, but not all
> > possible MTRR usages.
> > 
> > Since diverging from the architecture creates loose ends like that, we could
> > instead define some other way for EDKII to communicate the ranges to the
> > kernel.
> > Like some simple KVM PV MSRs that are for communication only, and not
> 
> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> through the hypervisor to communicate information within the guest is asinine,
> especially for CoCo VMs.

Hmm, right.

So the other options could be:

1. Some TDX module feature to hold the ranges:
 - Con: Not shared with AMD

2. Re-use MTRRs for the communication, revert changes in guest and edk2:
 - Con: Creating more half support, when it's technically not required
 - Con: Still bouncing through the hypervisor
 - Pro: Design and code is clear

3. Create some new architectural definition, like a bit that means "MTRRs don't
actually work:
 - Con: Takes a long time, need to get agreement
 - Con: Still bouncing through the hypervisor
 - Pro: More pure solution

4. Do this series:
 - Pro: Looks ok to me
 - Cons: As explained in the patches, it's a bit hacky.
 - Cons: Could there be more cases than the legacy PCI hole?


I would kind of like to see something like 3, but 2 or 4 seem the only feasible
ones if we want to resolve this soon.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 23:01     ` Edgecombe, Rick P
@ 2025-02-04  0:27       ` Sean Christopherson
  2025-02-05  3:51         ` Edgecombe, Rick P
  2025-02-10 15:29         ` Binbin Wu
  0 siblings, 2 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-02-04  0:27 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Binbin Wu,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > still
> > > revert the EDKII commit too? It was a relatively recent change.
> > 
> > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > *untrusted*
> > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > 
> > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > still lying about what it is doing. For example, in the past there was an
> > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > data.
> > > The KVM MTRR support only happens to work with existing guests, but not all
> > > possible MTRR usages.
> > > 
> > > Since diverging from the architecture creates loose ends like that, we could
> > > instead define some other way for EDKII to communicate the ranges to the
> > > kernel.
> > > Like some simple KVM PV MSRs that are for communication only, and not
> > 
> > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > through the hypervisor to communicate information within the guest is asinine,
> > especially for CoCo VMs.
> 
> Hmm, right.
> 
> So the other options could be:
> 
> 1. Some TDX module feature to hold the ranges:
>  - Con: Not shared with AMD
> 
> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:

Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
changes is necessary regardless of what happens in the kernel.  Or at the least,
somehow communicate to EDK2 users that ingesting those changes is a bad idea
unless the kernel has also been updated.

AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
guest kernel, there's no way to know if deploying those EDK2 changes is safe.

[*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf

>  - Con: Creating more half support, when it's technically not required
>  - Con: Still bouncing through the hypervisor

I assume by "Re-use MTRRs for the communication" you also mean updating the guest
to address the "everything is UC!" flaw, otherwise another con is:

   - Con: Doesn't address the performance issue with TDX guests "using" UC
          memory by default (unless there's yet more enabled).

Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
doing MTRR stuff as nonrmal?

>  - Pro: Design and code is clear
>
> 3. Create some new architectural definition, like a bit that means "MTRRs don't
> actually work:
>  - Con: Takes a long time, need to get agreement
>  - Con: Still bouncing through the hypervisor

Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
don't actually affect the memory type when running under KVM.

FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
overwhelming majority of guests.  That's not desirable long term because it
prevents the guest from using WC (via PAT) in situations where doing so is needed
for performance and/or correctness.

>  - Pro: More pure solution

MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
because there's never a valid mapping, i.e. there is no physical memory and thus
no memtype.  In other words, under KVM guests (and possibly other hypervisors),
MTRRs end up being nothing more than a communication channel between guest firmware
and the kernel.

The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
by the untrusted host.  But that's largely a future problem, unless someone has a
clever way to fix the kernel mess.

> 4. Do this series:
>  - Pro: Looks ok to me
>  - Cons: As explained in the patches, it's a bit hacky.
>  - Cons: Could there be more cases than the legacy PCI hole?
> 
> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
> ones if we want to resolve this soon.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-04  0:27       ` Sean Christopherson
@ 2025-02-05  3:51         ` Edgecombe, Rick P
  2025-02-05  7:49           ` Xu, Min M
  2025-02-10 15:29         ` Binbin Wu
  1 sibling, 1 reply; 12+ messages in thread
From: Edgecombe, Rick P @ 2025-02-05  3:51 UTC (permalink / raw)
  To: seanjc@google.com, Xu, Min M
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Wu, Binbin,
	linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	hpa@zytor.com, vkuznets@redhat.com, bp@alien8.de, jgross@suse.com,
	x86@kernel.org, pgonda@google.com

+Min, can you comment?

3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE in
TD-Guest") turned out to be problematic in practice.

Full thread:
https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/

On Mon, 2025-02-03 at 16:27 -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > > still
> > > > revert the EDKII commit too? It was a relatively recent change.
> > > 
> > > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > > *untrusted*
> > > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > > 
> > > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > > still lying about what it is doing. For example, in the past there was an
> > > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > > data.
> > > > The KVM MTRR support only happens to work with existing guests, but not all
> > > > possible MTRR usages.
> > > > 
> > > > Since diverging from the architecture creates loose ends like that, we could
> > > > instead define some other way for EDKII to communicate the ranges to the
> > > > kernel.
> > > > Like some simple KVM PV MSRs that are for communication only, and not
> > > 
> > > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > > through the hypervisor to communicate information within the guest is asinine,
> > > especially for CoCo VMs.
> > 
> > Hmm, right.
> > 
> > So the other options could be:
> > 
> > 1. Some TDX module feature to hold the ranges:
> >  - Con: Not shared with AMD
> > 
> > 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> 
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.  Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
> 
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.
>  
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
> 

Hmm. Since there is no upstream TDX KVM support, for it's part, I guess KVM
should still get a chance to define a cleaner solution (if there actually was a
cleaner solution). But yea, it would mean only components from after the
solution was settled could be used together for a fully working stack. And
it should probably be called out somehow. Maybe could be in the KVM TDX docs or
something.

Still seems like a thing to avoid if possible.

> >  - Con: Creating more half support, when it's technically not required
> >  - Con: Still bouncing through the hypervisor
> 
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
> 
>    - Con: Doesn't address the performance issue with TDX guests "using" UC
>           memory by default (unless there's yet more enabled).

Hmm. This is quite the tangled corner.

> 
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?

I'll have to get back to you on this one. Kirill probably could give a better
answer, but likely will not be able to follow up on this thread until next week.

> 
> >  - Pro: Design and code is clear
> > 
> > 3. Create some new architectural definition, like a bit that means "MTRRs don't
> > actually work:
> >  - Con: Takes a long time, need to get agreement
> >  - Con: Still bouncing through the hypervisor
> 
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
> 
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
> 
> >  - Pro: More pure solution
> 
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.

Yea.

> 
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
> 
> 

Yea, I wondered about that too. I imagine the thinking was that since it is only
controlling shared memory, it can be untrusted.

And I guess the solution in this patchset is hypothetically a bit more locked
down in that respect.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-05  3:51         ` Edgecombe, Rick P
@ 2025-02-05  7:49           ` Xu, Min M
  0 siblings, 0 replies; 12+ messages in thread
From: Xu, Min M @ 2025-02-05  7:49 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, Wu, Binbin
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com,
	linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	hpa@zytor.com, vkuznets@redhat.com, bp@alien8.de, jgross@suse.com,
	x86@kernel.org, pgonda@google.com, Xu, Min M, Yao, Jiewen

On Wednesday, February 5, 2025 11:51 AM, Edgecombe, Rick P wrote:
> 
> +Min, can you comment?
> 
> 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE
> in
> TD-Guest") turned out to be problematic in practice.
> 
As the commit(3a3b12cbda) message mentioned that "For Linux kernel, there is a mechanism called SW defined MTRR introduced  by the patch https://lore.kernel.org/all/20230502120931.20719-4-jgross@suse.com/". 
From the discussion in below thread it seems the patch (SW defined MTR in kernel) has not been introduced into kernel master branch, right?
We need some time to investigate it and will give an update here. 
> Full thread:
> https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/
> 

Thanks
Min

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-04  0:27       ` Sean Christopherson
  2025-02-05  3:51         ` Edgecombe, Rick P
@ 2025-02-10 15:29         ` Binbin Wu
  1 sibling, 0 replies; 12+ messages in thread
From: Binbin Wu @ 2025-02-10 15:29 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe, Xu, Min M
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Binbin Wu,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org,
	jiewen.yao, Binbin Wu



On 2/4/2025 8:27 AM, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
>>>> Since there is no upstream KVM TDX support yet, why isn't it an option to
>>>> still
>>>> revert the EDKII commit too? It was a relatively recent change.
>>> I'm fine with that route too, but it too is a band-aid.  Relying on the
>>> *untrusted*
>>> hypervisor to essentially communicate memory maps is not a winning strategy.
>>>
>>>> To me it seems that the normal KVM MTRR support is not ideal, because it is
>>>> still lying about what it is doing. For example, in the past there was an
>>>> attempt to use UC to prevent speculative execution accesses to sensitive
>>>> data.
>>>> The KVM MTRR support only happens to work with existing guests, but not all
>>>> possible MTRR usages.
>>>>
>>>> Since diverging from the architecture creates loose ends like that, we could
>>>> instead define some other way for EDKII to communicate the ranges to the
>>>> kernel.
>>>> Like some simple KVM PV MSRs that are for communication only, and not
>>> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
>>> through the hypervisor to communicate information within the guest is asinine,
>>> especially for CoCo VMs.
>> Hmm, right.
>>
>> So the other options could be:
>>
>> 1. Some TDX module feature to hold the ranges:
>>   - Con: Not shared with AMD
>>
>> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.
IIUC, 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest") was added
to avoid changing the setting of MTRR, which will trigger #VE by setting
CR0.CD=1.

And recently, 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always
return FALSE in TD-Guest") was added to avoid touching MTRR MSRs at all,
so that the MTRR MSRs support for TDX guests was dropped as described in
[PATCH 00/18] KVM: TDX: TDX "the rest" part [1].

If we want to revert the two commits, we need to:
1. Make sure that OVMF will not set CR0.CD to 1 for TDX guests, probably
    needs some kind of hack in OVMF.

2. Need to bring back the support of MTRR MSRs in KVM for TDX guests.
    TDX KVM basic support patch set v19 and earlier versions enforce default
    MTRR type as WB and disabled fixed/variable MTRR (by reporting MSR_MTRRcap
    as 0) for TDX guests, which was kind of half support and needed
    "clearcpuid=mtrr" to avoid toggling of CR0.CD.

    If we really want to rely on the OVMF to program the MTRR values, maybe we
    can treat MTRR MSRs for TDX guests as normal VMs, i.e., allow guests to
    read/write the values without any further virtualization.
    Of course, it needs to prompt the guest kernel to skip toggling CR0.CD for
    TDX guests.

[1] https://lore.kernel.org/kvm/20241210004946.3718496-1-binbin.wu@linux.intel.com


>    Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
>
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.

Oh, yea.

And if we drop the MTRR MSRs access in KVM for TDX guests, an old guest kernel
without commit 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX") will require "clearcpuid=mtrr". :(

>   
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
>
>>   - Con: Creating more half support, when it's technically not required
>>   - Con: Still bouncing through the hypervisor
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
>
>     - Con: Doesn't address the performance issue with TDX guests "using" UC
>            memory by default (unless there's yet more enabled).
>
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?
>
>>   - Pro: Design and code is clear
>>
>> 3. Create some new architectural definition, like a bit that means "MTRRs don't
>> actually work:
>>   - Con: Takes a long time, need to get agreement
>>   - Con: Still bouncing through the hypervisor
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
>
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
>
>>   - Pro: More pure solution
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.
>
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
>
>> 4. Do this series:
>>   - Pro: Looks ok to me
It looks OK to me too.
But as mentioned above, mismatch of OVMF, guest kernel, host kernel
version will cause problem.

>>   - Cons: As explained in the patches, it's a bit hacky.
Skipping toggling CR0.CD in guest kernel seems also a bit hacky.

>>   - Cons: Could there be more cases than the legacy PCI hole?
>>
>> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
>> ones if we want to resolve this soon.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-02-03 18:14 ` Edgecombe, Rick P
@ 2025-07-08 14:24 ` Nikolay Borisov
  4 siblings, 0 replies; 12+ messages in thread
From: Nikolay Borisov @ 2025-07-08 14:24 UTC (permalink / raw)
  To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Paolo Bonzini
  Cc: linux-kernel, kvm, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky



On 2/1/25 02:50, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Sean Christopherson (2):
>    x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>    x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>      WB
> 
>   arch/x86/include/asm/mtrr.h        |  5 +++--
>   arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>   arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>   3 files changed, 39 insertions(+), 8 deletions(-)
> 
> 
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad


This prevents TPM from functioning which in turn breaks attestation on 
TDX enabled guests. So what's the status of it?

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-07-08 14:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-01  0:50 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Sean Christopherson
2025-02-01  0:50 ` [PATCH 1/2] x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state() Sean Christopherson
2025-02-01  0:50 ` [PATCH 2/2] x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced WB Sean Christopherson
2025-02-01 14:25 ` [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Dionna Amalie Glaze
2025-02-03 18:14 ` Edgecombe, Rick P
2025-02-03 20:33   ` Sean Christopherson
2025-02-03 23:01     ` Edgecombe, Rick P
2025-02-04  0:27       ` Sean Christopherson
2025-02-05  3:51         ` Edgecombe, Rick P
2025-02-05  7:49           ` Xu, Min M
2025-02-10 15:29         ` Binbin Wu
2025-07-08 14:24 ` Nikolay Borisov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).