[PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT

rcu.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
@ 2024-03-09  1:09 Sean Christopherson
  2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
                   ` (7 more replies)
  0 siblings, 8 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

First, rip out KVM's support for virtualizing guest MTRRs on VMX.  The
code is costly to main, a drag on guest boot performance, imperfect, and
not required for functional correctness with modern guest kernels.  Many
details in patch 1's changelog.

With MTRR virtualization gone, always honor guest PAT on Intel CPUs that
support self-snoop, as such CPUs are guaranteed to maintain coherency
even if the guest is aliasing memtypes, e.g. if the host is using WB but
the guest is using WC.  Honoring guest PAT is desirable for use cases
where the guest must use WC when accessing memory that is DMA'd from a
non-coherent device that does NOT bounce through VFIO, e.g. for mediated
virtual GPUs.

The SRCU patch adds an API that is effectively documentation for the
memory barrier in srcu_read_lock().  Intel CPUs with self-snoop require
a memory barrier after VM-Exit to ensure coherency, and KVM always does
a srcu_read_lock() before reading guest memory after VM-Exit.  Relying
on SRCU to provide the barrier allows KVM to avoid emitting a redundant
barrier of its own.

This series needs a _lot_ more testing; I arguably should have tagged it
RFC, but I'm feeling lucky.

Sean Christopherson (3):
  KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
  KVM: VMX: Always honor guest PAT on CPUs that support self-snoop

Yan Zhao (2):
  srcu: Add an API for a memory barrier after SRCU read lock
  KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path

 Documentation/virt/kvm/api.rst        |   6 +-
 Documentation/virt/kvm/x86/errata.rst |  18 +
 arch/x86/include/asm/kvm_host.h       |  15 +-
 arch/x86/kvm/mmu.h                    |   7 +-
 arch/x86/kvm/mmu/mmu.c                |  35 +-
 arch/x86/kvm/mtrr.c                   | 644 ++------------------------
 arch/x86/kvm/vmx/vmx.c                |  40 +-
 arch/x86/kvm/x86.c                    |  24 +-
 arch/x86/kvm/x86.h                    |   4 -
 include/linux/srcu.h                  |  14 +
 10 files changed, 105 insertions(+), 702 deletions(-)

base-commit: 964d0c614c7f71917305a5afdca9178fe8231434
-- 
2.44.0.278.ge034bb2e1d-goog

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
@ 2024-03-09  1:09 ` Sean Christopherson
  2024-03-11  7:44   ` Yan Zhao
  2024-03-12  1:10   ` Dongli Zhang
  2024-03-09  1:09 ` [PATCH 2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 Sean Christopherson
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
adds no value, negatively impacts guest performance, and is a maintenance
burden due to it's complexity and oddities.

KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
that may perform non-coherent DMA access.  From a hardware virtualization
perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
shadowing paging doesn't magically account for guest MTRRs, nor does NPT.

Unwinding and deciphering KVM's murky history, the MTRR virtualization
code appears to be the result of misdiagnosed issues when EPT + VT-d with
passthrough devices was enabled years and years ago.  And importantly, the
underlying bugs that were fudged around by honoring guest MTRR memtypes
have since been fixed (though rather poorly in some cases).

The zapping GFNs logic in the MTRR virtualization code came from:

  commit efdfe536d8c643391e19d5726b072f82964bfbdb
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update

    Currently, whenever guest MTRR registers are changed
    kvm_mmu_reset_context is called to switch to the new root shadow page
    table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that
       the original root shadow page will be reused

    2) the cache type is set on the last spte, that means we should sync
       the last sptes when MTRR is changed

    This patch fixs this issue by drop all the spte in the gfn range which
    is being updated by MTRR

which was a fix for:

  commit 0bed3b568b68e5835ef5da888a372b9beabf7544
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200

      KVM: Improve MTRR structure

      As well as reset mmu context when set MTRR.

which was part of a "MTRR/PAT support for EPT" series that also added:

+       if (mt_mask) {
+               mt_mask = get_memory_type(vcpu, gfn) <<
+                         kvm_x86_ops->get_mt_mask_shift();
+               spte |= mt_mask;
+       }

where get_memory_type() was a truly gnarly helper to retrieve the guest
MTRR memtype for a given memtype.  And *very* subtly, at the time of that
change, KVM *always* set VMX_EPT_IGMT_BIT,

        kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
                VMX_EPT_WRITABLE_MASK |
                VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
                VMX_EPT_IGMT_BIT);

which came in via:

  commit 928d4bf747e9c290b690ff515d8f81e8ee226d97
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200

      KVM: VMX: Set IGMT bit in EPT entry

      There is a potential issue that, when guest using pagetable without vmexit when
      EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
      which would be inconsistent with host side and would cause host MCE due to
      inconsistent cache attribute.

      The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
      memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added
the whole "ignoreIGMT things as a bug fix for issues that were detected
during EPT + VT-d + passthrough enabling, but it was applied earlier
because it was a generic fix.

Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other
relevant code, or rather lack thereof, is the handling of *host* MMIO.
That fix came in a bit later, but given the author and timing, it's safe
to say it was all part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee864f4fb6402638dd2f263324ac839f
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT

    Software are not allow to access device MMIO using cacheable memory type, the
    patch limit MMIO region with UC and WC(guest can select WC using PAT and
    PCD/PWT).

In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization
was obviously never tested on NPT until much later, which lends further
credence to the theory/argument that this was all the result of
misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that
Sheng Yang was trying to resolve issues with passthrough MMIO.

 * Sheng Yang
  : Do you mean host(qemu) would access this memory and if we set it to guest
  : MTRR, host access would be broken? We would cover this in our shadow MTRR
  : patch, for we encountered this in video ram when doing some experiment with
  : VGA assignment.

And in the same thread, there's also what appears to be confirmation of
Intel running into issues with Windows XP related to a guest device driver
mapping DMA with WC in the PAT.

 * Avi Kavity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer,
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned
  : > memory region memory type follow guest's?
  : >
  :
  : With ept/npt, we can't, since the memory type is in the guest's
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what likely happened is that 15 years ago, a few
engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially
"fixed" passthrough device MMIO by emulating *guest* MTRRs.  Except for
the below case, everything since then has been a result of those two
intertwined changes.

The one exception, which is actually yet more confirmation of all of the
above, is the revert of Paolo's attempt at "full" virtualization of guest
MTRRs:

  commit 606decd67049217684e3cb5a54104d51ddd4ef35
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"

    This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5.
    It was reported to cause Machine Check Exceptions (bug 104091).

...

  commit fd717f11015f673487ffc826e59b2bad69d20fe5
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages

    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
    However, the guest could prefer a different page type than UC for
    such pages. A good example is that pass-throughed VGA frame buffer is
    not always UC as host expected.

    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC
in EPT" and got the same result: machine checks, likely due to the guest
MTRRs not being trustworthy/sane at all times.

Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that
too got reverted.  Unfortunately, it doesn't appear that anyone ever found
a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused
extremely slow boot times doesn't appear to have a definitive root cause.

  commit fc07e76ac7ffa3afd621a1c3858a503386a14281
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"

    This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
    Initializing the mapping from MTRR to PAT values was reported to
    fail nondeterministically, and it also caused extremely slow boot
    (due to caching getting disabled---bug 103321) with assigned devices.

...

  commit 3c2e7f7de3240216042b61073803b61b9b3cfb22
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes

    Right now, NPT page attributes are not used, and the final page
    attribute depends solely on gPAT (which however is not synced
    correctly), the guest MTRRs and the guest page attributes.

    However, we can do better by mimicking what is done for VMX.
    In the absence of PCI passthrough, the guest PAT can be ignored
    and the page attributes can be just WB.  If passthrough is being
    used, instead, keep respecting the guest PAT, and emulate the guest
    MTRRs through the PAT field of the nested page tables.

    The only snag is that WP memory cannot be emulated correctly,
    because Linux's default PAT setting only includes the other types.

In short, honoring guest MTRRs for VMX was initially a workaround of
sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host
MMIO.  And while there *are* known cases where honoring guest MTRRs is
desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for
performance, not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and
shadow paging proves that, while KVM theoretically can do better, it's
by no means necessary for correctnesss.

Lastly, since kernels mostly rely on firmware to do MTRR setup, and the
host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards.  I.e. it
would be far better for host userspace to communicate its desired memtype
directly to KVM (or perhaps indirectly via VMAs in the host kernel), not
through guest MTRRs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/x86/errata.rst |   7 +
 arch/x86/include/asm/kvm_host.h       |  15 +-
 arch/x86/kvm/mmu.h                    |   7 +-
 arch/x86/kvm/mmu/mmu.c                |  33 +-
 arch/x86/kvm/mtrr.c                   | 644 ++------------------------
 arch/x86/kvm/vmx/vmx.c                |  38 +-
 arch/x86/kvm/x86.c                    |  16 +-
 arch/x86/kvm/x86.h                    |   4 -
 8 files changed, 69 insertions(+), 695 deletions(-)

diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
index 49a05f24747b..1b70bad7325e 100644
--- a/Documentation/virt/kvm/x86/errata.rst
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -48,3 +48,10 @@ have the same physical APIC ID, KVM will deliver events targeting that APIC ID
 only to the vCPU with the lowest vCPU ID.  If KVM_X2APIC_API_USE_32BIT_IDS is
 not enabled, KVM follows x86 architecture when processing interrupts (all vCPUs
 matching the target APIC ID receive the interrupt).
+
+MTRRs
+-----
+KVM does not virtualization guest MTRR memory types.  KVM emulates accesses to
+MTRR MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does
+not honor guest MTRRs when determining the effective memory type, and instead
+treats all of guest memory as having Writeback (WB) MTRRs.
\ No newline at end of file
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e7b1a00e265..595710e309b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -159,7 +159,6 @@
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 256
-#define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
 
 #define ASYNC_PF_PER_VCPU 64
@@ -601,18 +600,12 @@ enum {
 	KVM_DEBUGREG_WONT_EXIT = 2,
 };
 
-struct kvm_mtrr_range {
-	u64 base;
-	u64 mask;
-	struct list_head node;
-};
-
 struct kvm_mtrr {
-	struct kvm_mtrr_range var_ranges[KVM_NR_VAR_MTRR];
-	mtrr_type fixed_ranges[KVM_NR_FIXED_MTRR_REGION];
+	u64 var[KVM_NR_VAR_MTRR * 2];
+	u64 fixed_64k;
+	u64 fixed_16k[2];
+	u64 fixed_4k[8];
 	u64 deftype;
-
-	struct list_head head;
 };
 
 /* Hyper-V SynIC timer */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..c8028eb2cf48 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -245,12 +245,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
-bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma);
-
-static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
-{
-	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
-}
+bool kvm_mmu_may_ignore_guest_pat(void);
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..403cd8f914cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4619,38 +4619,21 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 }
 #endif
 
-bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
+bool kvm_mmu_may_ignore_guest_pat(void)
 {
 	/*
-	 * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
-	 * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
-	 * to honor the memtype from the guest's MTRRs so that guest accesses
-	 * to memory that is DMA'd aren't cached against the guest's wishes.
-	 *
-	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
-	 * e.g. KVM will force UC memtype for host MMIO.
+	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
+	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
+	 * honor the memtype from the guest's PAT so that guest accesses to
+	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
+	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
+	 * KVM _always_ ignores guest PAT (when EPT is enabled).
 	 */
-	return vm_has_noncoherent_dma && shadow_memtype_mask;
+	return shadow_memtype_mask;
 }
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
-	/*
-	 * If the guest's MTRRs may be used to compute the "real" memtype,
-	 * restrict the mapping level to ensure KVM uses a consistent memtype
-	 * across the entire mapping.
-	 */
-	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
-		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
-			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
-			gfn_t base = gfn_round_for_level(fault->gfn,
-							 fault->max_level);
-
-			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
-				break;
-		}
-	}
-
 #ifdef CONFIG_X86_64
 	if (tdp_mmu_enabled)
 		return kvm_tdp_mmu_page_fault(vcpu, fault);
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index a67c28a56417..05490b9d8a43 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -19,33 +19,21 @@
 #include <asm/mtrr.h>
 
 #include "cpuid.h"
-#include "mmu.h"
 
-#define IA32_MTRR_DEF_TYPE_E		(1ULL << 11)
-#define IA32_MTRR_DEF_TYPE_FE		(1ULL << 10)
-#define IA32_MTRR_DEF_TYPE_TYPE_MASK	(0xff)
-
-static bool is_mtrr_base_msr(unsigned int msr)
-{
-	/* MTRR base MSRs use even numbers, masks use odd numbers. */
-	return !(msr & 0x1);
-}
-
-static struct kvm_mtrr_range *var_mtrr_msr_to_range(struct kvm_vcpu *vcpu,
-						    unsigned int msr)
+static u64 *find_mtrr(struct kvm_vcpu *vcpu, unsigned int msr)
 {
-	int index = (msr - MTRRphysBase_MSR(0)) / 2;
+	int index;
 
-	return &vcpu->arch.mtrr_state.var_ranges[index];
-}
-
-static bool msr_mtrr_valid(unsigned msr)
-{
 	switch (msr) {
 	case MTRRphysBase_MSR(0) ... MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1):
+		index = msr - MTRRphysBase_MSR(0);
+		return &vcpu->arch.mtrr_state.var[index];
 	case MSR_MTRRfix64K_00000:
+		return &vcpu->arch.mtrr_state.fixed_64k;
 	case MSR_MTRRfix16K_80000:
 	case MSR_MTRRfix16K_A0000:
+		index = msr - MSR_MTRRfix16K_80000;
+		return &vcpu->arch.mtrr_state.fixed_16k[index];
 	case MSR_MTRRfix4K_C0000:
 	case MSR_MTRRfix4K_C8000:
 	case MSR_MTRRfix4K_D0000:
@@ -54,10 +42,14 @@ static bool msr_mtrr_valid(unsigned msr)
 	case MSR_MTRRfix4K_E8000:
 	case MSR_MTRRfix4K_F0000:
 	case MSR_MTRRfix4K_F8000:
+		index = msr - MSR_MTRRfix4K_C0000;
+		return &vcpu->arch.mtrr_state.fixed_4k[index];
 	case MSR_MTRRdefType:
-		return true;
+		return &vcpu->arch.mtrr_state.deftype;
+	default:
+		break;
 	}
-	return false;
+	return NULL;
 }
 
 static bool valid_mtrr_type(unsigned t)
@@ -70,9 +62,6 @@ static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	int i;
 	u64 mask;
 
-	if (!msr_mtrr_valid(msr))
-		return false;
-
 	if (msr == MSR_MTRRdefType) {
 		if (data & ~0xcff)
 			return false;
@@ -85,8 +74,9 @@ static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	}
 
 	/* variable MTRRs */
-	WARN_ON(!(msr >= MTRRphysBase_MSR(0) &&
-		  msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1)));
+	if (WARN_ON_ONCE(!(msr >= MTRRphysBase_MSR(0) &&
+			   msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1))))
+		return false;
 
 	mask = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
 	if ((msr & 1) == 0) {
@@ -94,309 +84,32 @@ static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (!valid_mtrr_type(data & 0xff))
 			return false;
 		mask |= 0xf00;
-	} else
+	} else {
 		/* MTRR mask */
 		mask |= 0x7ff;
+	}
 
 	return (data & mask) == 0;
 }
 
-static bool mtrr_is_enabled(struct kvm_mtrr *mtrr_state)
-{
-	return !!(mtrr_state->deftype & IA32_MTRR_DEF_TYPE_E);
-}
-
-static bool fixed_mtrr_is_enabled(struct kvm_mtrr *mtrr_state)
-{
-	return !!(mtrr_state->deftype & IA32_MTRR_DEF_TYPE_FE);
-}
-
-static u8 mtrr_default_type(struct kvm_mtrr *mtrr_state)
-{
-	return mtrr_state->deftype & IA32_MTRR_DEF_TYPE_TYPE_MASK;
-}
-
-static u8 mtrr_disabled_type(struct kvm_vcpu *vcpu)
-{
-	/*
-	 * Intel SDM 11.11.2.2: all MTRRs are disabled when
-	 * IA32_MTRR_DEF_TYPE.E bit is cleared, and the UC
-	 * memory type is applied to all of physical memory.
-	 *
-	 * However, virtual machines can be run with CPUID such that
-	 * there are no MTRRs.  In that case, the firmware will never
-	 * enable MTRRs and it is obviously undesirable to run the
-	 * guest entirely with UC memory and we use WB.
-	 */
-	if (guest_cpuid_has(vcpu, X86_FEATURE_MTRR))
-		return MTRR_TYPE_UNCACHABLE;
-	else
-		return MTRR_TYPE_WRBACK;
-}
-
-/*
-* Three terms are used in the following code:
-* - segment, it indicates the address segments covered by fixed MTRRs.
-* - unit, it corresponds to the MSR entry in the segment.
-* - range, a range is covered in one memory cache type.
-*/
-struct fixed_mtrr_segment {
-	u64 start;
-	u64 end;
-
-	int range_shift;
-
-	/* the start position in kvm_mtrr.fixed_ranges[]. */
-	int range_start;
-};
-
-static struct fixed_mtrr_segment fixed_seg_table[] = {
-	/* MSR_MTRRfix64K_00000, 1 unit. 64K fixed mtrr. */
-	{
-		.start = 0x0,
-		.end = 0x80000,
-		.range_shift = 16, /* 64K */
-		.range_start = 0,
-	},
-
-	/*
-	 * MSR_MTRRfix16K_80000 ... MSR_MTRRfix16K_A0000, 2 units,
-	 * 16K fixed mtrr.
-	 */
-	{
-		.start = 0x80000,
-		.end = 0xc0000,
-		.range_shift = 14, /* 16K */
-		.range_start = 8,
-	},
-
-	/*
-	 * MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000, 8 units,
-	 * 4K fixed mtrr.
-	 */
-	{
-		.start = 0xc0000,
-		.end = 0x100000,
-		.range_shift = 12, /* 12K */
-		.range_start = 24,
-	}
-};
-
-/*
- * The size of unit is covered in one MSR, one MSR entry contains
- * 8 ranges so that unit size is always 8 * 2^range_shift.
- */
-static u64 fixed_mtrr_seg_unit_size(int seg)
-{
-	return 8 << fixed_seg_table[seg].range_shift;
-}
-
-static bool fixed_msr_to_seg_unit(u32 msr, int *seg, int *unit)
-{
-	switch (msr) {
-	case MSR_MTRRfix64K_00000:
-		*seg = 0;
-		*unit = 0;
-		break;
-	case MSR_MTRRfix16K_80000 ... MSR_MTRRfix16K_A0000:
-		*seg = 1;
-		*unit = array_index_nospec(
-			msr - MSR_MTRRfix16K_80000,
-			MSR_MTRRfix16K_A0000 - MSR_MTRRfix16K_80000 + 1);
-		break;
-	case MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000:
-		*seg = 2;
-		*unit = array_index_nospec(
-			msr - MSR_MTRRfix4K_C0000,
-			MSR_MTRRfix4K_F8000 - MSR_MTRRfix4K_C0000 + 1);
-		break;
-	default:
-		return false;
-	}
-
-	return true;
-}
-
-static void fixed_mtrr_seg_unit_range(int seg, int unit, u64 *start, u64 *end)
-{
-	struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];
-	u64 unit_size = fixed_mtrr_seg_unit_size(seg);
-
-	*start = mtrr_seg->start + unit * unit_size;
-	*end = *start + unit_size;
-	WARN_ON(*end > mtrr_seg->end);
-}
-
-static int fixed_mtrr_seg_unit_range_index(int seg, int unit)
-{
-	struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];
-
-	WARN_ON(mtrr_seg->start + unit * fixed_mtrr_seg_unit_size(seg)
-		> mtrr_seg->end);
-
-	/* each unit has 8 ranges. */
-	return mtrr_seg->range_start + 8 * unit;
-}
-
-static int fixed_mtrr_seg_end_range_index(int seg)
-{
-	struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];
-	int n;
-
-	n = (mtrr_seg->end - mtrr_seg->start) >> mtrr_seg->range_shift;
-	return mtrr_seg->range_start + n - 1;
-}
-
-static bool fixed_msr_to_range(u32 msr, u64 *start, u64 *end)
-{
-	int seg, unit;
-
-	if (!fixed_msr_to_seg_unit(msr, &seg, &unit))
-		return false;
-
-	fixed_mtrr_seg_unit_range(seg, unit, start, end);
-	return true;
-}
-
-static int fixed_msr_to_range_index(u32 msr)
-{
-	int seg, unit;
-
-	if (!fixed_msr_to_seg_unit(msr, &seg, &unit))
-		return -1;
-
-	return fixed_mtrr_seg_unit_range_index(seg, unit);
-}
-
-static int fixed_mtrr_addr_to_seg(u64 addr)
-{
-	struct fixed_mtrr_segment *mtrr_seg;
-	int seg, seg_num = ARRAY_SIZE(fixed_seg_table);
-
-	for (seg = 0; seg < seg_num; seg++) {
-		mtrr_seg = &fixed_seg_table[seg];
-		if (mtrr_seg->start <= addr && addr < mtrr_seg->end)
-			return seg;
-	}
-
-	return -1;
-}
-
-static int fixed_mtrr_addr_seg_to_range_index(u64 addr, int seg)
-{
-	struct fixed_mtrr_segment *mtrr_seg;
-	int index;
-
-	mtrr_seg = &fixed_seg_table[seg];
-	index = mtrr_seg->range_start;
-	index += (addr - mtrr_seg->start) >> mtrr_seg->range_shift;
-	return index;
-}
-
-static u64 fixed_mtrr_range_end_addr(int seg, int index)
-{
-	struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];
-	int pos = index - mtrr_seg->range_start;
-
-	return mtrr_seg->start + ((pos + 1) << mtrr_seg->range_shift);
-}
-
-static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
-{
-	u64 mask;
-
-	*start = range->base & PAGE_MASK;
-
-	mask = range->mask & PAGE_MASK;
-
-	/* This cannot overflow because writing to the reserved bits of
-	 * variable MTRRs causes a #GP.
-	 */
-	*end = (*start | ~mask) + 1;
-}
-
-static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
-{
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
-	gfn_t start, end;
-
-	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
-		return;
-
-	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
-		return;
-
-	/* fixed MTRRs. */
-	if (fixed_msr_to_range(msr, &start, &end)) {
-		if (!fixed_mtrr_is_enabled(mtrr_state))
-			return;
-	} else if (msr == MSR_MTRRdefType) {
-		start = 0x0;
-		end = ~0ULL;
-	} else {
-		/* variable range MTRRs. */
-		var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
-	}
-
-	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
-}
-
-static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
-{
-	return (range->mask & (1 << 11)) != 0;
-}
-
-static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
-{
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
-	struct kvm_mtrr_range *tmp, *cur;
-
-	cur = var_mtrr_msr_to_range(vcpu, msr);
-
-	/* remove the entry if it's in the list. */
-	if (var_mtrr_range_is_valid(cur))
-		list_del(&cur->node);
-
-	/*
-	 * Set all illegal GPA bits in the mask, since those bits must
-	 * implicitly be 0.  The bits are then cleared when reading them.
-	 */
-	if (is_mtrr_base_msr(msr))
-		cur->base = data;
-	else
-		cur->mask = data | kvm_vcpu_reserved_gpa_bits_raw(vcpu);
-
-	/* add it to the list if it's enabled. */
-	if (var_mtrr_range_is_valid(cur)) {
-		list_for_each_entry(tmp, &mtrr_state->head, node)
-			if (cur->base >= tmp->base)
-				break;
-		list_add_tail(&cur->node, &tmp->node);
-	}
-}
-
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
-	int index;
+	u64 *mtrr;
+
+	mtrr = find_mtrr(vcpu, msr);
+	if (!mtrr)
+		return 1;
 
 	if (!kvm_mtrr_valid(vcpu, msr, data))
 		return 1;
 
-	index = fixed_msr_to_range_index(msr);
-	if (index >= 0)
-		*(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index] = data;
-	else if (msr == MSR_MTRRdefType)
-		vcpu->arch.mtrr_state.deftype = data;
-	else
-		set_var_mtrr_msr(vcpu, msr, data);
-
-	update_mtrr(vcpu, msr);
+	*mtrr = data;
 	return 0;
 }
 
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
-	int index;
+	u64 *mtrr;
 
 	/* MSR_MTRRcap is a readonly MSR. */
 	if (msr == MSR_MTRRcap) {
@@ -410,311 +123,10 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		return 0;
 	}
 
-	if (!msr_mtrr_valid(msr))
+	mtrr = find_mtrr(vcpu, msr);
+	if (!mtrr)
 		return 1;
 
-	index = fixed_msr_to_range_index(msr);
-	if (index >= 0) {
-		*pdata = *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index];
-	} else if (msr == MSR_MTRRdefType) {
-		*pdata = vcpu->arch.mtrr_state.deftype;
-	} else {
-		/* Variable MTRRs */
-		if (is_mtrr_base_msr(msr))
-			*pdata = var_mtrr_msr_to_range(vcpu, msr)->base;
-		else
-			*pdata = var_mtrr_msr_to_range(vcpu, msr)->mask;
-
-		*pdata &= ~kvm_vcpu_reserved_gpa_bits_raw(vcpu);
-	}
-
+	*pdata = *mtrr;
 	return 0;
 }
-
-void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
-{
-	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
-}
-
-struct mtrr_iter {
-	/* input fields. */
-	struct kvm_mtrr *mtrr_state;
-	u64 start;
-	u64 end;
-
-	/* output fields. */
-	int mem_type;
-	/* mtrr is completely disabled? */
-	bool mtrr_disabled;
-	/* [start, end) is not fully covered in MTRRs? */
-	bool partial_map;
-
-	/* private fields. */
-	union {
-		/* used for fixed MTRRs. */
-		struct {
-			int index;
-			int seg;
-		};
-
-		/* used for var MTRRs. */
-		struct {
-			struct kvm_mtrr_range *range;
-			/* max address has been covered in var MTRRs. */
-			u64 start_max;
-		};
-	};
-
-	bool fixed;
-};
-
-static bool mtrr_lookup_fixed_start(struct mtrr_iter *iter)
-{
-	int seg, index;
-
-	if (!fixed_mtrr_is_enabled(iter->mtrr_state))
-		return false;
-
-	seg = fixed_mtrr_addr_to_seg(iter->start);
-	if (seg < 0)
-		return false;
-
-	iter->fixed = true;
-	index = fixed_mtrr_addr_seg_to_range_index(iter->start, seg);
-	iter->index = index;
-	iter->seg = seg;
-	return true;
-}
-
-static bool match_var_range(struct mtrr_iter *iter,
-			    struct kvm_mtrr_range *range)
-{
-	u64 start, end;
-
-	var_mtrr_range(range, &start, &end);
-	if (!(start >= iter->end || end <= iter->start)) {
-		iter->range = range;
-
-		/*
-		 * the function is called when we do kvm_mtrr.head walking.
-		 * Range has the minimum base address which interleaves
-		 * [looker->start_max, looker->end).
-		 */
-		iter->partial_map |= iter->start_max < start;
-
-		/* update the max address has been covered. */
-		iter->start_max = max(iter->start_max, end);
-		return true;
-	}
-
-	return false;
-}
-
-static void __mtrr_lookup_var_next(struct mtrr_iter *iter)
-{
-	struct kvm_mtrr *mtrr_state = iter->mtrr_state;
-
-	list_for_each_entry_continue(iter->range, &mtrr_state->head, node)
-		if (match_var_range(iter, iter->range))
-			return;
-
-	iter->range = NULL;
-	iter->partial_map |= iter->start_max < iter->end;
-}
-
-static void mtrr_lookup_var_start(struct mtrr_iter *iter)
-{
-	struct kvm_mtrr *mtrr_state = iter->mtrr_state;
-
-	iter->fixed = false;
-	iter->start_max = iter->start;
-	iter->range = NULL;
-	iter->range = list_prepare_entry(iter->range, &mtrr_state->head, node);
-
-	__mtrr_lookup_var_next(iter);
-}
-
-static void mtrr_lookup_fixed_next(struct mtrr_iter *iter)
-{
-	/* terminate the lookup. */
-	if (fixed_mtrr_range_end_addr(iter->seg, iter->index) >= iter->end) {
-		iter->fixed = false;
-		iter->range = NULL;
-		return;
-	}
-
-	iter->index++;
-
-	/* have looked up for all fixed MTRRs. */
-	if (iter->index >= ARRAY_SIZE(iter->mtrr_state->fixed_ranges))
-		return mtrr_lookup_var_start(iter);
-
-	/* switch to next segment. */
-	if (iter->index > fixed_mtrr_seg_end_range_index(iter->seg))
-		iter->seg++;
-}
-
-static void mtrr_lookup_var_next(struct mtrr_iter *iter)
-{
-	__mtrr_lookup_var_next(iter);
-}
-
-static void mtrr_lookup_start(struct mtrr_iter *iter)
-{
-	if (!mtrr_is_enabled(iter->mtrr_state)) {
-		iter->mtrr_disabled = true;
-		return;
-	}
-
-	if (!mtrr_lookup_fixed_start(iter))
-		mtrr_lookup_var_start(iter);
-}
-
-static void mtrr_lookup_init(struct mtrr_iter *iter,
-			     struct kvm_mtrr *mtrr_state, u64 start, u64 end)
-{
-	iter->mtrr_state = mtrr_state;
-	iter->start = start;
-	iter->end = end;
-	iter->mtrr_disabled = false;
-	iter->partial_map = false;
-	iter->fixed = false;
-	iter->range = NULL;
-
-	mtrr_lookup_start(iter);
-}
-
-static bool mtrr_lookup_okay(struct mtrr_iter *iter)
-{
-	if (iter->fixed) {
-		iter->mem_type = iter->mtrr_state->fixed_ranges[iter->index];
-		return true;
-	}
-
-	if (iter->range) {
-		iter->mem_type = iter->range->base & 0xff;
-		return true;
-	}
-
-	return false;
-}
-
-static void mtrr_lookup_next(struct mtrr_iter *iter)
-{
-	if (iter->fixed)
-		mtrr_lookup_fixed_next(iter);
-	else
-		mtrr_lookup_var_next(iter);
-}
-
-#define mtrr_for_each_mem_type(_iter_, _mtrr_, _gpa_start_, _gpa_end_) \
-	for (mtrr_lookup_init(_iter_, _mtrr_, _gpa_start_, _gpa_end_); \
-	     mtrr_lookup_okay(_iter_); mtrr_lookup_next(_iter_))
-
-u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
-	struct mtrr_iter iter;
-	u64 start, end;
-	int type = -1;
-	const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
-			       | (1 << MTRR_TYPE_WRTHROUGH);
-
-	start = gfn_to_gpa(gfn);
-	end = start + PAGE_SIZE;
-
-	mtrr_for_each_mem_type(&iter, mtrr_state, start, end) {
-		int curr_type = iter.mem_type;
-
-		/*
-		 * Please refer to Intel SDM Volume 3: 11.11.4.1 MTRR
-		 * Precedences.
-		 */
-
-		if (type == -1) {
-			type = curr_type;
-			continue;
-		}
-
-		/*
-		 * If two or more variable memory ranges match and the
-		 * memory types are identical, then that memory type is
-		 * used.
-		 */
-		if (type == curr_type)
-			continue;
-
-		/*
-		 * If two or more variable memory ranges match and one of
-		 * the memory types is UC, the UC memory type used.
-		 */
-		if (curr_type == MTRR_TYPE_UNCACHABLE)
-			return MTRR_TYPE_UNCACHABLE;
-
-		/*
-		 * If two or more variable memory ranges match and the
-		 * memory types are WT and WB, the WT memory type is used.
-		 */
-		if (((1 << type) & wt_wb_mask) &&
-		      ((1 << curr_type) & wt_wb_mask)) {
-			type = MTRR_TYPE_WRTHROUGH;
-			continue;
-		}
-
-		/*
-		 * For overlaps not defined by the above rules, processor
-		 * behavior is undefined.
-		 */
-
-		/* We use WB for this undefined behavior. :( */
-		return MTRR_TYPE_WRBACK;
-	}
-
-	if (iter.mtrr_disabled)
-		return mtrr_disabled_type(vcpu);
-
-	/* not contained in any MTRRs. */
-	if (type == -1)
-		return mtrr_default_type(mtrr_state);
-
-	/*
-	 * We just check one page, partially covered by MTRRs is
-	 * impossible.
-	 */
-	WARN_ON(iter.partial_map);
-
-	return type;
-}
-EXPORT_SYMBOL_GPL(kvm_mtrr_get_guest_memory_type);
-
-bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
-					  int page_num)
-{
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
-	struct mtrr_iter iter;
-	u64 start, end;
-	int type = -1;
-
-	start = gfn_to_gpa(gfn);
-	end = gfn_to_gpa(gfn + page_num);
-	mtrr_for_each_mem_type(&iter, mtrr_state, start, end) {
-		if (type == -1) {
-			type = iter.mem_type;
-			continue;
-		}
-
-		if (type != iter.mem_type)
-			return false;
-	}
-
-	if (iter.mtrr_disabled)
-		return true;
-
-	if (!iter.partial_map)
-		return true;
-
-	if (type == -1)
-		return true;
-
-	return type == mtrr_default_type(mtrr_state);
-}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7a74388f9ecf..66bf79decdad 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7596,39 +7596,27 @@ static int vmx_vm_init(struct kvm *kvm)
 
 static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
-	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
-	 * memory aliases with conflicting memory types and sometimes MCEs.
-	 * We have to be careful as to what are honored and when.
-	 *
-	 * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
-	 * UC.  The effective memory type is UC or WC depending on guest PAT.
-	 * This was historically the source of MCEs and we want to be
-	 * conservative.
-	 *
-	 * When there is no need to deal with noncoherent DMA (e.g., no VT-d
-	 * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
-	 * EPT memory type is set to WB.  The effective memory type is forced
-	 * WB.
-	 *
-	 * Otherwise, we trust guest.  Guest CD/MTRR/PAT are all honored.  The
-	 * EPT memory type is used to emulate guest CD/MTRR.
+	/*
+	 * Force UC for host MMIO regions, as allowing the guest to access MMIO
+	 * with cacheable accesses will result in Machine Checks.
 	 */
-
 	if (is_mmio)
 		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+	/*
+	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
+	 * device attached.  Letting the guest control memory types on Intel
+	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
+	 * the guest to behave only as a last resort.
+	 */
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
-	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
-		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
-		else
-			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
-				VMX_EPT_IPAT_BIT;
-	}
+	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
+	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+		return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
-	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
+	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
 }
 
 static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 66c4381460dc..2a38b4c26d35 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -962,7 +962,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_mmu_reset_context(vcpu);
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
+	    kvm_mmu_may_ignore_guest_pat() &&
+	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
@@ -12155,7 +12156,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
-	kvm_vcpu_mtrr_init(vcpu);
 	vcpu_load(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
@@ -13425,13 +13425,13 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
 static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
 {
 	/*
-	 * Non-coherent DMA assignment and de-assignment will affect
-	 * whether KVM honors guest MTRRs and cause changes in memtypes
-	 * in TDP.
-	 * So, pass %true unconditionally to indicate non-coherent DMA was,
-	 * or will be involved, and that zapping SPTEs might be necessary.
+	 * Non-coherent DMA assignment and de-assignment may affect whether or
+	 * not KVM honors guest PAT, and thus may cause changes in EPT SPTEs
+	 * due to toggling the "ignore PAT" bit.  Zap all SPTEs when the first
+	 * (or last) non-coherent device is (un)registered to so that new SPTEs
+	 * with the correct "ignore guest PAT" setting are created.
 	 */
-	if (__kvm_mmu_honors_guest_mtrrs(true))
+	if (kvm_mmu_may_ignore_guest_pat())
 		kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a8b71803777b..2891a9bb0dd0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -309,12 +309,8 @@ int handle_ud(struct kvm_vcpu *vcpu);
 void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 				   struct kvm_queued_exception *ex);
 
-void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
-u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
-					  int page_num);
 bool kvm_vector_hashing_enabled(void);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
 int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
@ 2024-03-11  7:44   ` Yan Zhao
  2024-03-12  0:08     ` Sean Christopherson
  2024-03-12  1:10   ` Dongli Zhang
  1 sibling, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-11  7:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

On Fri, Mar 08, 2024 at 05:09:25PM -0800, Sean Christopherson wrote:
> index a67c28a56417..05490b9d8a43 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -19,33 +19,21 @@
>  #include <asm/mtrr.h>
>  
>  #include "cpuid.h"
"cpuid.h" is not required either.

> -#include "mmu.h"
>
...

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7a74388f9ecf..66bf79decdad 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7596,39 +7596,27 @@ static int vmx_vm_init(struct kvm *kvm)
>  
>  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
> -	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> -	 * memory aliases with conflicting memory types and sometimes MCEs.
> -	 * We have to be careful as to what are honored and when.
> -	 *
> -	 * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
> -	 * UC.  The effective memory type is UC or WC depending on guest PAT.
> -	 * This was historically the source of MCEs and we want to be
> -	 * conservative.
> -	 *
> -	 * When there is no need to deal with noncoherent DMA (e.g., no VT-d
> -	 * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
> -	 * EPT memory type is set to WB.  The effective memory type is forced
> -	 * WB.
> -	 *
> -	 * Otherwise, we trust guest.  Guest CD/MTRR/PAT are all honored.  The
> -	 * EPT memory type is used to emulate guest CD/MTRR.
> +	/*
> +	 * Force UC for host MMIO regions, as allowing the guest to access MMIO
> +	 * with cacheable accesses will result in Machine Checks.
This does not always force UC. If guest PAT is WC, the effecitve one is WC.

>  	 */
> -
>  	if (is_mmio)
>  		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>  
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-11  7:44   ` Yan Zhao
@ 2024-03-12  0:08     ` Sean Christopherson
  0 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-12  0:08 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

On Mon, Mar 11, 2024, Yan Zhao wrote:
> On Fri, Mar 08, 2024 at 05:09:25PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7a74388f9ecf..66bf79decdad 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7596,39 +7596,27 @@ static int vmx_vm_init(struct kvm *kvm)
> >  
> >  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  {
> > -	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > -	 * memory aliases with conflicting memory types and sometimes MCEs.
> > -	 * We have to be careful as to what are honored and when.
> > -	 *
> > -	 * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
> > -	 * UC.  The effective memory type is UC or WC depending on guest PAT.
> > -	 * This was historically the source of MCEs and we want to be
> > -	 * conservative.
> > -	 *
> > -	 * When there is no need to deal with noncoherent DMA (e.g., no VT-d
> > -	 * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
> > -	 * EPT memory type is set to WB.  The effective memory type is forced
> > -	 * WB.
> > -	 *
> > -	 * Otherwise, we trust guest.  Guest CD/MTRR/PAT are all honored.  The
> > -	 * EPT memory type is used to emulate guest CD/MTRR.
> > +	/*
> > +	 * Force UC for host MMIO regions, as allowing the guest to access MMIO
> > +	 * with cacheable accesses will result in Machine Checks.
> This does not always force UC. If guest PAT is WC, the effecitve one is WC.

Doh, right, I keep forgetting that KVM doesn't ignore guest PAT for MMIO.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
  2024-03-11  7:44   ` Yan Zhao
@ 2024-03-12  1:10   ` Dongli Zhang
  2024-03-12 17:08     ` Sean Christopherson
  1 sibling, 1 reply; 61+ messages in thread
From: Dongli Zhang @ 2024-03-12  1:10 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang



On 3/8/24 17:09, Sean Christopherson wrote:
> Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
> adds no value, negatively impacts guest performance, and is a maintenance
> burden due to it's complexity and oddities.
> 
> KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
> honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
> that may perform non-coherent DMA access.  From a hardware virtualization
> perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
> shadowing paging doesn't magically account for guest MTRRs, nor does NPT.

[snip]

>  
> -bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
> +bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
> -	 * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
> -	 * to honor the memtype from the guest's MTRRs so that guest accesses
> -	 * to memory that is DMA'd aren't cached against the guest's wishes.
> -	 *
> -	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
> -	 * e.g. KVM will force UC memtype for host MMIO.
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> +	 * honor the memtype from the guest's PAT so that guest accesses to
> +	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
> +	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> +	 * KVM _always_ ignores guest PAT (when EPT is enabled).
>  	 */
> -	return vm_has_noncoherent_dma && shadow_memtype_mask;
> +	return shadow_memtype_mask;
>  }
>  

Any special reason to use the naming 'may_ignore_guest_pat', but not
'may_honor_guest_pat'?

Since it is also controlled by other cases, e.g., kvm_arch_has_noncoherent_dma()
at vmx_get_mt_mask(), it can be 'may_honor_guest_pat' too?

Therefore, why not directly use 'shadow_memtype_mask' (without the API), or some
naming like "ept_enabled_for_hardware".


Even with the code from PATCH 5/5, we still have high chance that VM has
non-coherent DMA?

 bool kvm_mmu_may_ignore_guest_pat(void)
 {
 	/*
-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
+	 * not support self-snoop (or is affected by an erratum), and the VM
 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
 	 * honor the memtype from the guest's PAT so that guest accesses to
 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
+	 * bits in response to non-coherent device (un)registration.
 	 */
-	return shadow_memtype_mask;
+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
 }


Thank you very much!

Dongli Zhang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-12  1:10   ` Dongli Zhang
@ 2024-03-12 17:08     ` Sean Christopherson
  2024-03-14 10:31       ` Dongli Zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-03-12 17:08 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Mon, Mar 11, 2024, Dongli Zhang wrote:
> 
> 
> On 3/8/24 17:09, Sean Christopherson wrote:
> > Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
> > adds no value, negatively impacts guest performance, and is a maintenance
> > burden due to it's complexity and oddities.
> > 
> > KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
> > honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
> > that may perform non-coherent DMA access.  From a hardware virtualization
> > perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
> > shadowing paging doesn't magically account for guest MTRRs, nor does NPT.
> 
> [snip]
> 
> >  
> > -bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
> > +bool kvm_mmu_may_ignore_guest_pat(void)
> >  {
> >  	/*
> > -	 * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
> > -	 * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
> > -	 * to honor the memtype from the guest's MTRRs so that guest accesses
> > -	 * to memory that is DMA'd aren't cached against the guest's wishes.
> > -	 *
> > -	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
> > -	 * e.g. KVM will force UC memtype for host MMIO.
> > +	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> > +	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> > +	 * honor the memtype from the guest's PAT so that guest accesses to
> > +	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
> > +	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> > +	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> >  	 */
> > -	return vm_has_noncoherent_dma && shadow_memtype_mask;
> > +	return shadow_memtype_mask;
> >  }
> >  
> 
> Any special reason to use the naming 'may_ignore_guest_pat', but not
> 'may_honor_guest_pat'?

Because which (after this series) is would either be misleading or outright wrong.
If KVM returns true from the helper based solely on shadow_memtype_mask, then it's
misleading because KVM will *always* honors guest PAT for such CPUs.  I.e. that
name would yield this misleading statement.

  If the CPU supports self-snoop, KVM may honor guest PAT.

If KVM returns true iff self-snoop is NOT available (as proposed in this series),
then it's outright wrong as KVM would return false, i.e. would make this incorrect
statement:

  If the CPU supports self-snoop, KVM never honors guest PAT.

As saying that KVM may not or cannot do something is saying that KVM will never
do that thing.

And because the EPT flag is "ignore guest PAT", not "honor guest PAT", but that's
as much coincidence as it is anything else.

> Since it is also controlled by other cases, e.g., kvm_arch_has_noncoherent_dma()
> at vmx_get_mt_mask(), it can be 'may_honor_guest_pat' too?
> 
> Therefore, why not directly use 'shadow_memtype_mask' (without the API), or some
> naming like "ept_enabled_for_hardware".

Again, after this series, KVM will *always* honor guest PAT for CPUs with self-snoop,
i.e. KVM will *never* ignore guest PAT.  But for CPUs without self-snoop (or with
errata), KVM conditionally honors/ignores guest PAT.

> Even with the code from PATCH 5/5, we still have high chance that VM has
> non-coherent DMA?

I don't follow.  On CPUs with self-snoop, whether or not the VM has non-coherent
DMA (from VFIO!) is irrelevant.  If the CPU has self-snoop, then KVM can safely
honor guest PAT at all times.

>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
> 
> 
> Thank you very much!
> 
> Dongli Zhang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-12 17:08     ` Sean Christopherson
@ 2024-03-14 10:31       ` Dongli Zhang
  2024-03-14 14:47         ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Dongli Zhang @ 2024-03-14 10:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang



On 3/12/24 10:08, Sean Christopherson wrote:
> On Mon, Mar 11, 2024, Dongli Zhang wrote:
>>
>>
>> On 3/8/24 17:09, Sean Christopherson wrote:
>>> Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
>>> adds no value, negatively impacts guest performance, and is a maintenance
>>> burden due to it's complexity and oddities.
>>>
>>> KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
>>> honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
>>> that may perform non-coherent DMA access.  From a hardware virtualization
>>> perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
>>> shadowing paging doesn't magically account for guest MTRRs, nor does NPT.
>>
>> [snip]
>>
>>>  
>>> -bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
>>> +bool kvm_mmu_may_ignore_guest_pat(void)
>>>  {
>>>  	/*
>>> -	 * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
>>> -	 * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
>>> -	 * to honor the memtype from the guest's MTRRs so that guest accesses
>>> -	 * to memory that is DMA'd aren't cached against the guest's wishes.
>>> -	 *
>>> -	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
>>> -	 * e.g. KVM will force UC memtype for host MMIO.
>>> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
>>> +	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>>> +	 * honor the memtype from the guest's PAT so that guest accesses to
>>> +	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>>> +	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
>>> +	 * KVM _always_ ignores guest PAT (when EPT is enabled).
>>>  	 */
>>> -	return vm_has_noncoherent_dma && shadow_memtype_mask;
>>> +	return shadow_memtype_mask;
>>>  }
>>>  
>>
>> Any special reason to use the naming 'may_ignore_guest_pat', but not
>> 'may_honor_guest_pat'?
> 
> Because which (after this series) is would either be misleading or outright wrong.
> If KVM returns true from the helper based solely on shadow_memtype_mask, then it's
> misleading because KVM will *always* honors guest PAT for such CPUs.  I.e. that
> name would yield this misleading statement.
> 
>   If the CPU supports self-snoop, KVM may honor guest PAT.
> 
> If KVM returns true iff self-snoop is NOT available (as proposed in this series),
> then it's outright wrong as KVM would return false, i.e. would make this incorrect
> statement:
> 
>   If the CPU supports self-snoop, KVM never honors guest PAT.
> 
> As saying that KVM may not or cannot do something is saying that KVM will never
> do that thing.
> 
> And because the EPT flag is "ignore guest PAT", not "honor guest PAT", but that's
> as much coincidence as it is anything else.
> 
>> Since it is also controlled by other cases, e.g., kvm_arch_has_noncoherent_dma()
>> at vmx_get_mt_mask(), it can be 'may_honor_guest_pat' too?
>>
>> Therefore, why not directly use 'shadow_memtype_mask' (without the API), or some
>> naming like "ept_enabled_for_hardware".
> 
> Again, after this series, KVM will *always* honor guest PAT for CPUs with self-snoop,
> i.e. KVM will *never* ignore guest PAT.  But for CPUs without self-snoop (or with
> errata), KVM conditionally honors/ignores guest PAT.
> 
>> Even with the code from PATCH 5/5, we still have high chance that VM has
>> non-coherent DMA?
> 
> I don't follow.  On CPUs with self-snoop, whether or not the VM has non-coherent
> DMA (from VFIO!) is irrelevant.  If the CPU has self-snoop, then KVM can safely
> honor guest PAT at all times.


Thank you very much for the explanation.

According to my understanding of the explanation (after this series):

1. When static_cpu_has(X86_FEATURE_SELFSNOOP) == true, it is 100% to "honor
guest PAT".

2. When static_cpu_has(X86_FEATURE_SELFSNOOP) == false (and
shadow_memtype_mask), although only 50% chance (depending on where there is
non-coherent DMA), at least now it is NOT 100% (to honor guest PAT) any longer.

Due to the fact it is not 100% (to honor guest PAT) any longer, there starts the
trend (from 100% to 50%) to "ignore guest PAT", that is:
kvm_mmu_may_ignore_guest_pat().

Dongli Zhang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
  2024-03-14 10:31       ` Dongli Zhang
@ 2024-03-14 14:47         ` Sean Christopherson
  0 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-14 14:47 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Thu, Mar 14, 2024, Dongli Zhang wrote:
> On 3/12/24 10:08, Sean Christopherson wrote:
> > On Mon, Mar 11, 2024, Dongli Zhang wrote:
> >> Since it is also controlled by other cases, e.g., kvm_arch_has_noncoherent_dma()
> >> at vmx_get_mt_mask(), it can be 'may_honor_guest_pat' too?
> >>
> >> Therefore, why not directly use 'shadow_memtype_mask' (without the API), or some
> >> naming like "ept_enabled_for_hardware".
> > 
> > Again, after this series, KVM will *always* honor guest PAT for CPUs with self-snoop,
> > i.e. KVM will *never* ignore guest PAT.  But for CPUs without self-snoop (or with
> > errata), KVM conditionally honors/ignores guest PAT.
> > 
> >> Even with the code from PATCH 5/5, we still have high chance that VM has
> >> non-coherent DMA?
> > 
> > I don't follow.  On CPUs with self-snoop, whether or not the VM has non-coherent
> > DMA (from VFIO!) is irrelevant.  If the CPU has self-snoop, then KVM can safely
> > honor guest PAT at all times.
> 
> 
> Thank you very much for the explanation.
> 
> According to my understanding of the explanation (after this series):
> 
> 1. When static_cpu_has(X86_FEATURE_SELFSNOOP) == true, it is 100% to "honor
> guest PAT".

Yes.

> 2. When static_cpu_has(X86_FEATURE_SELFSNOOP) == false (and
> shadow_memtype_mask), although only 50% chance (depending on where there is
> non-coherent DMA), at least now it is NOT 100% (to honor guest PAT) any longer.

Yes, though I wouldn't assign a percent probability to the non-coherent DMA case.

> Due to the fact it is not 100% (to honor guest PAT) any longer, there starts the
> trend (from 100% to 50%) to "ignore guest PAT", that is:
> kvm_mmu_may_ignore_guest_pat().

Yep.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
  2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
@ 2024-03-09  1:09 ` Sean Christopherson
  2024-03-09  1:09 ` [PATCH 3/5] srcu: Add an API for a memory barrier after SRCU read lock Sean Christopherson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer
honors guest MTRR memtypes, as forcing UC memory for VMs with
non-coherent DMA only makes sense if the guest is using something other
than PAT to configure the memtype for the DMA region.

Furthermore, KVM has forced WB memory for CR0.CD=1 since commit
fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known
VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does
so with non-coherent DMA.

Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was
from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD
virtualization"), and followed by a mere month.  I.e. forcing UC memory
was likely the result of code inspection or perhaps misdiagnosed failures,
and not the necessitate by a concrete use case.

Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now
AMD-only, and to take an erratum for lack of CR0.CD virtualization on
Intel.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst        |  6 +++++-
 Documentation/virt/kvm/x86/errata.rst | 19 +++++++++++++++----
 arch/x86/kvm/vmx/vmx.c                |  4 ----
 arch/x86/kvm/x86.c                    |  6 ------
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..e85edd26ea5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7946,7 +7946,11 @@ The valid bits in cap.args[0] are:
                                     When this quirk is disabled, the reset value
                                     is 0x10000 (APIC_LVT_MASKED).
 
- KVM_X86_QUIRK_CD_NW_CLEARED        By default, KVM clears CR0.CD and CR0.NW.
+ KVM_X86_QUIRK_CD_NW_CLEARED        By default, KVM clears CR0.CD and CR0.NW on
+                                    AMD CPUs to workaround buggy guest firmware
+                                    that runs in perpetuity with CR0.CD, i.e.
+                                    with caches in "no fill" mode.
+
                                     When this quirk is disabled, KVM does not
                                     change the value of CR0.CD and CR0.NW.
 
diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
index 1b70bad7325e..4116045a8744 100644
--- a/Documentation/virt/kvm/x86/errata.rst
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -51,7 +51,18 @@ matching the target APIC ID receive the interrupt).
 
 MTRRs
 -----
-KVM does not virtualization guest MTRR memory types.  KVM emulates accesses to
-MTRR MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does
-not honor guest MTRRs when determining the effective memory type, and instead
-treats all of guest memory as having Writeback (WB) MTRRs.
\ No newline at end of file
+KVM does not virtualize guest MTRR memory types.  KVM emulates accesses to MTRR
+MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does not
+honor guest MTRRs when determining the effective memory type, and instead
+treats all of guest memory as having Writeback (WB) MTRRs.
+
+CR0.CD
+------
+KVM does not virtualize CR0.CD on Intel CPUs.  Similar to MTRR MSRs, KVM
+emulates CR0.CD accesses so that loads and stores from/to CR0 behave as
+expected, but setting CR0.CD=1 has no impact on the cachaeability of guest
+memory.
+
+Note, this erratum does not affect AMD CPUs, which fully virtualize CR0.CD in
+hardware, i.e. put the CPU caches into "no fill" mode when CR0.CD=1, even when
+running in the guest.
\ No newline at end of file
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 66bf79decdad..17a8e4fdf9c4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7612,10 +7612,6 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
-	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-		return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
-
 	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2a38b4c26d35..276ae56dd888 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -960,12 +960,6 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 
 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
-
-	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_mmu_may_ignore_guest_pat() &&
-	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 3/5] srcu: Add an API for a memory barrier after SRCU read lock
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
  2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
  2024-03-09  1:09 ` [PATCH 2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 Sean Christopherson
@ 2024-03-09  1:09 ` Sean Christopherson
  2024-03-09  1:09 ` [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Sean Christopherson
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

From: Yan Zhao <yan.y.zhao@intel.com>

To avoid redundant memory barriers, add smp_mb__after_srcu_read_lock() to
pair with smp_mb__after_srcu_read_unlock() for use in paths that need to
emit a memory barrier, but already do srcu_read_lock(), which includes a
full memory barrier.  Provide an API, e.g. as opposed to having callers
document the behavior via a comment, as the full memory barrier provided
by srcu_read_lock() is an implementation detail that shouldn't bleed into
random subsystems.

KVM will use smp_mb__after_srcu_read_lock() in it's VM-Exit path to ensure
a memory barrier is emitted, which is necessary to ensure correctness of
mixed memory types on CPUs that support self-snoop.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/srcu.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 236610e4a8fa..1cb4527076de 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -343,6 +343,20 @@ static inline void smp_mb__after_srcu_read_unlock(void)
 	/* __srcu_read_unlock has smp_mb() internally so nothing to do here. */
 }
 
+/**
+ * smp_mb__after_srcu_read_lock - ensure full ordering after srcu_read_lock
+ *
+ * Converts the preceding srcu_read_lock into a two-way memory barrier.
+ *
+ * Call this after srcu_read_lock, to guarantee that all memory operations
+ * that occur after smp_mb__after_srcu_read_lock will appear to happen after
+ * the preceding srcu_read_lock.
+ */
+static inline void smp_mb__after_srcu_read_lock(void)
+{
+	/* __srcu_read_lock has smp_mb() internally so nothing to do here. */
+}
+
 DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
 		    _T->idx = srcu_read_lock(_T->lock),
 		    srcu_read_unlock(_T->lock, _T->idx),
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
                   ` (2 preceding siblings ...)
  2024-03-09  1:09 ` [PATCH 3/5] srcu: Add an API for a memory barrier after SRCU read lock Sean Christopherson
@ 2024-03-09  1:09 ` Sean Christopherson
  2024-06-20 22:38   ` Paolo Bonzini
  2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

From: Yan Zhao <yan.y.zhao@intel.com>

Ensure a full memory barrier is emitted in the VM-Exit path, as a full
barrier is required on Intel CPUs to evict WC buffers.  This will allow
unconditionally honoring guest PAT on Intel CPUs that support self-snoop.

As srcu_read_lock() is always called in the VM-Exit path and it internally
has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a
second fence and make sure smp_mb() is called without dependency on
implementation details of srcu_read_lock().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 276ae56dd888..69e815df1699 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11082,6 +11082,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	kvm_vcpu_srcu_read_lock(vcpu);
 
+	/*
+	 * Call this to ensure WC buffers in guest are evicted after each VM
+	 * Exit, so that the evicted WC writes can be snooped across all cpus
+	 */
+	smp_mb__after_srcu_read_lock();
+
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
  2024-03-09  1:09 ` [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Sean Christopherson
@ 2024-06-20 22:38   ` Paolo Bonzini
  2024-06-20 23:42     ` Paul E. McKenney
  2024-06-21  0:52     ` Yan Zhao
  0 siblings, 2 replies; 61+ messages in thread
From: Paolo Bonzini @ 2024-06-20 22:38 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On 3/9/24 02:09, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> Ensure a full memory barrier is emitted in the VM-Exit path, as a full
> barrier is required on Intel CPUs to evict WC buffers.  This will allow
> unconditionally honoring guest PAT on Intel CPUs that support self-snoop.
> 
> As srcu_read_lock() is always called in the VM-Exit path and it internally
> has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a
> second fence and make sure smp_mb() is called without dependency on
> implementation details of srcu_read_lock().

Do you really need mfence or is a locked operation enough?  mfence is 
mb(), not smp_mb().

Paolo

> +	/*
> +	 * Call this to ensure WC buffers in guest are evicted after each VM
> +	 * Exit, so that the evicted WC writes can be snooped across all cpus
> +	 */
> +	smp_mb__after_srcu_read_lock();


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
  2024-06-20 22:38   ` Paolo Bonzini
@ 2024-06-20 23:42     ` Paul E. McKenney
  2024-06-21  0:52     ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Paul E. McKenney @ 2024-06-20 23:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Lai Jiangshan, Josh Triplett, kvm, rcu,
	linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Fri, Jun 21, 2024 at 12:38:21AM +0200, Paolo Bonzini wrote:
> On 3/9/24 02:09, Sean Christopherson wrote:
> > From: Yan Zhao <yan.y.zhao@intel.com>
> > 
> > Ensure a full memory barrier is emitted in the VM-Exit path, as a full
> > barrier is required on Intel CPUs to evict WC buffers.  This will allow
> > unconditionally honoring guest PAT on Intel CPUs that support self-snoop.
> > 
> > As srcu_read_lock() is always called in the VM-Exit path and it internally
> > has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a
> > second fence and make sure smp_mb() is called without dependency on
> > implementation details of srcu_read_lock().
> 
> Do you really need mfence or is a locked operation enough?  mfence is mb(),
> not smp_mb().

We only need smp_mb(), which is supplied by the srcu_read_lock()
function.  For now, anyway.  If we ever figure out how to get by with
lighter-weight ordering for srcu_read_lock(), then we will add an smp_mb()
to smp_mb__after_srcu_read_lock() to compensate.

							Thanx, Paul

> Paolo
> 
> > +	/*
> > +	 * Call this to ensure WC buffers in guest are evicted after each VM
> > +	 * Exit, so that the evicted WC writes can be snooped across all cpus
> > +	 */
> > +	smp_mb__after_srcu_read_lock();
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
  2024-06-20 22:38   ` Paolo Bonzini
  2024-06-20 23:42     ` Paul E. McKenney
@ 2024-06-21  0:52     ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2024-06-21  0:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

On Fri, Jun 21, 2024 at 12:38:21AM +0200, Paolo Bonzini wrote:
> On 3/9/24 02:09, Sean Christopherson wrote:
> > From: Yan Zhao <yan.y.zhao@intel.com>
> > 
> > Ensure a full memory barrier is emitted in the VM-Exit path, as a full
> > barrier is required on Intel CPUs to evict WC buffers.  This will allow
> > unconditionally honoring guest PAT on Intel CPUs that support self-snoop.
> > 
> > As srcu_read_lock() is always called in the VM-Exit path and it internally
> > has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a
> > second fence and make sure smp_mb() is called without dependency on
> > implementation details of srcu_read_lock().
> 
> Do you really need mfence or is a locked operation enough?  mfence is mb(),
> not smp_mb().
> 
A locked operation should be enough, since the barrier here is to evict
partially filled WC buffers.

"
If the WC buffer is partially filled, the writes may be delayed until the next
occurrence of a serializing event; such as an SFENCE or MFENCE instruction,
CPUID or other serializing instruction, a read or write to uncached memory, an
interrupt occurrence, or an execution of a LOCK instruction (including one with
an XACQUIRE or XRELEASE prefix).
"

> 
> > +	/*
> > +	 * Call this to ensure WC buffers in guest are evicted after each VM
> > +	 * Exit, so that the evicted WC writes can be snooped across all cpus
> > +	 */
> > +	smp_mb__after_srcu_read_lock();
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
                   ` (3 preceding siblings ...)
  2024-03-09  1:09 ` [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Sean Christopherson
@ 2024-03-09  1:09 ` Sean Christopherson
  2024-03-11  1:16   ` Yan Zhao
                     ` (2 more replies)
  2024-03-22  9:29 ` [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Ma, Yongwei
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-09  1:09 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

Unconditionally honor guest PAT on CPUs that support self-snoop, as
Intel has confirmed that CPUs that support self-snoop always snoop caches
and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
even in the presence of aliased memtypes, thus there is no need to trust
the guest behaves and only honor PAT as a last resort, as KVM does today.

Honoring guest PAT is desirable for use cases where the guest has access
to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
(mediated, for all intents and purposes) GPU is exposed to the guest, along
with buffers that are consumed directly by the physical GPU, i.e. which
can't be proxied by the host to ensure writes from the guest are performed
with the correct memory type for the GPU.

Cc: Yiwei Zhang <zzyiwei@google.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c |  8 +++++---
 arch/x86/kvm/vmx/vmx.c | 10 ++++++----
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 403cd8f914cd..7fa514830628 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 bool kvm_mmu_may_ignore_guest_pat(void)
 {
 	/*
-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
+	 * not support self-snoop (or is affected by an erratum), and the VM
 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
 	 * honor the memtype from the guest's PAT so that guest accesses to
 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
+	 * bits in response to non-coherent device (un)registration.
 	 */
-	return shadow_memtype_mask;
+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
 }
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 17a8e4fdf9c4..5dc4c24ae203 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	/*
 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
-	 * device attached.  Letting the guest control memory types on Intel
-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
-	 * the guest to behave only as a last resort.
+	 * device attached and the CPU doesn't support self-snoop.  Letting the
+	 * guest control memory types on Intel CPUs without self-snoop may
+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
+	 * trust the guest to behave only as a last resort.
 	 */
-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
@ 2024-03-11  1:16   ` Yan Zhao
  2024-03-12  0:25     ` Sean Christopherson
  2024-03-25  3:43   ` Chao Gao
  2024-08-30  9:35   ` Vitaly Kuznetsov
  2 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-11  1:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> Intel has confirmed that CPUs that support self-snoop always snoop caches
> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> even in the presence of aliased memtypes, thus there is no need to trust
> the guest behaves and only honor PAT as a last resort, as KVM does today.
> 
> Honoring guest PAT is desirable for use cases where the guest has access
> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> with buffers that are consumed directly by the physical GPU, i.e. which
> can't be proxied by the host to ensure writes from the guest are performed
> with the correct memory type for the GPU.
> 
> Cc: Yiwei Zhang <zzyiwei@google.com>
> Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 10 ++++++----
>  2 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 403cd8f914cd..7fa514830628 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 17a8e4fdf9c4..5dc4c24ae203 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> +	 * guest control memory types on Intel CPUs without self-snoop may
> +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> +	 * trust the guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
about unsafe before honoring guest memory type.
Though it's a KVM's historical ABI, it's not safe in the security perspective
because page aliasing without proper cache flush handling on CPUs without
self-snoop may open a door for guest to read uninitialized host data.
e.g. when there's a noncoherent DMA device attached, and if there's a memory
region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
skipped section), then though the guest memory from this memory region is not
accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
Then if vCPUs use WC as guest type, it may bypass host's initialization data in
cache and read stale data in host, causing information leak.

My preference is still to force WB
(i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
!static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
Firstly, it's because there're few CPUs with features VMX without self-snoop;
Secondly, security takes priority over functionality :)

>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> -- 
> 2.44.0.278.ge034bb2e1d-goog
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-11  1:16   ` Yan Zhao
@ 2024-03-12  0:25     ` Sean Christopherson
  2024-03-12  7:30       ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-03-12  0:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

On Mon, Mar 11, 2024, Yan Zhao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  
> >  	/*
> >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> > -	 * device attached.  Letting the guest control memory types on Intel
> > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> > -	 * the guest to behave only as a last resort.
> > +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> > +	 * guest control memory types on Intel CPUs without self-snoop may
> > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > +	 * trust the guest to behave only as a last resort.
> >  	 */
> > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> about unsafe before honoring guest memory type.

I don't think it gains us enough to offset the potential pain such a message
would bring.  Assuming the warning isn't outright ignored, the most likely scenario
is that the warning will cause random end users to worry that the setup they've
been running for years is broken, when in reality it's probably just fine for their
use case.

I would be quite surprised if there are people running untrusted workloads on
10+ year old silicon *and* have passthrough devices and non-coherent IOMMUs/DMA.
And anyone exposing a device directly to an untrusted workload really should have
done their homework.

And it's not like we're going to change KVM's historical behavior at this point.

> Though it's a KVM's historical ABI, it's not safe in the security perspective
> because page aliasing without proper cache flush handling on CPUs without
> self-snoop may open a door for guest to read uninitialized host data.
> e.g. when there's a noncoherent DMA device attached, and if there's a memory
> region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
> skipped section), then though the guest memory from this memory region is not
> accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
> Then if vCPUs use WC as guest type, it may bypass host's initialization data in
> cache and read stale data in host, causing information leak.
> 
> My preference is still to force WB
> (i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
> !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
> Firstly, it's because there're few CPUs with features VMX without self-snoop;

This is unfortunately not true.  I don't know the details, but apparently all
Intel CPUs before Ivy Bridge had a one or more related errata.

/*
 * Processors which have self-snooping capability can handle conflicting
 * memory type across CPUs by snooping its own cache. However, there exists
 * CPU models in which having conflicting memory types still leads to
 * unpredictable behavior, machine check errors, or hangs. Clear this
 * feature to prevent its use on machines with known erratas.
 */
static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
{
	switch (c->x86_model) {
	case INTEL_FAM6_CORE_YONAH:
	case INTEL_FAM6_CORE2_MEROM:
	case INTEL_FAM6_CORE2_MEROM_L:
	case INTEL_FAM6_CORE2_PENRYN:
	case INTEL_FAM6_CORE2_DUNNINGTON:
	case INTEL_FAM6_NEHALEM:
	case INTEL_FAM6_NEHALEM_G:
	case INTEL_FAM6_NEHALEM_EP:
	case INTEL_FAM6_NEHALEM_EX:
	case INTEL_FAM6_WESTMERE:
	case INTEL_FAM6_WESTMERE_EP:
	case INTEL_FAM6_SANDYBRIDGE:
		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
	}
}

> Secondly, security takes priority over functionality :)

Yeah, but not breaking userspace for setups that have existed for 10+ years takes
priority over all of that :-)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-12  0:25     ` Sean Christopherson
@ 2024-03-12  7:30       ` Tian, Kevin
  2024-03-12 16:07         ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2024-03-12  7:30 UTC (permalink / raw)
  To: Sean Christopherson, Zhao, Yan Y
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yiwei Zhang

> From: Sean Christopherson <seanjc@google.com>
> Sent: Tuesday, March 12, 2024 8:26 AM
> 
> On Mon, Mar 11, 2024, Yan Zhao wrote:
> > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu
> *vcpu, gfn_t gfn, bool is_mmio)
> > >
> > >  	/*
> > >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-
> coherent
> > > -	 * device attached.  Letting the guest control memory types on Intel
> > > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to
> trust
> > > -	 * the guest to behave only as a last resort.
> > > +	 * device attached and the CPU doesn't support self-snoop.  Letting
> the
> > > +	 * guest control memory types on Intel CPUs without self-snoop may
> > > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > > +	 * trust the guest to behave only as a last resort.
> > >  	 */
> > > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
> | VMX_EPT_IPAT_BIT;
> >
> > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should
> warn
> > about unsafe before honoring guest memory type.
> 
> I don't think it gains us enough to offset the potential pain such a message
> would bring.  Assuming the warning isn't outright ignored, the most likely
> scenario
> is that the warning will cause random end users to worry that the setup
> they've
> been running for years is broken, when in reality it's probably just fine for
> their
> use case.

Isn't the 'worry' necessary to allow end users evaluate whether "it's
probably just fine for their use case"?

I saw the old comment already mentioned that doing so may lead to
unexpected behaviors. But I'm not sure whether such code-level
caveat has been visible enough to end users.

> 
> I would be quite surprised if there are people running untrusted workloads
> on
> 10+ year old silicon *and* have passthrough devices and non-coherent
> IOMMUs/DMA.

this is probably true.

> And anyone exposing a device directly to an untrusted workload really
> should have
> done their homework.

or they run trusted workloads which might be tampered by virus to
exceed the scope of their homework. 😊

> 
> And it's not like we're going to change KVM's historical behavior at this point.

I agree with your point of not breaking userspace. But still think a warning
might be informative to let users evaluate their setup against a newly
identified "unexpected behavior"  which has security implication beyond
the guest, while the previous interpretation of "unexpected behavior" 
might be that the guest can at most shoot its own foot...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-12  7:30       ` Tian, Kevin
@ 2024-03-12 16:07         ` Sean Christopherson
  2024-03-13  1:18           ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-03-12 16:07 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Yan Y Zhao, Paolo Bonzini, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yiwei Zhang

On Tue, Mar 12, 2024, Kevin Tian wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > Sent: Tuesday, March 12, 2024 8:26 AM
> > 
> > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > about unsafe before honoring guest memory type.
> > 
> > I don't think it gains us enough to offset the potential pain such a
> > message would bring.  Assuming the warning isn't outright ignored, the most
> > likely scenario is that the warning will cause random end users to worry
> > that the setup they've been running for years is broken, when in reality
> > it's probably just fine for their
> > use case.
> 
> Isn't the 'worry' necessary to allow end users evaluate whether "it's
> probably just fine for their use case"?

Realistically, outside of large scale deployments, no end user is going to be able
to make that evaluation, because practically speaking it requires someone with
quite low-level hardware knowledge to be able to make that judgment call.  And
counting by number of human end users (as opposed to number of VMs being run), I
am willing to bet that the overwhelming majority of KVM users aren't kernel or
systems engineers.

Understandably, users tend to be alarmed by (or suspicious of) new warnings that
show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
an internal bug report filed against KVM because they observed a performance
regression when booting a KVM guest, and saw a new message about some CPU
vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.

In short, my concern is that adding a new pr_warn() will generate noise for end
users *and* for KVM developers/maintainers, because even if we phrase the message
to talk specifically about "untrusted workloads", the majority of affected users
will not have the necessary knowledge to make an informed decision.

[*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de

> I saw the old comment already mentioned that doing so may lead to unexpected
> behaviors. But I'm not sure whether such code-level caveat has been visible
> enough to end users.

Another point to consider: KVM is _always_ potentially broken on such CPUs, as
KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
exposed to the guest.

> > I would be quite surprised if there are people running untrusted workloads
> > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > IOMMUs/DMA.
> 
> this is probably true.
> 
> > And anyone exposing a device directly to an untrusted workload really
> > should have done their homework.
> 
> or they run trusted workloads which might be tampered by virus to
> exceed the scope of their homework. 😊

If a workload is being run in a KVM guest for host isolation/security purposes,
and a device was exposed to said workload, then I would firmly consider analyzing
the impact of a compromised guest to be part of their homework.

> > And it's not like we're going to change KVM's historical behavior at this point.
> 
> I agree with your point of not breaking userspace. But still think a warning
> might be informative to let users evaluate their setup against a newly
> identified "unexpected behavior"  which has security implication beyond
> the guest, while the previous interpretation of "unexpected behavior" 
> might be that the guest can at most shoot its own foot...

If this issue weren't limited to 10+ year old hardware, I would be more inclined
to add a message.  But at this point, realistically the only thing KVM would be
saying is "you're running old hardware, that might be unsafe in today's world".

For users that care about security, we'd be telling them something they already
know (and if they don't know, they've got bigger problems).  And for everyone
else, it'd be scary noise without any meaningful benefit.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-12 16:07         ` Sean Christopherson
@ 2024-03-13  1:18           ` Yan Zhao
  2024-03-13  8:52             ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-13  1:18 UTC (permalink / raw)
  To: Sean Christopherson, jgg
  Cc: Kevin Tian, Paolo Bonzini, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yiwei Zhang

On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > Sent: Tuesday, March 12, 2024 8:26 AM
> > > 
> > > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > > about unsafe before honoring guest memory type.
> > > 
> > > I don't think it gains us enough to offset the potential pain such a
> > > message would bring.  Assuming the warning isn't outright ignored, the most
> > > likely scenario is that the warning will cause random end users to worry
> > > that the setup they've been running for years is broken, when in reality
> > > it's probably just fine for their
> > > use case.
> > 
> > Isn't the 'worry' necessary to allow end users evaluate whether "it's
> > probably just fine for their use case"?
> 
> Realistically, outside of large scale deployments, no end user is going to be able
> to make that evaluation, because practically speaking it requires someone with
> quite low-level hardware knowledge to be able to make that judgment call.  And
> counting by number of human end users (as opposed to number of VMs being run), I
> am willing to bet that the overwhelming majority of KVM users aren't kernel or
> systems engineers.
> 
> Understandably, users tend to be alarmed by (or suspicious of) new warnings that
> show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
> an internal bug report filed against KVM because they observed a performance
> regression when booting a KVM guest, and saw a new message about some CPU
> vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.
> 
> In short, my concern is that adding a new pr_warn() will generate noise for end
> users *and* for KVM developers/maintainers, because even if we phrase the message
> to talk specifically about "untrusted workloads", the majority of affected users
> will not have the necessary knowledge to make an informed decision.
> 
> [*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de
> 
> > I saw the old comment already mentioned that doing so may lead to unexpected
> > behaviors. But I'm not sure whether such code-level caveat has been visible
> > enough to end users.
>
What about add a new module parameter to turn on honoring guest for
non-coherent DMAs on CPUs without self-snoop?
A previous example is VFIO's "allow_unsafe_interrupts" parameter.

> Another point to consider: KVM is _always_ potentially broken on such CPUs, as
> KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
> host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
> exposed to the guest.
In this case, memory aliasing may only lead to guest not function well, since
guest is not using WC/UC (which can bypass host initialization data in cache).
But if guest has any chance to read information not intended to it, I believe
we need to fix it as well.


> > > I would be quite surprised if there are people running untrusted workloads
> > > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > > IOMMUs/DMA.
What if the guest is a totally malicious one?
Previously we trust the guest in the case of noncoherent DMA is because
we believe a malicious guest will only meet data corruption and shoot his own
foot.

But as Jason raised the security problem in another mail thread [1],
this will expose security hole if CPUs have no self-snoop. So, we need
to fix it, right?
+ Jason, in case I didn't understand this problem correctly.

[1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

> > this is probably true.
> > 
> > > And anyone exposing a device directly to an untrusted workload really
> > > should have done their homework.
> > 
> > or they run trusted workloads which might be tampered by virus to
> > exceed the scope of their homework. 😊
> 
> If a workload is being run in a KVM guest for host isolation/security purposes,
> and a device was exposed to said workload, then I would firmly consider analyzing
> the impact of a compromised guest to be part of their homework.
> 
> > > And it's not like we're going to change KVM's historical behavior at this point.
> > 
> > I agree with your point of not breaking userspace. But still think a warning
> > might be informative to let users evaluate their setup against a newly
> > identified "unexpected behavior"  which has security implication beyond
> > the guest, while the previous interpretation of "unexpected behavior" 
> > might be that the guest can at most shoot its own foot...
> 
> If this issue weren't limited to 10+ year old hardware, I would be more inclined
> to add a message.  But at this point, realistically the only thing KVM would be
> saying is "you're running old hardware, that might be unsafe in today's world".
> 
> For users that care about security, we'd be telling them something they already
> know (and if they don't know, they've got bigger problems).  And for everyone
> else, it'd be scary noise without any meaningful benefit.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-13  1:18           ` Yan Zhao
@ 2024-03-13  8:52             ` Tian, Kevin
  2024-03-13  8:55               ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2024-03-13  8:52 UTC (permalink / raw)
  To: Zhao, Yan Y, Sean Christopherson, jgg@nvidia.com
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yiwei Zhang

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Wednesday, March 13, 2024 9:19 AM
> 
> On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> > On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > I saw the old comment already mentioned that doing so may lead to
> unexpected
> > > behaviors. But I'm not sure whether such code-level caveat has been
> visible
> > > enough to end users.
> >
> What about add a new module parameter to turn on honoring guest for
> non-coherent DMAs on CPUs without self-snoop?
> A previous example is VFIO's "allow_unsafe_interrupts" parameter.

Not sure whether such parameter has a real value.

If it's default 'off' then you break those 10yr+ setups and it's unacceptable.

If it's default 'on' then same effect as this patch does then I'm not sure who'd
want to turn it off afterwards. Somebody aware of such limitation can simply
avoid assigning device w/ non-coherent DMA in VM config file instead of 
further going to toggle the module parameter to prevent something which
he already knows not to do.

> 
> > Another point to consider: KVM is _always_ potentially broken on such
> CPUs, as
> > KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if
> the
> > host has guest memory mapped as non-WB in the PAT, without non-
> coherent DMA
> > exposed to the guest.
> In this case, memory aliasing may only lead to guest not function well, since
> guest is not using WC/UC (which can bypass host initialization data in cache).
> But if guest has any chance to read information not intended to it, I believe
> we need to fix it as well.

Having cache/memory inconsistent could hurt both guest and host.

So in concept forcing WB instead of following host attribute on such CPUs
is kind of broken, though in reality we may not see an usage of exposing
non-WB memory to guest on those old setups as discussed for virtio-gpu case.

> 
> 
> > > > I would be quite surprised if there are people running untrusted
> workloads
> > > > on 10+ year old silicon *and* have passthrough devices and non-
> coherent
> > > > IOMMUs/DMA.
> What if the guest is a totally malicious one?
> Previously we trust the guest in the case of noncoherent DMA is because
> we believe a malicious guest will only meet data corruption and shoot his
> own
> foot.
> 
> But as Jason raised the security problem in another mail thread [1],
> this will expose security hole if CPUs have no self-snoop. So, we need
> to fix it, right?
> + Jason, in case I didn't understand this problem correctly.
> 
> [1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

We'll certain fix the security hole on CPUs w/ self-snoop. In this case
CPU accesses are guaranteed to be coherent and the vulnerability can
only be exposed via non-coherent DMA which is supposed to be fixed
by your coming series. 

But for old CPUs w/o self-snoop the hole can be exploited using either CPU
or non-coherent DMA once the guest PAT is honored. As long as nobody
is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
by Sean that sustaining the old behavior is probably the best option...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-13  8:52             ` Tian, Kevin
@ 2024-03-13  8:55               ` Yan Zhao
  2024-03-13 15:09                 ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-13  8:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Sean Christopherson, jgg@nvidia.com, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett, kvm@vger.kernel.org,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org, Yiwei Zhang

> We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> CPU accesses are guaranteed to be coherent and the vulnerability can
> only be exposed via non-coherent DMA which is supposed to be fixed
> by your coming series. 
> 
> But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> or non-coherent DMA once the guest PAT is honored. As long as nobody
> is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
We can cook a patch to check CPU self-snoop and force WB in EPT even for
non-coherent DMA if no self-snoop. Then back porting such a patch together
with the IOMMU side mitigation for non-coherent DMA.

Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
no self-snoop.

> by Sean that sustaining the old behavior is probably the best option...
Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-13  8:55               ` Yan Zhao
@ 2024-03-13 15:09                 ` Sean Christopherson
  2024-03-14  0:12                   ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-03-13 15:09 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kevin Tian, jgg@nvidia.com, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett, kvm@vger.kernel.org,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org, Yiwei Zhang

On Wed, Mar 13, 2024, Yan Zhao wrote:
> > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > CPU accesses are guaranteed to be coherent and the vulnerability can
> > only be exposed via non-coherent DMA which is supposed to be fixed
> > by your coming series. 
> > 
> > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> We can cook a patch to check CPU self-snoop and force WB in EPT even for
> non-coherent DMA if no self-snoop. Then back porting such a patch together
> with the IOMMU side mitigation for non-coherent DMA.

Please don't.  This is a "let sleeping dogs lie" situation.

  let sleeping dogs lie - avoid interfering in a situation that is currently
  causing no problems but might do so as a result of such interference.

Yes, there is technically a flaw, but we have zero evidence that anyone cares or
that it is actually problematic in practice.  On the other hand, any functional
change we make has a non-zero changes of breaking existing setups that have worked
for many years. 

> Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> no self-snoop.
> 
> > by Sean that sustaining the old behavior is probably the best option...
> Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 

Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
I think it is a reasonable risk.

Being 100% secure is simply not possible.  Security is often about balancing the
risk/threat against the cost.  In this case, the risk is low (old hardware,
uncommon setup for untrusted guests, small window of opportunity, and limited
data exposure), whereas the cost is high (decent chance of breaking existing VMs).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-13 15:09                 ` Sean Christopherson
@ 2024-03-14  0:12                   ` Yan Zhao
  2024-03-14  1:00                     ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-14  0:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kevin Tian, jgg@nvidia.com, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett, kvm@vger.kernel.org,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org, Yiwei Zhang

On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > by your coming series. 
> > > 
> > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > with the IOMMU side mitigation for non-coherent DMA.
> 
> Please don't.  This is a "let sleeping dogs lie" situation.
> 
>   let sleeping dogs lie - avoid interfering in a situation that is currently
>   causing no problems but might do so as a result of such interference.
> 
> Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> that it is actually problematic in practice.  On the other hand, any functional
> change we make has a non-zero changes of breaking existing setups that have worked
> for many years. 
> 
> > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > no self-snoop.
> > 
> > > by Sean that sustaining the old behavior is probably the best option...
> > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> 
> Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> I think it is a reasonable risk.
> 
> Being 100% secure is simply not possible.  Security is often about balancing the
> risk/threat against the cost.  In this case, the risk is low (old hardware,
> uncommon setup for untrusted guests, small window of opportunity, and limited
> data exposure), whereas the cost is high (decent chance of breaking existing VMs).
Ok, thanks for explanation!
I still have one last question: if in future there are CPUs with no selfsnoop
(for some unknown reason, or just paranoid), do we allow this unsafe honoring of
guest memory type for non-coherent DMAs? 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-14  0:12                   ` Yan Zhao
@ 2024-03-14  1:00                     ` Sean Christopherson
  0 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-03-14  1:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kevin Tian, jgg@nvidia.com, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett, kvm@vger.kernel.org,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org, Yiwei Zhang

On Thu, Mar 14, 2024, Yan Zhao wrote:
> On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> > On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > > by your coming series. 
> > > > 
> > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > > with the IOMMU side mitigation for non-coherent DMA.
> > 
> > Please don't.  This is a "let sleeping dogs lie" situation.
> > 
> >   let sleeping dogs lie - avoid interfering in a situation that is currently
> >   causing no problems but might do so as a result of such interference.
> > 
> > Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> > that it is actually problematic in practice.  On the other hand, any functional
> > change we make has a non-zero changes of breaking existing setups that have worked
> > for many years. 
> > 
> > > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > > no self-snoop.
> > > 
> > > > by Sean that sustaining the old behavior is probably the best option...
> > > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> > 
> > Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> > I think it is a reasonable risk.
> > 
> > Being 100% secure is simply not possible.  Security is often about balancing the
> > risk/threat against the cost.  In this case, the risk is low (old hardware,
> > uncommon setup for untrusted guests, small window of opportunity, and limited
> > data exposure), whereas the cost is high (decent chance of breaking existing VMs).
> Ok, thanks for explanation!
> I still have one last question: if in future there are CPUs with no selfsnoop
> (for some unknown reason, or just paranoid),

Don't jinx us :-)

> do we allow this unsafe honoring of guest memory type for non-coherent DMAs? 

Hmm, no?  AIUI, the intent is that all future CPUs will support self-snoop.
Assuming that's the case, then I think it's fair to put the onus on Intel to not
have escapes, i.e. to not end up shipping CPUs with erratas.  Then, if an erratum
does come along, we can make a decision, e.g. allow older CPUs by default, but
require an admin opt-in for new CPUs.

But let's just hope that's a problem we never have to deal with.  :-D

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
  2024-03-11  1:16   ` Yan Zhao
@ 2024-03-25  3:43   ` Chao Gao
  2024-04-01 22:29     ` Sean Christopherson
  2024-08-30  9:35   ` Vitaly Kuznetsov
  2 siblings, 1 reply; 61+ messages in thread
From: Chao Gao @ 2024-03-25  3:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
>Unconditionally honor guest PAT on CPUs that support self-snoop, as
>Intel has confirmed that CPUs that support self-snoop always snoop caches
>and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>even in the presence of aliased memtypes, thus there is no need to trust
>the guest behaves and only honor PAT as a last resort, as KVM does today.
>
>Honoring guest PAT is desirable for use cases where the guest has access
>to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>(mediated, for all intents and purposes) GPU is exposed to the guest, along
>with buffers that are consumed directly by the physical GPU, i.e. which
>can't be proxied by the host to ensure writes from the guest are performed
>with the correct memory type for the GPU.
>
>Cc: Yiwei Zhang <zzyiwei@google.com>
>Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
>Suggested-by: Kevin Tian <kevin.tian@intel.com>
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
> arch/x86/kvm/mmu/mmu.c |  8 +++++---
> arch/x86/kvm/vmx/vmx.c | 10 ++++++----
> 2 files changed, 11 insertions(+), 7 deletions(-)
>
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index 403cd8f914cd..7fa514830628 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> bool kvm_mmu_may_ignore_guest_pat(void)
> {
> 	/*
>-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
>+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
>+	 * not support self-snoop (or is affected by an erratum), and the VM
> 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> 	 * honor the memtype from the guest's PAT so that guest accesses to
> 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
> 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
>-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
>+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
>+	 * bits in response to non-coherent device (un)registration.
> 	 */
>-	return shadow_memtype_mask;
>+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
> }
> 
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 17a8e4fdf9c4..5dc4c24ae203 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> 
> 	/*
> 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
>-	 * device attached.  Letting the guest control memory types on Intel
>-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
>-	 * the guest to behave only as a last resort.
>+	 * device attached and the CPU doesn't support self-snoop.  Letting the
>+	 * guest control memory types on Intel CPUs without self-snoop may
>+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
>+	 * trust the guest to behave only as a last resort.
> 	 */
>-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
>+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
launch DOS attacks on other CPUs on host. isn't it a problem?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-25  3:43   ` Chao Gao
@ 2024-04-01 22:29     ` Sean Christopherson
  0 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-04-01 22:29 UTC (permalink / raw)
  To: Chao Gao
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Mon, Mar 25, 2024, Chao Gao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> >Unconditionally honor guest PAT on CPUs that support self-snoop, as
> >Intel has confirmed that CPUs that support self-snoop always snoop caches
> >and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> >even in the presence of aliased memtypes, thus there is no need to trust
> >the guest behaves and only honor PAT as a last resort, as KVM does today.
> >
> >Honoring guest PAT is desirable for use cases where the guest has access
> >to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> >(mediated, for all intents and purposes) GPU is exposed to the guest, along
> >with buffers that are consumed directly by the physical GPU, i.e. which
> >can't be proxied by the host to ensure writes from the guest are performed
> >with the correct memory type for the GPU.

...

> > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 17a8e4fdf9c4..5dc4c24ae203 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > 
> > 	/*
> > 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> >-	 * device attached.  Letting the guest control memory types on Intel
> >-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> >-	 * the guest to behave only as a last resort.
> >+	 * device attached and the CPU doesn't support self-snoop.  Letting the
> >+	 * guest control memory types on Intel CPUs without self-snoop may
> >+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> >+	 * trust the guest to behave only as a last resort.
> > 	 */
> >-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> >+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
> UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
> launch DOS attacks on other CPUs on host. isn't it a problem?

Guests can already trigger bus locks with atomic accesses that split cache lines.
And SPR adds bus lock detection.  So practically speaking, I'm pretty sure ICX is
the only CPU where anything close to a novel attack is possible.  And FWIW, such
an attack is already possible on AMD.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
  2024-03-11  1:16   ` Yan Zhao
  2024-03-25  3:43   ` Chao Gao
@ 2024-08-30  9:35   ` Vitaly Kuznetsov
  2024-08-30 11:05     ` Gerd Hoffmann
  2024-10-07 13:28     ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 2 replies; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-08-30  9:35 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Gerd Hoffmann

Sean Christopherson <seanjc@google.com> writes:

> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> Intel has confirmed that CPUs that support self-snoop always snoop caches
> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> even in the presence of aliased memtypes, thus there is no need to trust
> the guest behaves and only honor PAT as a last resort, as KVM does today.
>
> Honoring guest PAT is desirable for use cases where the guest has access
> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> with buffers that are consumed directly by the physical GPU, i.e. which
> can't be proxied by the host to ensure writes from the guest are performed
> with the correct memory type for the GPU.
>
> Cc: Yiwei Zhang <zzyiwei@google.com>
> Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 10 ++++++----
>  2 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 403cd8f914cd..7fa514830628 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 17a8e4fdf9c4..5dc4c24ae203 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> +	 * guest control memory types on Intel CPUs without self-snoop may
> +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> +	 * trust the guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);

Necroposting!

Turns out that this change broke "bochs-display" driver in QEMU even
when the guest is modern (don't ask me 'who the hell uses bochs for
modern guests', it was basically a configuration error :-). E.g:

$ qemu-kvm -name c10s -nodefaults -smp 4 -machine
q35,smm=on,accel=kvm,kernel-irqchip=split -global
driver=cfi.pflash01,property=secure,value=on -cpu host -drive
id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/var/lib/libvirt/images/c10s.qcow2
-device virtio-blk-pci,id=image2,drive=drive_image2,bootindex=3,bus=pcie.0,addr=0x8
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,readonly=on,unit=0
-drive file=/tmp/OVMF_VARS.secboot.fd,if=pflash,format=raw,unit=1
-device ahci,id=ahci0 -vnc :0 -device bochs-display -m 8G -monitor stdio

The failure looks like Wayland starting and failing right away, this
repeats multiple times but after a number of restarts it may try to
pretend to work but then it crashes again. Things go back to normal when
the commit is reverted in the host kernel.

The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R)
Silver 4410Y, Sapphire Rapids). I wish I could give additional details
to what exactly happens in the guest but I can't find anything useful in
the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by
signal 9") and I know too little (nothing?) about how modern Linux
graphics stack is organized :-( Cc: Gerd just in case.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30  9:35   ` Vitaly Kuznetsov
@ 2024-08-30 11:05     ` Gerd Hoffmann
  2024-08-30 13:47       ` Vitaly Kuznetsov
  2024-10-07 13:28     ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 61+ messages in thread
From: Gerd Hoffmann @ 2024-08-30 11:05 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yan Zhao, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

> Necroposting!
> 
> Turns out that this change broke "bochs-display" driver in QEMU even
> when the guest is modern (don't ask me 'who the hell uses bochs for
> modern guests', it was basically a configuration error :-). E.g:

qemu stdvga (the default display device) is affected too.

bochs-display is effectively stdvga minus the vga compatibility: no text
mode, no silly planar video modes.  Only linear framebuffers in xrgb
format are supported.

Both devices are handled by the bochs drm driver.

> The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R)
> Silver 4410Y, Sapphire Rapids). I wish I could give additional details
> to what exactly happens in the guest but I can't find anything useful in
> the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by
> signal 9") and I know too little (nothing?) about how modern Linux
> graphics stack is organized :-( Cc: Gerd just in case.

The device has an (purely virtual) pci memory bar for video memory.
The drm driver allocates buffer objects in this video memory.  They
are either used by the kernel directly (fbcon case, which works fine)
or mapped into userspace so the wayland / X11 display server can
software-render into the buffer (which breaks).

take care,
  Gerd

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 11:05     ` Gerd Hoffmann
@ 2024-08-30 13:47       ` Vitaly Kuznetsov
  2024-08-30 13:52         ` Sean Christopherson
  2024-09-02  1:44         ` Yan Zhao
  0 siblings, 2 replies; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-08-30 13:47 UTC (permalink / raw)
  To: Sean Christopherson, Gerd Hoffmann
  Cc: Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao,
	Yiwei Zhang, Lai Jiangshan, Paul E. McKenney, Josh Triplett

Gerd Hoffmann <kraxel@redhat.com> writes:

>> Necroposting!
>> 
>> Turns out that this change broke "bochs-display" driver in QEMU even
>> when the guest is modern (don't ask me 'who the hell uses bochs for
>> modern guests', it was basically a configuration error :-). E.g:
>
> qemu stdvga (the default display device) is affected too.
>

So far, I was only able to verify that the issue has nothing to do with
OVMF and multi-vcpu, it reproduces very well with

$ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
-cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
-device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
-vnc :0 -device VGA -monitor stdio --no-reboot

Comparing traces of working and broken cases, I couldn't find anything
suspicious but I may had missed something of course. For now, it seems
like a userspace misbehavior resulting in a segfault.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 13:47       ` Vitaly Kuznetsov
@ 2024-08-30 13:52         ` Sean Christopherson
  2024-08-30 14:06           ` Vitaly Kuznetsov
  2024-09-02  1:44         ` Yan Zhao
  1 sibling, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-08-30 13:52 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian,
	Yan Zhao, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> Gerd Hoffmann <kraxel@redhat.com> writes:
> 
> >> Necroposting!
> >> 
> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> modern guests', it was basically a configuration error :-). E.g:
> >
> > qemu stdvga (the default display device) is affected too.
> >
> 
> So far, I was only able to verify that the issue has nothing to do with
> OVMF and multi-vcpu, it reproduces very well with
> 
> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> Comparing traces of working and broken cases, I couldn't find anything
> suspicious but I may had missed something of course. For now, it seems
> like a userspace misbehavior resulting in a segfault.

Guest userspace?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 13:52         ` Sean Christopherson
@ 2024-08-30 14:06           ` Vitaly Kuznetsov
  2024-08-30 14:37             ` Vitaly Kuznetsov
  0 siblings, 1 reply; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-08-30 14:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian,
	Yan Zhao, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
>> Gerd Hoffmann <kraxel@redhat.com> writes:
>> 
>> >> Necroposting!
>> >> 
>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>> >> modern guests', it was basically a configuration error :-). E.g:
>> >
>> > qemu stdvga (the default display device) is affected too.
>> >
>> 
>> So far, I was only able to verify that the issue has nothing to do with
>> OVMF and multi-vcpu, it reproduces very well with
>> 
>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>> -vnc :0 -device VGA -monitor stdio --no-reboot
>> 
>> Comparing traces of working and broken cases, I couldn't find anything
>> suspicious but I may had missed something of course. For now, it seems
>> like a userspace misbehavior resulting in a segfault.
>
> Guest userspace?
>

Yes? :-) As Gerd described, video memory is "mapped into userspace so
the wayland / X11 display server can software-render into the buffer"
and it seems that wayland gets something unexpected in this memory and
crashes. 

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 14:06           ` Vitaly Kuznetsov
@ 2024-08-30 14:37             ` Vitaly Kuznetsov
  2024-08-30 16:13               ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-08-30 14:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian,
	Yan Zhao, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Sean Christopherson <seanjc@google.com> writes:
>
>> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
>>> Gerd Hoffmann <kraxel@redhat.com> writes:
>>> 
>>> >> Necroposting!
>>> >> 
>>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>>> >> modern guests', it was basically a configuration error :-). E.g:
>>> >
>>> > qemu stdvga (the default display device) is affected too.
>>> >
>>> 
>>> So far, I was only able to verify that the issue has nothing to do with
>>> OVMF and multi-vcpu, it reproduces very well with
>>> 
>>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>>> -vnc :0 -device VGA -monitor stdio --no-reboot
>>> 
>>> Comparing traces of working and broken cases, I couldn't find anything
>>> suspicious but I may had missed something of course. For now, it seems
>>> like a userspace misbehavior resulting in a segfault.
>>
>> Guest userspace?
>>
>
> Yes? :-) As Gerd described, video memory is "mapped into userspace so
> the wayland / X11 display server can software-render into the buffer"
> and it seems that wayland gets something unexpected in this memory and
> crashes. 

Also, I don't know if it helps or not, but out of two hunks in
377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
issue. I.e. the following is enough to fix things:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..733a0c45d1a6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
        /*
         * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
-        * device attached and the CPU doesn't support self-snoop.  Letting the
-        * guest control memory types on Intel CPUs without self-snoop may
-        * result in unexpected behavior, and so KVM's (historical) ABI is to
-        * trust the guest to behave only as a last resort.
+        * device attached.  Letting the guest control memory types on Intel
+        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
+        * the guest to behave only as a last resort.
         */
-       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
-           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
        return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);


-- 
Vitaly


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 14:37             ` Vitaly Kuznetsov
@ 2024-08-30 16:13               ` Sean Christopherson
  2024-09-02  8:23                 ` Gerd Hoffmann
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-08-30 16:13 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian,
	Yan Zhao, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
> 
> > Sean Christopherson <seanjc@google.com> writes:
> >
> >> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> >>> Gerd Hoffmann <kraxel@redhat.com> writes:
> >>> 
> >>> >> Necroposting!
> >>> >> 
> >>> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >>> >> modern guests', it was basically a configuration error :-). E.g:
> >>> >
> >>> > qemu stdvga (the default display device) is affected too.
> >>> >
> >>> 
> >>> So far, I was only able to verify that the issue has nothing to do with
> >>> OVMF and multi-vcpu, it reproduces very well with
> >>> 
> >>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> >>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> >>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> >>> -vnc :0 -device VGA -monitor stdio --no-reboot
> >>> 
> >>> Comparing traces of working and broken cases, I couldn't find anything
> >>> suspicious but I may had missed something of course. For now, it seems
> >>> like a userspace misbehavior resulting in a segfault.
> >>
> >> Guest userspace?
> >>
> >
> > Yes? :-) As Gerd described, video memory is "mapped into userspace so
> > the wayland / X11 display server can software-render into the buffer"
> > and it seems that wayland gets something unexpected in this memory and
> > crashes. 
> 
> Also, I don't know if it helps or not, but out of two hunks in
> 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
> issue. I.e. the following is enough to fix things:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f18c2d8c7476..733a0c45d1a6 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>         /*
>          * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -        * device attached and the CPU doesn't support self-snoop.  Letting the
> -        * guest control memory types on Intel CPUs without self-snoop may
> -        * result in unexpected behavior, and so KVM's (historical) ABI is to
> -        * trust the guest to behave only as a last resort.
> +        * device attached.  Letting the guest control memory types on Intel
> +        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> +        * the guest to behave only as a last resort.
>          */
> -       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> -           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  
>         return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);

Hmm, that suggests the guest kernel maps the buffer as WC.  And looking at the
bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC.  So it
could be that userspace doesn't play nice with WC, but could it also be that the
QEMU backend doesn't play nice with WC (on Intel)?

Given that this is a purely synthetic device, is there any reason to use UC or WC?
I.e. can the bochs driver configure its VRAM buffers to be WB?  It doesn't look
super easy (the DRM/TTM code has so. many. layers), but it appears doable.  Since
the device only exists in VMs, it's possible the bochs driver has never run on
Intel CPUs with WC memtype.

The one thing that confuses and concerns me is that this broke in the first place.
KVM has honored guest PAT on SVM since forever, which is why I/we had decent
confidence KVM could honor guest PAT on VMX without breaking anything.  SVM (NPT)
has an explicitlyed document special "WC+" memtype, where guest=WC && host=WB == WC+,
and WC+ accesses snoop caches on all CPUs.

But per Intel engineers, Intel CPUs with self-snoop are supposed to snoop caches
on all processors too.

I assume this same setup works fine on AMD/SVM?  If so, we probably need to do
more digging before fudging around this in the guest.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 16:13               ` Sean Christopherson
@ 2024-09-02  8:23                 ` Gerd Hoffmann
  0 siblings, 0 replies; 61+ messages in thread
From: Gerd Hoffmann @ 2024-09-02  8:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yan Zhao, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett, Thomas Zimmermann

> > > Yes? :-) As Gerd described, video memory is "mapped into userspace so
> > > the wayland / X11 display server can software-render into the buffer"
> > > and it seems that wayland gets something unexpected in this memory and
> > > crashes. 
> > 
> > Also, I don't know if it helps or not, but out of two hunks in
> > 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
> > issue. I.e. the following is enough to fix things:
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index f18c2d8c7476..733a0c45d1a6 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  
> >         /*
> >          * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> > -        * device attached and the CPU doesn't support self-snoop.  Letting the
> > -        * guest control memory types on Intel CPUs without self-snoop may
> > -        * result in unexpected behavior, and so KVM's (historical) ABI is to
> > -        * trust the guest to behave only as a last resort.
> > +        * device attached.  Letting the guest control memory types on Intel
> > +        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> > +        * the guest to behave only as a last resort.
> >          */
> > -       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > -           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >  
> >         return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> 
> Hmm, that suggests the guest kernel maps the buffer as WC.  And looking at the
> bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC.  So it
> could be that userspace doesn't play nice with WC, but could it also be that the
> QEMU backend doesn't play nice with WC (on Intel)?
> 
> Given that this is a purely synthetic device, is there any reason to use UC or WC?

Well, sharing code with other (real hardware) drivers is pretty much the
only reason.  DRM has a set of helper functions to manage vram in pci
memory bars (see drm_gem_vram_helper.c, drm_gem_ttm_helper.c).

> I.e. can the bochs driver configure its VRAM buffers to be WB?  It doesn't look
> super easy (the DRM/TTM code has so. many. layers), but it appears doable.  Since
> the device only exists in VMs, it's possible the bochs driver has never run on
> Intel CPUs with WC memtype.

Thomas Zimmermann <tzimmermann@suse.de> (Cc'ed) has a drm patch series
in flight which switches the bochs driver to a shadow buffer model, i.e.
all the buffers visible to fbcon and userspace live in main memory.
Display updates are handled via in-kernel memcpy from shadow to vram.
The pci memory bar becomes an bochs driver implementation detail not
visible outside the driver.  This should give the bochs driver the
freedom to map vram with whatever attributes work best with kvm, without
needing drm changes outside the driver.

Of course all this does not help much with current distro kernels broken
by this patch ...

take care,
  Gerd


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30 13:47       ` Vitaly Kuznetsov
  2024-08-30 13:52         ` Sean Christopherson
@ 2024-09-02  1:44         ` Yan Zhao
  2024-09-02  9:49           ` Vitaly Kuznetsov
  1 sibling, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-09-02  1:44 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
> Gerd Hoffmann <kraxel@redhat.com> writes:
> 
> >> Necroposting!
> >> 
> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> modern guests', it was basically a configuration error :-). E.g:
> >
> > qemu stdvga (the default display device) is affected too.
> >
> 
> So far, I was only able to verify that the issue has nothing to do with
> OVMF and multi-vcpu, it reproduces very well with
> 
> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> Comparing traces of working and broken cases, I couldn't find anything
> suspicious but I may had missed something of course. For now, it seems
> like a userspace misbehavior resulting in a segfault.
Could you please share steps launch the broken guest desktop?
(better also with guest kernel version, name of desktop processes,
 name of X server)

Currently, I couldn't reproduce the error with "-device bochs-display" or
"-device VGA" locally on a "Coffee Lake-S" test machine. 

Qemu cmd as below:
qemu-system-x86_64 -m 4096 -smp 1 -M q35 -name guest-01
-hda ubuntu22-1.qcow2 -bios /usr/bin/bios.bin -enable-kvm -k en-us
-serial stdio -device bochs-display -machine kernel_irqchip=on
-cpu host -usb -usbdevice tablet

The guest can see a VGA device
    00:02.0 Display controller: Device 1234:1111 (rev 02)
with driver
    # readlink /sys/bus/pci/devices/0000\:00\:02.0/driver
    ../../../bus/pci/drivers/bochs-drm

I have tried hardcoding several fields as below:

(1)  hardcoded the fb_map to wc in the guest driver

--- a/drivers/gpu/drm/tiny/bochs.c
+++ b/drivers/gpu/drm/tiny/bochs.c
@@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
        if (pci_request_region(pdev, 0, "bochs-drm") != 0)
                DRM_WARN("Cannot request framebuffer, boot fb still active?\n");

-       bochs->fb_map = ioremap(addr, size);
+       bochs->fb_map = ioremap_wc(addr, size);
+       printk("bochs wc fb_map=%lx, addr=%lx, size=%lx\n", (unsigned long)bochs->fb_map, (unsigned long)addr, (unsigned long)size);
        if (bochs->fb_map == NULL) {
                DRM_ERROR("Cannot map framebuffer\n");
                return -ENOMEM;

With dmesg as below:

[    7.565840] ioremap wc phys_addr fd000000 size 1000000 to wc
[    7.565856] bochs wc fb_map=ffffc90004000000, addr=fd000000, size=1000000
[    7.565859] [drm] Found bochs VGA, ID 0xb0c5.
[    7.565861] [drm] Framebuffer size 16384 kB @ 0xfd000000, mmio @ 0xfebd9000.
[    7.591995] [drm] Found EDID data blob.
[    7.603956] [drm] Initialized bochs-drm 1.0.0 20130925 for 0000:00:02.0 on minor 1
[    7.614263] bochs-drm 0000:00:02.0: [drm] fb1: bochs-drmdrmfb frame buffer device

(2) hardcoded the memory type to WC in KVM intel driver.
+       if (gfn >= 0xfd000 && gfn < 0xfe000)
+               return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;


(3) hardcoded mmap flags to WC for some bo objects for Xorg.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-02  1:44         ` Yan Zhao
@ 2024-09-02  9:49           ` Vitaly Kuznetsov
  2024-09-03  0:25             ` Yan Zhao
  2024-09-03 15:30             ` Sean Christopherson
  0 siblings, 2 replies; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-09-02  9:49 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
>> Gerd Hoffmann <kraxel@redhat.com> writes:
>> 
>> >> Necroposting!
>> >> 
>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>> >> modern guests', it was basically a configuration error :-). E.g:
>> >
>> > qemu stdvga (the default display device) is affected too.
>> >
>> 
>> So far, I was only able to verify that the issue has nothing to do with
>> OVMF and multi-vcpu, it reproduces very well with
>> 
>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>> -vnc :0 -device VGA -monitor stdio --no-reboot
>> 
>> Comparing traces of working and broken cases, I couldn't find anything
>> suspicious but I may had missed something of course. For now, it seems
>> like a userspace misbehavior resulting in a segfault.
> Could you please share steps launch the broken guest desktop?
> (better also with guest kernel version, name of desktop processes,
>  name of X server)

I think the easiest would be to download the latest Centos Stream 10
iso, e.g:

https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso
(the link is probably not eternal but should work for a couple weeks,
check https://composes.stream.centos.org/stream-10/development/ it it
doesn't work anymore).

Then, just run it:
$ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot

and connect to VNC console. To speed things up, pick 'Install Centos
Stream 10' in the boot menu to avoid ISO integrity check.

With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
commit included, you will see the following on the VNC console:
installer tries starting Wayland, crashes and drops back into text
console. If you revert the commit and start over, Wayland will normally
start and you will see the installer.

If the installer environment is inconvenient for debugging, then you can
install in text mode (or with the commit reverted :-) and then the same
problem will be observed when gdm starts.

FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
Silver 4410Y".

-- 
Vitaly

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-02  9:49           ` Vitaly Kuznetsov
@ 2024-09-03  0:25             ` Yan Zhao
  2024-09-03 15:30             ` Sean Christopherson
  1 sibling, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2024-09-03  0:25 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Mon, Sep 02, 2024 at 11:49:43AM +0200, Vitaly Kuznetsov wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
> >> Gerd Hoffmann <kraxel@redhat.com> writes:
> >> 
> >> >> Necroposting!
> >> >> 
> >> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> >> modern guests', it was basically a configuration error :-). E.g:
> >> >
> >> > qemu stdvga (the default display device) is affected too.
> >> >
> >> 
> >> So far, I was only able to verify that the issue has nothing to do with
> >> OVMF and multi-vcpu, it reproduces very well with
> >> 
> >> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> >> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> >> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> >> -vnc :0 -device VGA -monitor stdio --no-reboot
> >> 
> >> Comparing traces of working and broken cases, I couldn't find anything
> >> suspicious but I may had missed something of course. For now, it seems
> >> like a userspace misbehavior resulting in a segfault.
> > Could you please share steps launch the broken guest desktop?
> > (better also with guest kernel version, name of desktop processes,
> >  name of X server)
> 
> I think the easiest would be to download the latest Centos Stream 10
> iso, e.g:
> 
> https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso
> (the link is probably not eternal but should work for a couple weeks,
> check https://composes.stream.centos.org/stream-10/development/ it it
> doesn't work anymore).
> 
> Then, just run it:
> $ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> and connect to VNC console. To speed things up, pick 'Install Centos
> Stream 10' in the boot menu to avoid ISO integrity check.
> 
> With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
> commit included, you will see the following on the VNC console:
> installer tries starting Wayland, crashes and drops back into text
> console. If you revert the commit and start over, Wayland will normally
> start and you will see the installer.
Hmm, looks this issue can't be reproduced on physical machine "Coffee Lake-S".
The installer can show up to ask for language selection.

But it can be reproduced on the machine "Sapphire Rapids XCC".

> If the installer environment is inconvenient for debugging, then you can
> install in text mode (or with the commit reverted :-) and then the same
> problem will be observed when gdm starts.
>
Same to the gdm.

> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> Silver 4410Y".
Will have a debug to check what's going wrong. Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-02  9:49           ` Vitaly Kuznetsov
  2024-09-03  0:25             ` Yan Zhao
@ 2024-09-03 15:30             ` Sean Christopherson
  2024-09-03 16:20               ` Vitaly Kuznetsov
  1 sibling, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-09-03 15:30 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Yan Zhao, Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> Silver 4410Y".

Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
on another hardware issue?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-03 15:30             ` Sean Christopherson
@ 2024-09-03 16:20               ` Vitaly Kuznetsov
  2024-09-04  2:28                 ` Yan Zhao
  2024-09-04 11:47                 ` Vitaly Kuznetsov
  0 siblings, 2 replies; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-09-03 16:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>> Silver 4410Y".
>
> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> on another hardware issue?

Very possible, as according to Yan Zhao this doesn't reproduce on at
least "Coffee Lake-S". Let me try to grab some random hardware around
and I'll be back with my observations.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-03 16:20               ` Vitaly Kuznetsov
@ 2024-09-04  2:28                 ` Yan Zhao
  2024-09-04 12:17                   ` Yan Zhao
  2024-09-04 11:47                 ` Vitaly Kuznetsov
  1 sibling, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-09-04  2:28 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> >> Silver 4410Y".
> >
> > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > on another hardware issue?
> 
> Very possible, as according to Yan Zhao this doesn't reproduce on at
> least "Coffee Lake-S". Let me try to grab some random hardware around
> and I'll be back with my observations.

Update some new findings from my side:

BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
from 0xfd000000 to 0xfe000000.

On "Sapphire Rapids XCC":

1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
   correctly. 
   i.e.
   if (gfn >= 0xfd000 && gfn < 0xfe000) {
   	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
   }
   return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
   restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
   correctly in this case).

3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
   this fb_map range as WC, with
   iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));

   However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
   this fb_map has been reserved as uc- by ioremap().
   Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.

   So, with KVM setting WB (no IPAT) to this fb_map range, the effective
   memory type is UC- and installer/gdm restarts endlessly.

4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
   to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
   (didn't verify the installer's case as I can't update the driver in that case).

   The reason is that the ioremap_wc() called during starting GDM will no longer
   meet conflict and can map guest PAT as WC.

WIP to find out why effective UC in fb_map range will make gdm to restart
endlessly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-04  2:28                 ` Yan Zhao
@ 2024-09-04 12:17                   ` Yan Zhao
  2024-09-05  0:41                     ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-09-04 12:17 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Sean Christopherson, Gerd Hoffmann,
	Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett

On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> > 
> > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > >> Silver 4410Y".
> > >
> > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > on another hardware issue?
> > 
> > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > least "Coffee Lake-S". Let me try to grab some random hardware around
> > and I'll be back with my observations.
> 
> Update some new findings from my side:
> 
> BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> from 0xfd000000 to 0xfe000000.
> 
> On "Sapphire Rapids XCC":
> 
> 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
>    correctly. 
>    i.e.
>    if (gfn >= 0xfd000 && gfn < 0xfe000) {
>    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>    }
>    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> 
> 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
>    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
>    correctly in this case).
> 
> 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
>    this fb_map range as WC, with
>    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> 
>    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
>    this fb_map has been reserved as uc- by ioremap().
>    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> 
>    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
>    memory type is UC- and installer/gdm restarts endlessly.
> 
> 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
>    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
>    (didn't verify the installer's case as I can't update the driver in that case).
> 
>    The reason is that the ioremap_wc() called during starting GDM will no longer
>    meet conflict and can map guest PAT as WC.
> 
> 
> WIP to find out why effective UC in fb_map range will make gdm to restart
> endlessly.
Not sure whether it's simply because UC is too slow.

T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times

              | Sapphire Rapids XCC  | Coffee Lake-S
--------------|----------------------|-----------------
KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
--------------|----------------------|-----------------
KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
--------------|----------------------|-----------------
KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
------------------------------------------------------

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-04 12:17                   ` Yan Zhao
@ 2024-09-05  0:41                     ` Sean Christopherson
  2024-09-05  9:43                       ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-09-05  0:41 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Vitaly Kuznetsov, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Wed, Sep 04, 2024, Yan Zhao wrote:
> On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson <seanjc@google.com> writes:
> > > 
> > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > >> Silver 4410Y".
> > > >
> > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > on another hardware issue?
> > > 
> > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > and I'll be back with my observations.
> > 
> > Update some new findings from my side:
> > 
> > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > from 0xfd000000 to 0xfe000000.
> > 
> > On "Sapphire Rapids XCC":
> > 
> > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> >    correctly. 
> >    i.e.
> >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >    }
> >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > 
> > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> >    correctly in this case).
> > 
> > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> >    this fb_map range as WC, with
> >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > 
> >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> >    this fb_map has been reserved as uc- by ioremap().
> >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > 
> >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> >    memory type is UC- and installer/gdm restarts endlessly.
> > 
> > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> >    (didn't verify the installer's case as I can't update the driver in that case).
> > 
> >    The reason is that the ioremap_wc() called during starting GDM will no longer
> >    meet conflict and can map guest PAT as WC.

Huh.  The upside of this is that it sounds like there's nothing broken with WC
or self-snoop.

> > WIP to find out why effective UC in fb_map range will make gdm to restart
> > endlessly.
> Not sure whether it's simply because UC is too slow.
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times
> 
>               | Sapphire Rapids XCC  | Coffee Lake-S
> --------------|----------------------|-----------------
> KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s

Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
with UC specifically, or if it occurs with any accesses that immediately write
through to main memory.

> --------------|----------------------|-----------------
> KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> --------------|----------------------|-----------------
> KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> ------------------------------------------------------

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-05  0:41                     ` Sean Christopherson
@ 2024-09-05  9:43                       ` Yan Zhao
  2024-09-09  5:30                         ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-09-05  9:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Gerd Hoffmann, Paolo Bonzini, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> On Wed, Sep 04, 2024, Yan Zhao wrote:
> > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > Sean Christopherson <seanjc@google.com> writes:
> > > > 
> > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > >> Silver 4410Y".
> > > > >
> > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > on another hardware issue?
> > > > 
> > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > and I'll be back with my observations.
> > > 
> > > Update some new findings from my side:
> > > 
> > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > from 0xfd000000 to 0xfe000000.
> > > 
> > > On "Sapphire Rapids XCC":
> > > 
> > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > >    correctly. 
> > >    i.e.
> > >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > >    }
> > >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > 
> > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > >    correctly in this case).
> > > 
> > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > >    this fb_map range as WC, with
> > >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > 
> > >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > >    this fb_map has been reserved as uc- by ioremap().
> > >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > 
> > >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > >    memory type is UC- and installer/gdm restarts endlessly.
> > > 
> > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > >    (didn't verify the installer's case as I can't update the driver in that case).
> > > 
> > >    The reason is that the ioremap_wc() called during starting GDM will no longer
> > >    meet conflict and can map guest PAT as WC.
> 
> Huh.  The upside of this is that it sounds like there's nothing broken with WC
> or self-snoop.
Considering a different perspective, the fb_map range is used as frame buffer
(vram), with the guest writing to this range and the host reading from it.
If the issue were related to self-snooping, we would expect the VNC window to
display distorted data. However, the observed behavior is that the GDM window
shows up correctly for a sec and restarts over and over.

So, do you think we can simply fix this issue by calling ioremap_wc() for the
frame buffer/vram range in bochs driver, as is commonly done in other gpu
drivers?

--- a/drivers/gpu/drm/tiny/bochs.c
+++ b/drivers/gpu/drm/tiny/bochs.c
@@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
        if (pci_request_region(pdev, 0, "bochs-drm") != 0)
                DRM_WARN("Cannot request framebuffer, boot fb still active?\n");

-       bochs->fb_map = ioremap(addr, size);
+       bochs->fb_map = ioremap_wc(addr, size);
        if (bochs->fb_map == NULL) {
                DRM_ERROR("Cannot map framebuffer\n");
                return -ENOMEM;


> 
> > > WIP to find out why effective UC in fb_map range will make gdm to restart
> > > endlessly.
> > Not sure whether it's simply because UC is too slow.
> > 
> > T=Test execution time of a selftest in which guest writes to a GPA for
> >   0x1000000UL times
> > 
> >               | Sapphire Rapids XCC  | Coffee Lake-S
> > --------------|----------------------|-----------------
> > KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
> 
> Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
> with UC specifically, or if it occurs with any accesses that immediately write
> through to main memory.
> 
> > --------------|----------------------|-----------------
> > KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> > --------------|----------------------|-----------------
> > KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> > ------------------------------------------------------

I re-run all the tests and collected an averaged data (10 times each) as
below (previous data was just a single-run score):


T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times with WRITE_ONCE

KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
-------------|---------------------|----------------
 WB+IPAT     |     T=0.1511s       |    T=0.1661s
-------------|---------------------|----------------
 WC+IPAT     |     T=0.1411s       |    T=0.1656s
-------------|---------------------|----------------
 WT+IPAT     |     T=3.7527s       |    T=0.6156s
-------------|---------------------|----------------
 WP+IPAT     |     T=4.4663s       |    T=0.6203s
-------------|---------------------|----------------
 UC+IPAT     |     T=3.4632s       |    T=0.5868s


T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times with movdir64b.

(Coffee Lake-S has no feature movdir64).

KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
-------------|---------------------|----------------
 WB+IPAT     |     T=2.6142s       |       /     
-------------|---------------------|----------------
 WC+IPAT     |     T=2.8919s       |       /     
-------------|---------------------|----------------
 WT+IPAT     |     T=3.0966s       |       /      
-------------|---------------------|----------------
 WP+IPAT     |     T=2.4933s       |       /     
-------------|---------------------|----------------
 UC+IPAT     |     T=3.4606s       |       /     




^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-05  9:43                       ` Yan Zhao
@ 2024-09-09  5:30                         ` Yan Zhao
  2024-09-09 13:24                           ` Paolo Bonzini
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-09-09  5:30 UTC (permalink / raw)
  To: Sean Christopherson, Vitaly Kuznetsov, Gerd Hoffmann,
	Paolo Bonzini, kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett

On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> > On Wed, Sep 04, 2024, Yan Zhao wrote:
> > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > > Sean Christopherson <seanjc@google.com> writes:
> > > > > 
> > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > > >> Silver 4410Y".
> > > > > >
> > > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > > on another hardware issue?
> > > > > 
> > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > > and I'll be back with my observations.
> > > > 
> > > > Update some new findings from my side:
> > > > 
> > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > > from 0xfd000000 to 0xfe000000.
> > > > 
> > > > On "Sapphire Rapids XCC":
> > > > 
> > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > > >    correctly. 
> > > >    i.e.
> > > >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > > >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > > >    }
> > > >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > > 
> > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > > >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > > >    correctly in this case).
> > > > 
> > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > > >    this fb_map range as WC, with
> > > >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > > 
> > > >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > > >    this fb_map has been reserved as uc- by ioremap().
> > > >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > > 
> > > >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > > >    memory type is UC- and installer/gdm restarts endlessly.
> > > > 
> > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > > >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > > >    (didn't verify the installer's case as I can't update the driver in that case).
> > > > 
> > > >    The reason is that the ioremap_wc() called during starting GDM will no longer
> > > >    meet conflict and can map guest PAT as WC.
> > 
> > Huh.  The upside of this is that it sounds like there's nothing broken with WC
> > or self-snoop.
> Considering a different perspective, the fb_map range is used as frame buffer
> (vram), with the guest writing to this range and the host reading from it.
> If the issue were related to self-snooping, we would expect the VNC window to
> display distorted data. However, the observed behavior is that the GDM window
> shows up correctly for a sec and restarts over and over.
> 
> So, do you think we can simply fix this issue by calling ioremap_wc() for the
> frame buffer/vram range in bochs driver, as is commonly done in other gpu
> drivers?
> 
> --- a/drivers/gpu/drm/tiny/bochs.c
> +++ b/drivers/gpu/drm/tiny/bochs.c
> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
>         if (pci_request_region(pdev, 0, "bochs-drm") != 0)
>                 DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
> 
> -       bochs->fb_map = ioremap(addr, size);
> +       bochs->fb_map = ioremap_wc(addr, size);
>         if (bochs->fb_map == NULL) {
>                 DRM_ERROR("Cannot map framebuffer\n");
>                 return -ENOMEM;
> 
> 
> > 
> > > > WIP to find out why effective UC in fb_map range will make gdm to restart
> > > > endlessly.
> > > Not sure whether it's simply because UC is too slow.
> > > 
> > > T=Test execution time of a selftest in which guest writes to a GPA for
> > >   0x1000000UL times
> > > 
> > >               | Sapphire Rapids XCC  | Coffee Lake-S
> > > --------------|----------------------|-----------------
> > > KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
> > 
> > Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
> > with UC specifically, or if it occurs with any accesses that immediately write
> > through to main memory.
> > 
> > > --------------|----------------------|-----------------
> > > KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> > > --------------|----------------------|-----------------
> > > KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> > > ------------------------------------------------------
> 
> I re-run all the tests and collected an averaged data (10 times each) as
> below (previous data was just a single-run score):
> 
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with WRITE_ONCE
> 
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
>  WB+IPAT     |     T=0.1511s       |    T=0.1661s
> -------------|---------------------|----------------
>  WC+IPAT     |     T=0.1411s       |    T=0.1656s
> -------------|---------------------|----------------
>  WT+IPAT     |     T=3.7527s       |    T=0.6156s
> -------------|---------------------|----------------
>  WP+IPAT     |     T=4.4663s       |    T=0.6203s
> -------------|---------------------|----------------
>  UC+IPAT     |     T=3.4632s       |    T=0.5868s
> 
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with movdir64b.
> 
> (Coffee Lake-S has no feature movdir64).
> 
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
>  WB+IPAT     |     T=2.6142s       |       /     
> -------------|---------------------|----------------
>  WC+IPAT     |     T=2.8919s       |       /     
> -------------|---------------------|----------------
>  WT+IPAT     |     T=3.0966s       |       /      
> -------------|---------------------|----------------
>  WP+IPAT     |     T=2.4933s       |       /     
> -------------|---------------------|----------------
>  UC+IPAT     |     T=3.4606s       |       /     
>
Up to now, I think I have root caused this issue.

Status before this update:
In either ubuntu or centos, on "Sapphire Rapids XCC"
- gdm fails to launch gnome-shell when wayland is enabled, when
  effective memory type is UC/UC-.
- gdm is able launch gnome-shell correctly when wayland is enabled, when
  effective memory type is WB or WC.
- gdm is able launch gnome-shell correctly when wayland is not enabled, with
  any effective memory type.


Update:
1. I tried KVM memtype = WT + IPAT for this framebuffer range,
   gdm fails to launch gnome-shell when wayland is enabled.

   Since the only difference between WT and WB is that write in WT is slow,
   the failure should not be self-snoop issue.


2. The current bochs driver calls ioremap() to map framebuffer range.
   On x86 architectures, ioremap() maps VA with PAT=UC- and invokes
   memtype_reserve() to reserve the memory type as UC- for the physical range.
   This reservation can cause subsequent calls to ioremap_wc() to fail to map
   the VA with PAT=WC to the same framebuffer range in
   ttm_kmap_iter_linear_io_init().
   Consequently, the operation drm_gem_vram_bo_driver_move() become
   significantly slow on platforms where UC memory access is slow.

   When host KVM honors guest PAT memory types, the effective memory type        
   for this framebuffer range is                                                    
   - WC when ioremap_wc() is used in driver probing phase                           
   - UC- when ioremap() is used.

   I measured the data below for drm_gem_vram_bo_driver_move() which
   does memset to this framebuffer range with size 0x3e8000.

     ---------------------------------------------------------------
    |                               |      in bochs_hw_init()       |
    |                               |    ioremap()   | ioremap_wc() |
    |-------------------------------|----------------|--------------|
    |     cycles of                 |    2227.4M     |   17.8M      |
    | drm_gem_vram_bo_driver_move() |                |              |
    |-------------------------------|----------------|--------------|
    |     time of                   |    1.24s       |   0.01s      |
    | drm_gem_vram_bo_driver_move() |                |              |
     ---------------------------------------------------------------

    drm_gem_vram_bo_driver_move
       ttm_bo_move_memcpy()
           ttm_kmap_iter_linear_io_init()
	       iosys_map_set_vaddr_iomem(&iter_io->dmap,
	                                 ioremap_wc(mem->bus.offset,mem->size));
	   ttm_move_memcpy
	       memset_io or
	       drm_memcpy_from_wc


   If I comment out the memset_io() and drm_memcpy_from_wc() in
   ttm_move_memcpy(), drm_gem_vram_bo_driver_move() can be very fast and gdm is
   able to launch gnome-shell and login successfully, though sometime the
   screen is a little blurred. 

3. I sent a fix at [1] to let guest bochs driver map the framebuffer
   with PAT=WC for kernel access.
   [1] https://lore.kernel.org/all/20240909051529.26776-1-yan.y.zhao@intel.com/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-09  5:30                         ` Yan Zhao
@ 2024-09-09 13:24                           ` Paolo Bonzini
  2024-09-09 16:04                             ` Sean Christopherson
  2024-09-10  1:05                             ` Yan Zhao
  0 siblings, 2 replies; 61+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:24 UTC (permalink / raw)
  To: Yan Zhao, Sean Christopherson, Vitaly Kuznetsov, Gerd Hoffmann,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On 9/9/24 07:30, Yan Zhao wrote:
> On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
>> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
>>> On Wed, Sep 04, 2024, Yan Zhao wrote:
>>>> On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
>>>>> On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
>>>>>> Sean Christopherson <seanjc@google.com> writes:
>>>>>>
>>>>>>> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>>>>>>>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>>>>>>>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>>>>>>>> Silver 4410Y".
>>>>>>>
>>>>>>> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
>>>>>>> on another hardware issue?
>>>>>>
>>>>>> Very possible, as according to Yan Zhao this doesn't reproduce on at
>>>>>> least "Coffee Lake-S". Let me try to grab some random hardware around
>>>>>> and I'll be back with my observations.
>>>>>
>>>>> Update some new findings from my side:
>>>>>
>>>>> BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
>>>>> from 0xfd000000 to 0xfe000000.
>>>>>
>>>>> On "Sapphire Rapids XCC":
>>>>>
>>>>> 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
>>>>>     correctly.
>>>>>     i.e.
>>>>>     if (gfn >= 0xfd000 && gfn < 0xfe000) {
>>>>>     	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>>>>>     }
>>>>>     return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>>>>>
>>>>> 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
>>>>>     restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
>>>>>     correctly in this case).
>>>>>
>>>>> 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
>>>>>     this fb_map range as WC, with
>>>>>     iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
>>>>>
>>>>>     However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
>>>>>     this fb_map has been reserved as uc- by ioremap().
>>>>>     Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
>>>>>
>>>>>     So, with KVM setting WB (no IPAT) to this fb_map range, the effective
>>>>>     memory type is UC- and installer/gdm restarts endlessly.
>>>>>
>>>>> 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
>>>>>     to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
>>>>>     (didn't verify the installer's case as I can't update the driver in that case).
>>>>>
>>>>>     The reason is that the ioremap_wc() called during starting GDM will no longer
>>>>>     meet conflict and can map guest PAT as WC.
>>>
>>> Huh.  The upside of this is that it sounds like there's nothing broken with WC
>>> or self-snoop.
>> Considering a different perspective, the fb_map range is used as frame buffer
>> (vram), with the guest writing to this range and the host reading from it.
>> If the issue were related to self-snooping, we would expect the VNC window to
>> display distorted data. However, the observed behavior is that the GDM window
>> shows up correctly for a sec and restarts over and over.
>>
>> So, do you think we can simply fix this issue by calling ioremap_wc() for the
>> frame buffer/vram range in bochs driver, as is commonly done in other gpu
>> drivers?
>>
>> --- a/drivers/gpu/drm/tiny/bochs.c
>> +++ b/drivers/gpu/drm/tiny/bochs.c
>> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
>>          if (pci_request_region(pdev, 0, "bochs-drm") != 0)
>>                  DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
>>
>> -       bochs->fb_map = ioremap(addr, size);
>> +       bochs->fb_map = ioremap_wc(addr, size);
>>          if (bochs->fb_map == NULL) {
>>                  DRM_ERROR("Cannot map framebuffer\n");
>>                  return -ENOMEM;

While this is a fix for future kernels, it doesn't change the result for 
VMs already in existence.

I don't think there's an alternative to putting this behind a quirk.

Paolo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-09 13:24                           ` Paolo Bonzini
@ 2024-09-09 16:04                             ` Sean Christopherson
  2024-09-10  1:05                             ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2024-09-09 16:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yan Zhao, Vitaly Kuznetsov, Gerd Hoffmann, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

On Mon, Sep 09, 2024, Paolo Bonzini wrote:
> On 9/9/24 07:30, Yan Zhao wrote:
> > On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
> > > On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> > > > On Wed, Sep 04, 2024, Yan Zhao wrote:
> > > > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > > > > Sean Christopherson <seanjc@google.com> writes:
> > > > > > > 
> > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > > > > > > FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > > > > > > but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > > > > > > Silver 4410Y".
> > > > > > > > 
> > > > > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > > > > on another hardware issue?
> > > > > > > 
> > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > > > > and I'll be back with my observations.
> > > > > > 
> > > > > > Update some new findings from my side:
> > > > > > 
> > > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > > > > from 0xfd000000 to 0xfe000000.
> > > > > > 
> > > > > > On "Sapphire Rapids XCC":
> > > > > > 
> > > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > > > > >     correctly.
> > > > > >     i.e.
> > > > > >     if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > > > > >     	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > > > > >     }
> > > > > >     return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > > > > 
> > > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > > > > >     restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > > > > >     correctly in this case).
> > > > > > 
> > > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > > > > >     this fb_map range as WC, with
> > > > > >     iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > > > > 
> > > > > >     However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > > > > >     this fb_map has been reserved as uc- by ioremap().
> > > > > >     Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > > > > 
> > > > > >     So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > > > > >     memory type is UC- and installer/gdm restarts endlessly.
> > > > > > 
> > > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > > > > >     to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > > > > >     (didn't verify the installer's case as I can't update the driver in that case).
> > > > > > 
> > > > > >     The reason is that the ioremap_wc() called during starting GDM will no longer
> > > > > >     meet conflict and can map guest PAT as WC.
> > > > 
> > > > Huh.  The upside of this is that it sounds like there's nothing broken with WC
> > > > or self-snoop.
> > > Considering a different perspective, the fb_map range is used as frame buffer
> > > (vram), with the guest writing to this range and the host reading from it.
> > > If the issue were related to self-snooping, we would expect the VNC window to
> > > display distorted data. However, the observed behavior is that the GDM window
> > > shows up correctly for a sec and restarts over and over.
> > > 
> > > So, do you think we can simply fix this issue by calling ioremap_wc() for the
> > > frame buffer/vram range in bochs driver, as is commonly done in other gpu
> > > drivers?
> > > 
> > > --- a/drivers/gpu/drm/tiny/bochs.c
> > > +++ b/drivers/gpu/drm/tiny/bochs.c
> > > @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
> > >          if (pci_request_region(pdev, 0, "bochs-drm") != 0)
> > >                  DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
> > > 
> > > -       bochs->fb_map = ioremap(addr, size);
> > > +       bochs->fb_map = ioremap_wc(addr, size);
> > >          if (bochs->fb_map == NULL) {
> > >                  DRM_ERROR("Cannot map framebuffer\n");
> > >                  return -ENOMEM;
> 
> While this is a fix for future kernels, it doesn't change the result for VMs
> already in existence.

I would prefer to bottom out on exactly whether or not the SPR/CLX behavior is
working as intended.  Maybe the ~8x slowdown is just a side effect of any Intel
multi-socket/node system, but I think we should get confirmation (inasmuch as
possible) that that is indeed the case.  E.g. if this is actually a bug in CLX+,
then the actions we need to take are different.

> I don't think there's an alternative to putting this behind a quirk.

This gets a bit weird, which is why I want to bottom out on whether or not CLX
and SPR are working as intended.  If non-coherent DMA is attached to the VM, then
even before this patch KVM would honor guest PAT.  I agree that we don't want to
break existing setups, but if CLX+SPR are working as intended, then this is
inarguably a bochs driver bug, and I would prefer to have the quirk explicitly
reference bochs-compatible devices, e.g. in the name and documentation, so that
userspace can disable the quirk by default and only leave it enabled if a bochs
device is being exposed to the guest.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-09 13:24                           ` Paolo Bonzini
  2024-09-09 16:04                             ` Sean Christopherson
@ 2024-09-10  1:05                             ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2024-09-10  1:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Gerd Hoffmann, kvm, rcu,
	linux-kernel, Kevin Tian, Yiwei Zhang, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett

On Mon, Sep 09, 2024 at 03:24:40PM +0200, Paolo Bonzini wrote:
> While this is a fix for future kernels, it doesn't change the result for VMs
> already in existence.
Though this is the truth, I have concerns that there may be other guest drivers
with improper PAT configurations that were previously masked by KVM's force-WB
setting. Now that we respect the guest's PAT settings, these misconfigurations
could lead to degraded performance, potentially perceived as errors, as was
observed in the previous VMX unit test and the current Bochs scenario.

> I don't think there's an alternative to putting this behind a quirk.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-09-03 16:20               ` Vitaly Kuznetsov
  2024-09-04  2:28                 ` Yan Zhao
@ 2024-09-04 11:47                 ` Vitaly Kuznetsov
  1 sibling, 0 replies; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-09-04 11:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Gerd Hoffmann, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Sean Christopherson <seanjc@google.com> writes:
>
>> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>>> Silver 4410Y".
>>
>> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
>> on another hardware issue?
>
> Very possible, as according to Yan Zhao this doesn't reproduce on at
> least "Coffee Lake-S". Let me try to grab some random hardware around
> and I'll be back with my observations.

We have some interesting results :-)

In addition to Sapphire Rapids, the issue also reproduces on a Cascade
lake CPU (Intel(R) Xeon(R) Silver 4214 CPU) but does NOT reproduce on
Skylake (Intel(R) Xeon(R) Gold 5118 CPU). I don't have a lot of desktop
CPUs around, so can't say much.

AMD also doesn't seem to be affected, at leats AMD EPYC 7413 works fine.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-08-30  9:35   ` Vitaly Kuznetsov
  2024-08-30 11:05     ` Gerd Hoffmann
@ 2024-10-07 13:28     ` Linux regression tracking (Thorsten Leemhuis)
  2024-10-07 13:38       ` Vitaly Kuznetsov
  1 sibling, 1 reply; 61+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-07 13:28 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Gerd Hoffmann,
	Linux kernel regressions list

On 30.08.24 11:35, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>> even in the presence of aliased memtypes, thus there is no need to trust
>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>
>> Honoring guest PAT is desirable for use cases where the guest has access
>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>> with buffers that are consumed directly by the physical GPU, i.e. which
>> can't be proxied by the host to ensure writes from the guest are performed
>> with the correct memory type for the GPU.
> 
> Necroposting!
> 
> Turns out that this change broke "bochs-display" driver in QEMU even
> when the guest is modern (don't ask me 'who the hell uses bochs for
> modern guests', it was basically a configuration error :-). E.g:
> [...]

This regression made it to the list of tracked regressions. It seems
this thread stalled a while ago. Was this ever fixed? Does not look like
it, but I might have missed something. Or is this a regression I should
just ignore for one reason or another?


Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-10-07 13:28     ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-10-07 13:38       ` Vitaly Kuznetsov
  2024-10-07 14:04         ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 61+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-07 13:38 UTC (permalink / raw)
  To: Linux regression tracking (Thorsten Leemhuis),
	Sean Christopherson, Paolo Bonzini
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Gerd Hoffmann,
	Linux kernel regressions list

"Linux regression tracking (Thorsten Leemhuis)"
<regressions@leemhuis.info> writes:

> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> 
>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>> even in the presence of aliased memtypes, thus there is no need to trust
>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>
>>> Honoring guest PAT is desirable for use cases where the guest has access
>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>> can't be proxied by the host to ensure writes from the guest are performed
>>> with the correct memory type for the GPU.
>> 
>> Necroposting!
>> 
>> Turns out that this change broke "bochs-display" driver in QEMU even
>> when the guest is modern (don't ask me 'who the hell uses bochs for
>> modern guests', it was basically a configuration error :-). E.g:
>> [...]
>
> This regression made it to the list of tracked regressions. It seems
> this thread stalled a while ago. Was this ever fixed? Does not look like
> it, but I might have missed something. Or is this a regression I should
> just ignore for one reason or another?
>

The regression was addressed in by reverting 377b2f359d1f in 6.11

commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Sun Sep 15 02:49:33 2024 -0400

    Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

Also, there's a (pending) DRM patch fixing it from the guest's side:
https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e

-- 
Vitaly


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2024-10-07 13:38       ` Vitaly Kuznetsov
@ 2024-10-07 14:04         ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 61+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-07 14:04 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Gerd Hoffmann,
	Linux kernel regressions list

On 07.10.24 15:38, Vitaly Kuznetsov wrote:
> "Linux regression tracking (Thorsten Leemhuis)"
> <regressions@leemhuis.info> writes:
> 
>> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>>> Sean Christopherson <seanjc@google.com> writes:
>>>
>>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>>> even in the presence of aliased memtypes, thus there is no need to trust
>>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>>
>>>> Honoring guest PAT is desirable for use cases where the guest has access
>>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>>> can't be proxied by the host to ensure writes from the guest are performed
>>>> with the correct memory type for the GPU.
>>>
>>> Necroposting!
>>>
>>> Turns out that this change broke "bochs-display" driver in QEMU even
>>> when the guest is modern (don't ask me 'who the hell uses bochs for
>>> modern guests', it was basically a configuration error :-). E.g:
>>> [...]
>>
>> This regression made it to the list of tracked regressions. It seems
>> this thread stalled a while ago. Was this ever fixed? Does not look like
>> it, but I might have missed something. Or is this a regression I should
>> just ignore for one reason or another?
>>
> 
> The regression was addressed in by reverting 377b2f359d1f in 6.11
> 
> commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Sun Sep 15 02:49:33 2024 -0400
> 
>     Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

Thx. Sorry, missed that, thx for pointing me towards it. I had looked
for things like that, but seems I messed up my lore query. Apologies for
the noise!

> Also, there's a (pending) DRM patch fixing it from the guest's side:
> https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e

Great!

Ciao, Thorsten

P.S.:

#regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900



^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
                   ` (4 preceding siblings ...)
  2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
@ 2024-03-22  9:29 ` Ma, Yongwei
  2024-03-22 13:08 ` Yan Zhao
  2024-06-05 23:20 ` Sean Christopherson
  7 siblings, 0 replies; 61+ messages in thread
From: Ma, Yongwei @ 2024-03-22  9:29 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Tian, Kevin, Zhao, Yan Y,
	Yiwei Zhang

> First, rip out KVM's support for virtualizing guest MTRRs on VMX.  The code is
> costly to main, a drag on guest boot performance, imperfect, and not
> required for functional correctness with modern guest kernels.  Many details
> in patch 1's changelog.
> 
> With MTRR virtualization gone, always honor guest PAT on Intel CPUs that
> support self-snoop, as such CPUs are guaranteed to maintain coherency
> even if the guest is aliasing memtypes, e.g. if the host is using WB but the
> guest is using WC.  Honoring guest PAT is desirable for use cases where the
> guest must use WC when accessing memory that is DMA'd from a non-
> coherent device that does NOT bounce through VFIO, e.g. for mediated
> virtual GPUs.
> 
> The SRCU patch adds an API that is effectively documentation for the
> memory barrier in srcu_read_lock().  Intel CPUs with self-snoop require a
> memory barrier after VM-Exit to ensure coherency, and KVM always does a
> srcu_read_lock() before reading guest memory after VM-Exit.  Relying on
> SRCU to provide the barrier allows KVM to avoid emitting a redundant barrier
> of its own.
> 
> This series needs a _lot_ more testing; I arguably should have tagged it RFC,
> but I'm feeling lucky.
> 
> Sean Christopherson (3):
>   KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
>   KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
>   KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
> 
> Yan Zhao (2):
>   srcu: Add an API for a memory barrier after SRCU read lock
>   KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
> 
>  Documentation/virt/kvm/api.rst        |   6 +-
>  Documentation/virt/kvm/x86/errata.rst |  18 +
>  arch/x86/include/asm/kvm_host.h       |  15 +-
>  arch/x86/kvm/mmu.h                    |   7 +-
>  arch/x86/kvm/mmu/mmu.c                |  35 +-
>  arch/x86/kvm/mtrr.c                   | 644 ++------------------------
>  arch/x86/kvm/vmx/vmx.c                |  40 +-
>  arch/x86/kvm/x86.c                    |  24 +-
>  arch/x86/kvm/x86.h                    |   4 -
>  include/linux/srcu.h                  |  14 +
>  10 files changed, 105 insertions(+), 702 deletions(-)
> 
> 
> base-commit: 964d0c614c7f71917305a5afdca9178fe8231434
> --
> 2.44.0.278.ge034bb2e1d-goog
> 
Verified iGPU passthrough(GVT-d) on Intel platforms, TGL Core(TM) i5-1135G7/ADL Core(TM) i7-12700/RPL/MTL Ultra 7 + Ubuntu22.04 LTS.
Both Linux Ubuntu 22.04 VM and Windows10 VM could boot up successfully. 
3D benchmark GLmark2 can run as expected in the guest VM.

Tested-by: Yongwei Ma <yongwei.ma@intel.com>

Best Regards,
Yongwei Ma

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
                   ` (5 preceding siblings ...)
  2024-03-22  9:29 ` [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Ma, Yongwei
@ 2024-03-22 13:08 ` Yan Zhao
  2024-03-25  6:56   ` Ma, XiangfeiX
  2024-06-05 23:20 ` Sean Christopherson
  7 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2024-03-22 13:08 UTC (permalink / raw)
  To: Sean Christopherson, xiangfeix.ma, xudong.hao
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm, rcu, linux-kernel, Kevin Tian, Yiwei Zhang

Xiangfei found out an failure in kvm unit test rdtsc_vmexit_diff_test
with below error log:
"FAIL: RDTSC to VM-exit delta too high in 100 of 100 iterations, last = 902
FAIL: Guest didn't run to completion."

Fixed it by adding below lines in the unit test rdtsc_vmexit_diff_test before
enter guest in my side.
vmcs_write(HOST_PAT, 0x6);
vmcs_clear_bits(EXI_CONTROLS, EXI_SAVE_PAT);
vmcs_set_bits(EXI_CONTROLS, EXI_LOAD_PAT);


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-03-22 13:08 ` Yan Zhao
@ 2024-03-25  6:56   ` Ma, XiangfeiX
  2024-03-25  8:02     ` Ma, XiangfeiX
  0 siblings, 1 reply; 61+ messages in thread
From: Ma, XiangfeiX @ 2024-03-25  6:56 UTC (permalink / raw)
  To: Zhao, Yan Y, Sean Christopherson, Hao, Xudong
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Tian, Kevin, Yiwei Zhang

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>

I have verified this method which can solve the issue.

-----Original Message-----
From: Zhao, Yan Y <yan.y.zhao@intel.com> 
Sent: Friday, March 22, 2024 9:08 PM
To: Sean Christopherson <seanjc@google.com>; Ma, XiangfeiX <xiangfeix.ma@intel.com>; Hao, Xudong <xudong.hao@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; Lai Jiangshan <jiangshanlai@gmail.com>; Paul E. McKenney <paulmck@kernel.org>; Josh Triplett <josh@joshtriplett.org>; kvm@vger.kernel.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org; Tian, Kevin <kevin.tian@intel.com>; Yiwei Zhang <zzyiwei@google.com>
Subject: Re: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT

Xiangfei found out an failure in kvm unit test rdtsc_vmexit_diff_test with below error log:
"FAIL: RDTSC to VM-exit delta too high in 100 of 100 iterations, last = 902
FAIL: Guest didn't run to completion."

Fixed it by adding below lines in the unit test rdtsc_vmexit_diff_test before enter guest in my side.
vmcs_write(HOST_PAT, 0x6);
vmcs_clear_bits(EXI_CONTROLS, EXI_SAVE_PAT); vmcs_set_bits(EXI_CONTROLS, EXI_LOAD_PAT);


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-03-25  6:56   ` Ma, XiangfeiX
@ 2024-03-25  8:02     ` Ma, XiangfeiX
  0 siblings, 0 replies; 61+ messages in thread
From: Ma, XiangfeiX @ 2024-03-25  8:02 UTC (permalink / raw)
  To: Zhao, Yan Y, Sean Christopherson, Hao, Xudong
  Cc: Paolo Bonzini, Lai Jiangshan, Paul E. McKenney, Josh Triplett,
	kvm@vger.kernel.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, Tian, Kevin, Yiwei Zhang


Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>

Testing environment is based on the EMR-2S3 platform + CentOS 9(kernel version: 6.8.0-rc4).
Test cases include cpu, amx, umip, ptvmx, IPIv, vtd, PMU, SGX, kmv-unit-tests, kvm selftests, etc. And workload test on the guest using Netperf(bridge) and SPECJBB(passthrough NIC).
Except for the known issue and the previously mentioned "rdtsc_vmexit_diff_test", no other issue found.

-----Original Message-----
From: Ma, XiangfeiX 
Sent: Monday, March 25, 2024 2:56 PM
To: Zhao, Yan Y <yan.y.zhao@intel.com>; Sean Christopherson <seanjc@google.com>; Hao, Xudong <xudong.hao@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; Lai Jiangshan <jiangshanlai@gmail.com>; Paul E. McKenney <paulmck@kernel.org>; Josh Triplett <josh@joshtriplett.org>; kvm@vger.kernel.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org; Tian, Kevin <kevin.tian@intel.com>; Yiwei Zhang <zzyiwei@google.com>
Subject: RE: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>

I have verified this method which can solve the issue.

-----Original Message-----
From: Zhao, Yan Y <yan.y.zhao@intel.com> 
Sent: Friday, March 22, 2024 9:08 PM
To: Sean Christopherson <seanjc@google.com>; Ma, XiangfeiX <xiangfeix.ma@intel.com>; Hao, Xudong <xudong.hao@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; Lai Jiangshan <jiangshanlai@gmail.com>; Paul E. McKenney <paulmck@kernel.org>; Josh Triplett <josh@joshtriplett.org>; kvm@vger.kernel.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org; Tian, Kevin <kevin.tian@intel.com>; Yiwei Zhang <zzyiwei@google.com>
Subject: Re: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT

Xiangfei found out an failure in kvm unit test rdtsc_vmexit_diff_test with below error log:
"FAIL: RDTSC to VM-exit delta too high in 100 of 100 iterations, last = 902
FAIL: Guest didn't run to completion."

Fixed it by adding below lines in the unit test rdtsc_vmexit_diff_test before enter guest in my side.
vmcs_write(HOST_PAT, 0x6);
vmcs_clear_bits(EXI_CONTROLS, EXI_SAVE_PAT); vmcs_set_bits(EXI_CONTROLS, EXI_LOAD_PAT);


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
                   ` (6 preceding siblings ...)
  2024-03-22 13:08 ` Yan Zhao
@ 2024-06-05 23:20 ` Sean Christopherson
  2024-06-06  0:03   ` Paul E. McKenney
  7 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2024-06-05 23:20 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Lai Jiangshan,
	Paul E. McKenney, Josh Triplett
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Fri, 08 Mar 2024 17:09:24 -0800, Sean Christopherson wrote:
> First, rip out KVM's support for virtualizing guest MTRRs on VMX.  The
> code is costly to main, a drag on guest boot performance, imperfect, and
> not required for functional correctness with modern guest kernels.  Many
> details in patch 1's changelog.
> 
> With MTRR virtualization gone, always honor guest PAT on Intel CPUs that
> support self-snoop, as such CPUs are guaranteed to maintain coherency
> even if the guest is aliasing memtypes, e.g. if the host is using WB but
> the guest is using WC.  Honoring guest PAT is desirable for use cases
> where the guest must use WC when accessing memory that is DMA'd from a
> non-coherent device that does NOT bounce through VFIO, e.g. for mediated
> virtual GPUs.
> 
> [...]

Applied to kvm-x86 mtrrs, to get as much testing as possible before a potential
merge in 6.11.

Paul, if you can take a gander at patch 3, it would be much appreciated.

Thanks!

[1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
      https://github.com/kvm-x86/linux/commit/0a7b73559b39
[2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
      https://github.com/kvm-x86/linux/commit/e1548088ff54
[3/5] srcu: Add an API for a memory barrier after SRCU read lock
      https://github.com/kvm-x86/linux/commit/fcfe671e0879
[4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
      https://github.com/kvm-x86/linux/commit/eb8d8fc29286
[5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
      https://github.com/kvm-x86/linux/commit/95200f24b862

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT
  2024-06-05 23:20 ` Sean Christopherson
@ 2024-06-06  0:03   ` Paul E. McKenney
  0 siblings, 0 replies; 61+ messages in thread
From: Paul E. McKenney @ 2024-06-06  0:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, Josh Triplett, kvm, rcu,
	linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang

On Wed, Jun 05, 2024 at 04:20:34PM -0700, Sean Christopherson wrote:
> On Fri, 08 Mar 2024 17:09:24 -0800, Sean Christopherson wrote:
> > First, rip out KVM's support for virtualizing guest MTRRs on VMX.  The
> > code is costly to main, a drag on guest boot performance, imperfect, and
> > not required for functional correctness with modern guest kernels.  Many
> > details in patch 1's changelog.
> > 
> > With MTRR virtualization gone, always honor guest PAT on Intel CPUs that
> > support self-snoop, as such CPUs are guaranteed to maintain coherency
> > even if the guest is aliasing memtypes, e.g. if the host is using WB but
> > the guest is using WC.  Honoring guest PAT is desirable for use cases
> > where the guest must use WC when accessing memory that is DMA'd from a
> > non-coherent device that does NOT bounce through VFIO, e.g. for mediated
> > virtual GPUs.
> > 
> > [...]
> 
> Applied to kvm-x86 mtrrs, to get as much testing as possible before a potential
> merge in 6.11.
> 
> Paul, if you can take a gander at patch 3, it would be much appreciated.
> 
> Thanks!
> 
> [1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
>       https://github.com/kvm-x86/linux/commit/0a7b73559b39
> [2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
>       https://github.com/kvm-x86/linux/commit/e1548088ff54
> [3/5] srcu: Add an API for a memory barrier after SRCU read lock
>       https://github.com/kvm-x86/linux/commit/fcfe671e0879

Looks straightforward enough.  We could combine this with the existing
smp_mb__after_srcu_read_unlock(), but if we did that, someone would no
doubt come up with some clever optimization that provided a full barrier
in srcu_read_lock() but not in srcu_read_unlock() or vice versa.  So:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> [4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
>       https://github.com/kvm-x86/linux/commit/eb8d8fc29286
> [5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
>       https://github.com/kvm-x86/linux/commit/95200f24b862
> 
> --
> https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
@ 2025-04-10  1:13 Myrsky Lintu
  2025-04-10  5:12 ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Myrsky Lintu @ 2025-04-10  1:13 UTC (permalink / raw)
  To: Linux regressions mailing list, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini
  Cc: kvm, rcu, linux-kernel, Kevin Tian, Yan Zhao, Yiwei Zhang,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Gerd Hoffmann

Hello,

I am completely new to and uninformed about kernel development. I was 
pointed here from Mesa documentation for Venus (Vulkan encapsulation for 
KVM/QEMU): https://docs.mesa3d.org/drivers/venus.html

Based on my limited understanding of what has happened here, this patch 
series was partially reverted due to an issue with the Bochs DRM driver. 
A fix for that issue has been merged months ago according to the link 
provided in an earlier message. Since then work on this detail of KVM 
seems to have stalled.

Is it reasonable to ask here for this patch series to be evaluated and 
incorporated again?

My layperson's attempt at applying the series against 6.14.1 source code 
failed. In addition to the parts that appear to have already been 
incorporated there are some parts of the patch series that are rejected. 
I lack the knowledge to correct that.

Distro kernels currently ship without it which limits the usability of 
Venus on AMD and NVIDIA GPUs paired with Intel CPUs. Convincing 
individual distro maintainers of the necessity of this patch series 
without the specialized knowledge required for understanding what it 
does and performing that evaluation is quite hard. If upstream (kernel) 
would apply it now the distros would ship a kernel including the 
required changes to users, including me, without that multiplicated effort.

Thank you for your time. If this request is out of place here please 
forgive me for engaging this mailing list without a proper understanding 
of the list's scope.

On 2024-10-07 14:04:24, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 07.10.24 15:38, Vitaly Kuznetsov wrote:
>> "Linux regression tracking (Thorsten Leemhuis)"
>> <regressions@leemhuis.info> writes:
>>
>>> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>>>> Sean Christopherson <seanjc@google.com> writes:
>>>>
>>>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>>>> even in the presence of aliased memtypes, thus there is no need to trust
>>>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>>>
>>>>> Honoring guest PAT is desirable for use cases where the guest has access
>>>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>>>> can't be proxied by the host to ensure writes from the guest are performed
>>>>> with the correct memory type for the GPU.
>>>>
>>>> Necroposting!
>>>>
>>>> Turns out that this change broke "bochs-display" driver in QEMU even
>>>> when the guest is modern (don't ask me 'who the hell uses bochs for
>>>> modern guests', it was basically a configuration error :-). E.g:
>>>> [...]
>>>
>>> This regression made it to the list of tracked regressions. It seems
>>> this thread stalled a while ago. Was this ever fixed? Does not look like
>>> it, but I might have missed something. Or is this a regression I should
>>> just ignore for one reason or another?
>>>
>>
>> The regression was addressed in by reverting 377b2f359d1f in 6.11
>>
>> commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date:   Sun Sep 15 02:49:33 2024 -0400
>>
>>      Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
> 
> Thx. Sorry, missed that, thx for pointing me towards it. I had looked
> for things like that, but seems I messed up my lore query. Apologies for
> the noise!
> 
>> Also, there's a (pending) DRM patch fixing it from the guest's side:
>> https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e
> 
> Great!
> 
> Ciao, Thorsten
> 
> P.S.:
> 
> #regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900
> 
> 
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2025-04-10  1:13 [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Myrsky Lintu
@ 2025-04-10  5:12 ` Yan Zhao
  2025-04-10 10:05   ` Myrsky Lintu
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2025-04-10  5:12 UTC (permalink / raw)
  To: Myrsky Lintu
  Cc: Linux regressions mailing list, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, Gerd Hoffmann

Hi,

AFAIK, the commit c9c1e20b4c7d ("KVM: x86: Introduce Intel specific quirk
KVM_X86_QUIRK_IGNORE_GUEST_PAT") which re-allows honoring guest PAT on Intel's
platforms has been in kvm/queue now.

However, as the quirk is enabled by default, userspace(like QEMU) needs to turn
it off by code like "kvm_vm_enable_cap(kvm_state, KVM_CAP_DISABLE_QUIRKS2, 0,
KVM_X86_QUIRK_IGNORE_GUEST_PAT)" to honor guest PAT, according to the doc:

KVM_X86_QUIRK_IGNORE_GUEST_PAT   ...
                                 Userspace can disable the quirk to honor
                                 guest PAT if it knows that there is no such
                                 guest software, for example if it does not
                                 expose a bochs graphics device (which is
                                 known to have had a buggy driver).

Thanks
Yan

On Thu, Apr 10, 2025 at 01:13:18AM +0000, Myrsky Lintu wrote:
> Hello,
> 
> I am completely new to and uninformed about kernel development. I was 
> pointed here from Mesa documentation for Venus (Vulkan encapsulation for 
> KVM/QEMU): https://docs.mesa3d.org/drivers/venus.html
> 
> Based on my limited understanding of what has happened here, this patch 
> series was partially reverted due to an issue with the Bochs DRM driver. 
> A fix for that issue has been merged months ago according to the link 
> provided in an earlier message. Since then work on this detail of KVM 
> seems to have stalled.
> 
> Is it reasonable to ask here for this patch series to be evaluated and 
> incorporated again?
> 
> My layperson's attempt at applying the series against 6.14.1 source code 
> failed. In addition to the parts that appear to have already been 
> incorporated there are some parts of the patch series that are rejected. 
> I lack the knowledge to correct that.
> 
> Distro kernels currently ship without it which limits the usability of 
> Venus on AMD and NVIDIA GPUs paired with Intel CPUs. Convincing 
> individual distro maintainers of the necessity of this patch series 
> without the specialized knowledge required for understanding what it 
> does and performing that evaluation is quite hard. If upstream (kernel) 
> would apply it now the distros would ship a kernel including the 
> required changes to users, including me, without that multiplicated effort.
> 
> Thank you for your time. If this request is out of place here please 
> forgive me for engaging this mailing list without a proper understanding 
> of the list's scope.
> 
> On 2024-10-07 14:04:24, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 07.10.24 15:38, Vitaly Kuznetsov wrote:
> >> "Linux regression tracking (Thorsten Leemhuis)"
> >> <regressions@leemhuis.info> writes:
> >>
> >>> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
> >>>> Sean Christopherson <seanjc@google.com> writes:
> >>>>
> >>>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> >>>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
> >>>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> >>>>> even in the presence of aliased memtypes, thus there is no need to trust
> >>>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
> >>>>>
> >>>>> Honoring guest PAT is desirable for use cases where the guest has access
> >>>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> >>>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> >>>>> with buffers that are consumed directly by the physical GPU, i.e. which
> >>>>> can't be proxied by the host to ensure writes from the guest are performed
> >>>>> with the correct memory type for the GPU.
> >>>>
> >>>> Necroposting!
> >>>>
> >>>> Turns out that this change broke "bochs-display" driver in QEMU even
> >>>> when the guest is modern (don't ask me 'who the hell uses bochs for
> >>>> modern guests', it was basically a configuration error :-). E.g:
> >>>> [...]
> >>>
> >>> This regression made it to the list of tracked regressions. It seems
> >>> this thread stalled a while ago. Was this ever fixed? Does not look like
> >>> it, but I might have missed something. Or is this a regression I should
> >>> just ignore for one reason or another?
> >>>
> >>
> >> The regression was addressed in by reverting 377b2f359d1f in 6.11
> >>
> >> commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
> >> Author: Paolo Bonzini <pbonzini@redhat.com>
> >> Date:   Sun Sep 15 02:49:33 2024 -0400
> >>
> >>      Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
> > 
> > Thx. Sorry, missed that, thx for pointing me towards it. I had looked
> > for things like that, but seems I messed up my lore query. Apologies for
> > the noise!
> > 
> >> Also, there's a (pending) DRM patch fixing it from the guest's side:
> >> https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e
> > 
> > Great!
> > 
> > Ciao, Thorsten
> > 
> > P.S.:
> > 
> > #regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900
> > 
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
  2025-04-10  5:12 ` Yan Zhao
@ 2025-04-10 10:05   ` Myrsky Lintu
  0 siblings, 0 replies; 61+ messages in thread
From: Myrsky Lintu @ 2025-04-10 10:05 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Linux regressions mailing list, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, kvm, rcu, linux-kernel,
	Kevin Tian, Yiwei Zhang, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, Gerd Hoffmann

Thank you. I will try to bring this up with QEMU developers then.

On 2025-04-10 05:12:01, Yan Zhao wrote:
> Hi,
> 
> AFAIK, the commit c9c1e20b4c7d ("KVM: x86: Introduce Intel specific quirk
> KVM_X86_QUIRK_IGNORE_GUEST_PAT") which re-allows honoring guest PAT on Intel's
> platforms has been in kvm/queue now.
> 
> However, as the quirk is enabled by default, userspace(like QEMU) needs to turn
> it off by code like "kvm_vm_enable_cap(kvm_state, KVM_CAP_DISABLE_QUIRKS2, 0,
> KVM_X86_QUIRK_IGNORE_GUEST_PAT)" to honor guest PAT, according to the doc:
> 
> KVM_X86_QUIRK_IGNORE_GUEST_PAT   ...
>                                   Userspace can disable the quirk to honor
>                                   guest PAT if it knows that there is no such
>                                   guest software, for example if it does not
>                                   expose a bochs graphics device (which is
>                                   known to have had a buggy driver).
> 
> Thanks
> Yan
> 
> On Thu, Apr 10, 2025 at 01:13:18AM +0000, Myrsky Lintu wrote:
>> Hello,
>>
>> I am completely new to and uninformed about kernel development. I was
>> pointed here from Mesa documentation for Venus (Vulkan encapsulation for
>> KVM/QEMU): https://docs.mesa3d.org/drivers/venus.html
>>
>> Based on my limited understanding of what has happened here, this patch
>> series was partially reverted due to an issue with the Bochs DRM driver.
>> A fix for that issue has been merged months ago according to the link
>> provided in an earlier message. Since then work on this detail of KVM
>> seems to have stalled.
>>
>> Is it reasonable to ask here for this patch series to be evaluated and
>> incorporated again?
>>
>> My layperson's attempt at applying the series against 6.14.1 source code
>> failed. In addition to the parts that appear to have already been
>> incorporated there are some parts of the patch series that are rejected.
>> I lack the knowledge to correct that.
>>
>> Distro kernels currently ship without it which limits the usability of
>> Venus on AMD and NVIDIA GPUs paired with Intel CPUs. Convincing
>> individual distro maintainers of the necessity of this patch series
>> without the specialized knowledge required for understanding what it
>> does and performing that evaluation is quite hard. If upstream (kernel)
>> would apply it now the distros would ship a kernel including the
>> required changes to users, including me, without that multiplicated effort.
>>
>> Thank you for your time. If this request is out of place here please
>> forgive me for engaging this mailing list without a proper understanding
>> of the list's scope.
>>
>> On 2024-10-07 14:04:24, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 07.10.24 15:38, Vitaly Kuznetsov wrote:
>>>> "Linux regression tracking (Thorsten Leemhuis)"
>>>> <regressions@leemhuis.info> writes:
>>>>
>>>>> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>>>>>> Sean Christopherson <seanjc@google.com> writes:
>>>>>>
>>>>>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>>>>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>>>>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>>>>>> even in the presence of aliased memtypes, thus there is no need to trust
>>>>>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>>>>>
>>>>>>> Honoring guest PAT is desirable for use cases where the guest has access
>>>>>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>>>>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>>>>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>>>>>> can't be proxied by the host to ensure writes from the guest are performed
>>>>>>> with the correct memory type for the GPU.
>>>>>>
>>>>>> Necroposting!
>>>>>>
>>>>>> Turns out that this change broke "bochs-display" driver in QEMU even
>>>>>> when the guest is modern (don't ask me 'who the hell uses bochs for
>>>>>> modern guests', it was basically a configuration error :-). E.g:
>>>>>> [...]
>>>>>
>>>>> This regression made it to the list of tracked regressions. It seems
>>>>> this thread stalled a while ago. Was this ever fixed? Does not look like
>>>>> it, but I might have missed something. Or is this a regression I should
>>>>> just ignore for one reason or another?
>>>>>
>>>>
>>>> The regression was addressed in by reverting 377b2f359d1f in 6.11
>>>>
>>>> commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
>>>> Author: Paolo Bonzini <pbonzini@redhat.com>
>>>> Date:   Sun Sep 15 02:49:33 2024 -0400
>>>>
>>>>       Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
>>>
>>> Thx. Sorry, missed that, thx for pointing me towards it. I had looked
>>> for things like that, but seems I messed up my lore query. Apologies for
>>> the noise!
>>>
>>>> Also, there's a (pending) DRM patch fixing it from the guest's side:
>>>> https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e
>>>
>>> Great!
>>>
>>> Ciao, Thorsten
>>>
>>> P.S.:
>>>
>>> #regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900
>>>
>>>
>>>
>>
>>



^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2025-04-10 10:05 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-09  1:09 [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Sean Christopherson
2024-03-09  1:09 ` [PATCH 1/5] KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Sean Christopherson
2024-03-11  7:44   ` Yan Zhao
2024-03-12  0:08     ` Sean Christopherson
2024-03-12  1:10   ` Dongli Zhang
2024-03-12 17:08     ` Sean Christopherson
2024-03-14 10:31       ` Dongli Zhang
2024-03-14 14:47         ` Sean Christopherson
2024-03-09  1:09 ` [PATCH 2/5] KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 Sean Christopherson
2024-03-09  1:09 ` [PATCH 3/5] srcu: Add an API for a memory barrier after SRCU read lock Sean Christopherson
2024-03-09  1:09 ` [PATCH 4/5] KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Sean Christopherson
2024-06-20 22:38   ` Paolo Bonzini
2024-06-20 23:42     ` Paul E. McKenney
2024-06-21  0:52     ` Yan Zhao
2024-03-09  1:09 ` [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Sean Christopherson
2024-03-11  1:16   ` Yan Zhao
2024-03-12  0:25     ` Sean Christopherson
2024-03-12  7:30       ` Tian, Kevin
2024-03-12 16:07         ` Sean Christopherson
2024-03-13  1:18           ` Yan Zhao
2024-03-13  8:52             ` Tian, Kevin
2024-03-13  8:55               ` Yan Zhao
2024-03-13 15:09                 ` Sean Christopherson
2024-03-14  0:12                   ` Yan Zhao
2024-03-14  1:00                     ` Sean Christopherson
2024-03-25  3:43   ` Chao Gao
2024-04-01 22:29     ` Sean Christopherson
2024-08-30  9:35   ` Vitaly Kuznetsov
2024-08-30 11:05     ` Gerd Hoffmann
2024-08-30 13:47       ` Vitaly Kuznetsov
2024-08-30 13:52         ` Sean Christopherson
2024-08-30 14:06           ` Vitaly Kuznetsov
2024-08-30 14:37             ` Vitaly Kuznetsov
2024-08-30 16:13               ` Sean Christopherson
2024-09-02  8:23                 ` Gerd Hoffmann
2024-09-02  1:44         ` Yan Zhao
2024-09-02  9:49           ` Vitaly Kuznetsov
2024-09-03  0:25             ` Yan Zhao
2024-09-03 15:30             ` Sean Christopherson
2024-09-03 16:20               ` Vitaly Kuznetsov
2024-09-04  2:28                 ` Yan Zhao
2024-09-04 12:17                   ` Yan Zhao
2024-09-05  0:41                     ` Sean Christopherson
2024-09-05  9:43                       ` Yan Zhao
2024-09-09  5:30                         ` Yan Zhao
2024-09-09 13:24                           ` Paolo Bonzini
2024-09-09 16:04                             ` Sean Christopherson
2024-09-10  1:05                             ` Yan Zhao
2024-09-04 11:47                 ` Vitaly Kuznetsov
2024-10-07 13:28     ` Linux regression tracking (Thorsten Leemhuis)
2024-10-07 13:38       ` Vitaly Kuznetsov
2024-10-07 14:04         ` Linux regression tracking (Thorsten Leemhuis)
2024-03-22  9:29 ` [PATCH 0/5] KVM: VMX: Drop MTRR virtualization, honor guest PAT Ma, Yongwei
2024-03-22 13:08 ` Yan Zhao
2024-03-25  6:56   ` Ma, XiangfeiX
2024-03-25  8:02     ` Ma, XiangfeiX
2024-06-05 23:20 ` Sean Christopherson
2024-06-06  0:03   ` Paul E. McKenney
  -- strict thread matches above, loose matches on Subject: below --
2025-04-10  1:13 [PATCH 5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Myrsky Lintu
2025-04-10  5:12 ` Yan Zhao
2025-04-10 10:05   ` Myrsky Lintu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).