public inbox for kvm@vger.kernel.org
* A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
@ 2023-08-12 23:04 Yibo Huang
  2023-10-27 23:13 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yibo Huang @ 2023-08-12 23:04 UTC (permalink / raw)
  To: kvm

Hi KVM community,

I am writing to ask how KVM emulates the effect of guest MTRRs on AMD platforms.

Since there is no hardware support for guest MTRRs, a VMM can simulate their effect by altering the memory types in the EPT/NPT. From my understanding, this is exactly what KVM does on Intel platforms. More specifically, in arch/x86/kvm/mmu/spte.c, #make_spte() tries to honor the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the memory type indicated by the guest MTRRs and applying it to the EPT. On Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is #vmx_get_mt_mask(), which calls #kvm_mtrr_get_guest_memory_type() to get the memory type indicated by the guest MTRRs.

However, on AMD platforms, KVM does not implement #kvm_x86_ops.get_mt_mask() at all, so it effectively returns zero. Does that mean KVM does not use the NPT to emulate the effect of guest MTRRs on AMD platforms? I tried, but failed, to find out what KVM does on AMD platforms.

Can someone help me understand how KVM emulates the effect of guest MTRRs on AMD platforms? Thanks a lot!

Best,
Yibo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-08-12 23:04 A question about how the KVM emulates the effect of guest MTRRs on AMD platforms Yibo Huang
@ 2023-10-27 23:13 ` Sean Christopherson
  2023-10-30 12:16   ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-10-27 23:13 UTC (permalink / raw)
  To: Yibo Huang; +Cc: kvm, Yan Zhao

+Yan

On Sat, Aug 12, 2023, Yibo Huang wrote:
> Hi the KVM community,
> 
> I am sending this email to ask about how the KVM emulates the effect of guest
> MTRRs on AMD platforms.
> 
> Since there is no hardware support for guest MTRRs, the VMM can simulate
> their effect by altering the memory types in the EPT/NPT. From my
> understanding, this is exactly what the KVM does for Intel platforms. More
> specifically, in arch/x86/kvm/mmu/spte.c #make_spte(), the KVM tries to
> respect the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the
> memory types indicated by the guest MTRRs and applying that to the EPT. For
> Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is
> #vmx_get_mt_mask(), which calls the #kvm_mtrr_get_guest_memory_type() to get
> the memory types indicated by the guest MTRRs.

KVM doesn't always honor guest MTRRs; KVM only does all of this if there is a
passthrough device with non-coherent DMA attached to the VM.  There's actually
an outstanding issue with virtio-gpu where non-coherent GPUs are flaky because
KVM doesn't stuff the EPT memtype, since KVM isn't aware of the non-coherent DMA.

> However, on AMD platforms, the KVM does not implement
> #kvm_x86_ops.get_mt_mask() at all, so it just returns zero. Does it mean that
> the KVM does not use the NPT to emulate the effect of guest MTRRs on AMD
> platforms? I tried but failed to find out how the KVM does for AMD platforms.

Correct.  The short answer is that SVM+NPT obviates the need to emulate guest
MTRRs for real world guest workloads.

The shortcomings of VMX+EPT are that (a) guest CR0.CD isn't virtualized by
hardware and (b) AFAIK, if the guest accesses memory with PAT=WC to memory that
the host has accessed with PAT=WB (and MTRR=WB), the CPU will *not* snoop caches
on the guest access.

SVM on the other hand fully virtualizes CR0.CD, and NPT is quite clever in how
it handles guest WC:

  A new memory type WC+ is introduced. WC+ is an uncacheable memory type, and
  combines writes in write-combining buffers like WC. Unlike WC (but like the CD
  memory type), accesses to WC+ memory also snoop the caches on all processors
  (including self-snooping the caches of the processor issuing the request) to
  maintain coherency. This ensures that cacheable writes are observed by WC+ accesses.

And VMRUN (and #VMEXIT) flush the WC buffers, e.g. if the guest is using WB and
the host is using WC, things will still work as expected (well, maybe not for
cases where the host is writing and the guest is reading from different CPUs).
Anyway, as evidenced by the lack of bug reports over the last decade, snooping
the caches on guest WC accesses is sufficient for practical purposes.

Hrm, but typing all that out, I have absolutely no idea why VMX+EPT cares about
guest MTRRs.  Honoring guest PAT I totally get, but the guest MTRRs make no sense.
E.g. I have a very hard time believing a real world guest kernel mucks with the
MTRRs to set up DMA.  And again, this is supported by the absence of bug reports
on AMD.


Yan,

You've been digging into this code recently, am I forgetting something because
it's late on a Friday?  Or have we been making the very bad assumption that KVM
code from 10+ years ago actually makes sense?  I.e. for non-coherent DMA, can we
delete all of the MTRR insanity and simply clear IPAT?


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-27 23:13 ` Sean Christopherson
@ 2023-10-30 12:16   ` Yan Zhao
  2023-10-30 19:24     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2023-10-30 12:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote:
> +Yan
> 
> On Sat, Aug 12, 2023, Yibo Huang wrote:
> > Hi the KVM community,
> > 
> > I am sending this email to ask about how the KVM emulates the effect of guest
> > MTRRs on AMD platforms.
> > 
> > Since there is no hardware support for guest MTRRs, the VMM can simulate
> > their effect by altering the memory types in the EPT/NPT. From my
> > understanding, this is exactly what the KVM does for Intel platforms. More
> > specifically, in arch/x86/kvm/mmu/spte.c #make_spte(), the KVM tries to
> > respect the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the
> > memory types indicated by the guest MTRRs and applying that to the EPT. For
> > Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is
> > #vmx_get_mt_mask(), which calls the #kvm_mtrr_get_guest_memory_type() to get
> > the memory types indicated by the guest MTRRs.
> 
> KVM doesn't always honor guest MTRRs; KVM only does all of this if there is a
> passthrough device with non-coherent DMA attached to the VM.  There's actually
> an outstanding issue with virtio-gpu where non-coherent GPUs are flaky because
> KVM doesn't stuff the EPT memtype, since KVM isn't aware of the non-coherent DMA.
> 
> > However, on AMD platforms, the KVM does not implement
> > #kvm_x86_ops.get_mt_mask() at all, so it just returns zero. Does it mean that
> > the KVM does not use the NPT to emulate the effect of guest MTRRs on AMD
> > platforms? I tried but failed to find out how the KVM does for AMD platforms.
> 
> Correct.  The short answer is that SVM+NPT obviates the need to emulate guest
> MTRRs for real world guest workloads.
> 
> The shortcomings of VMX+EPT are that (a) guest CR0.CD isn't virtualized by
> hardware and (b) AFAIK, if the guest accesses memory with PAT=WC to memory that
> the host has accessed with PAT=WB (and MTRR=WB), the CPU will *not* snoop caches
> on the guest access.
> 
> SVM on the other hand fully virtualizes CR0.CD, and NPT is quite clever in how
> it handles guest WC:
> 
>   A new memory type WC+ is introduced. WC+ is an uncacheable memory type, and
>   combines writes in write-combining buffers like WC. Unlike WC (but like the CD
>   memory type), accesses to WC+ memory also snoop the caches on all processors
>   (including self-snooping the caches of the processor issuing the request) to
>   maintain coherency. This ensures that cacheable writes are observed by WC+ accesses.
> 
> And VMRUN (and #VMEXIT) flush the WC buffers, e.g. if the guest is using WB and
> the host is using WC, things will still work as expected (well, maybe not for
> cases where the host is writing and the guest is reading from different CPUs).
> Anyways, evidenced by the lack of bug reports over the last decade, for practical
> purposes snooping the caches on guest WC accesses is sufficient.
> 
> Hrm, but typing all that out, I have absolutely no idea why VMX+EPT cares about
> guest MTRRs.  Honoring guest PAT I totally get, but the guest MTRRs make no sense.
I think honoring guest MTRRs is needed because VMX+EPT relies on the guest to issue
clflush or wbinvd in cases like EPT=WC/UC + guest PAT=WB for non-coherent DMA devices.
So, in order to keep the guest driver's view of the memory type consistent with the
host's effective memory type, current KVM programs the EPT with the value of the
guest MTRRs.

If the EPT only honors guest PAT and is set to WB while the guest MTRR is WC or UC,
then a guest driver that thinks the effective memory type is WC or UC will not do
the cache flushes correctly.

That said, I don't see Linux guest drivers checking the combination of guest MTRR +
guest PAT directly.

Instead, when a Linux guest driver wants to program a PAT type, it checks the guest
MTRRs to see if the request is feasible:

remap_pfn_range
  reserve_pfn_range
    memtype_reserve
      pat_x_mtrr_type

So, before the guest programs the PAT to WB, it should find that the guest MTRR is
WC/UC and either return WC/UC as the PAT type or just fail.

In this regard, I think honoring only guest PAT also makes sense.

> E.g. I have a very hard time believing a real world guest kernel mucks with the
> MTRRs to set up DMA.  And again, this is supported by the absence of bug reports
> on AMD.
> 
> 
> Yan,
> 
> You've been digging into this code recently, am I forgetting something because
> it's late on a Friday?  Or have we been making the very bad assumption that KVM
> code from 10+ years ago actually makes sense?  I.e. for non-coherent DMA, can we
> delete all of the MTRR insanity and simply clear IPAT?
Not sure if there are guest drivers that program the PAT as WB but treat the memory
type as UC.
In theory, honoring guest MTRRs is the safest way.
Do you think a complete analysis of all corner cases is warranted?
I'm happy if we can remove all the MTRR stuff in VMX :)





* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-30 12:16   ` Yan Zhao
@ 2023-10-30 19:24     ` Sean Christopherson
       [not found]       ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>
  2023-10-31 10:01       ` Yan Zhao
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2023-10-30 19:24 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Yibo Huang, kvm

On Mon, Oct 30, 2023, Yan Zhao wrote:
> On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote:
> > E.g. I have a very hard time believing a real world guest kernel mucks with the
> > MTRRs to set up DMA.  And again, this is supported by the absence of bug reports
> > on AMD.
> > 
> > 
> > Yan,
> > 
> > You've been digging into this code recently, am I forgetting something because
> > it's late on a Friday?  Or have we been making the very bad assumption that KVM
> > code from 10+ years ago actually makes sense?  I.e. for non-coherent DMA, can we
> > delete all of the MTRR insanity and simply clear IPAT?
> Not sure if there are guest drivers that program the PAT as WB but treat the
> memory type as UC.
> In theory, honoring guest MTRRs is the safest way.
> Do you think a complete analysis of all corner cases is warranted?

I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach
makes no sense, at all.  From a hardware virtualization perspective of guest MTRRs,
there is _nothing_ special about EPT.  Legacy shadow paging doesn't magically
account for guest MTRRs, nor does NPT.

The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into
no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that
behavior by forcing UC.

TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT,
I think we should delete that code entirely and take a KVM errata to formally
document that KVM doesn't virtualize guest MTRRs.  In addition to solving the
performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+
lines of complex code (the overlay shenanigans used for fixed MTRRs are downright
mean).

 arch/x86/include/asm/kvm_host.h |  15 +---
 arch/x86/kvm/mmu/mmu.c          |  16 ----
 arch/x86/kvm/mtrr.c             | 644 ++++++-------------------------------------------------------------------------------------------------------------------------------
 arch/x86/kvm/vmx/vmx.c          |  12 +--
 arch/x86/kvm/x86.c              |   1 -
 arch/x86/kvm/x86.h              |   4 -
 6 files changed, 36 insertions(+), 656 deletions(-)

Digging deeper through the history, this *mostly* appears to be the result of coming
to the completely wrong conclusion for handling memtypes during EPT and VT-d enabling.

The zapping GFNs logic came from

  commit efdfe536d8c643391e19d5726b072f82964bfbdb
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update
    
    Currently, whenever guest MTRR registers are changed
    kvm_mmu_reset_context is called to switch to the new root shadow page
    table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that
       the original root shadow page will be reused
    
    2) the cache type is set on the last spte, that means we should sync
       the last sptes when MTRR is changed
    
    This patch fixs this issue by drop all the spte in the gfn range which
    is being updated by MTRR

which was a fix for 

  commit 0bed3b568b68e5835ef5da888a372b9beabf7544
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200
  
      KVM: Improve MTRR structure
      
      As well as reset mmu context when set MTRR.

(side topic, if anyone wonders why I am so particular about changelogs, the above
is exactly why.)

Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added

+	if (mt_mask) {
+		mt_mask = get_memory_type(vcpu, gfn) <<
+			  kvm_x86_ops->get_mt_mask_shift();
+		spte |= mt_mask;
+	}

where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype
for a given gfn.  And *very* subtly, at the time of that change, KVM *always*
set VMX_EPT_IGMT_BIT,

        kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
                VMX_EPT_WRITABLE_MASK |
                VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
                VMX_EPT_IGMT_BIT);

which came in via

  commit 928d4bf747e9c290b690ff515d8f81e8ee226d97
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200
  
      KVM: VMX: Set IGMT bit in EPT entry
      
      There is a potential issue that, when guest using pagetable without vmexit when
      EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
      which would be inconsistent with host side and would cause host MCE due to
      inconsistent cache attribute.
      
      The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
      memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates!  The AuthorDates strongly suggest Sheng Yang added the whole
IGMT thing as a bug fix for issues that were detected during EPT + VT-d + passthrough
enabling, but Avi applied it earlier because it was a generic fix.

Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant
code, or rather lack thereof, is the handling of *host* MMIO.  That fix came in a
bit later, but given the author and timing, I think it's safe to say it was all
part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee864f4fb6402638dd2f263324ac839f
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT
    
    Software are not allow to access device MMIO using cacheable memory type, the
    patch limit MMIO region with UC and WC(guest can select WC using PAT and
    PCD/PWT).

In addition to the host MMIO and IGMT issues, this code was obviously never tested
on NPT until much later, which lends further credence to my theory/argument that
this was all the result of misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
Yang was trying to resolve issues with passthrough MMIO.

 * Sheng Yang 
  : Do you mean host(qemu) would access this memory and if we set it to guest 
  : MTRR, host access would be broken? We would cover this in our shadow MTRR 
  : patch, for we encountered this in video ram when doing some experiment with 
  : VGA assignment. 

And in the same thread, there's also what appears to be confirmation of Intel
running into issues with Windows XP related to a guest device driver mapping
DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
the fact that EPT and NPT both honor guest PAT by default.  /facepalm
 
 * Avi Kivity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest 
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in 
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer, 
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned 
  : > memory region memory type follow guest's? 
  : >   
  : 
  : With ept/npt, we can't, since the memory type is in the guest's 
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what I think happened is that 15 years ago, a few engineers
(a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough
device MMIO by emulating *guest* MTRRs.  Except for the below case, everything since
then has been a result of those two intertwined changes.

The one exception, which is actually yet more confirmation of all of the above,
is the revert of Paolo's attempt at "full" virtualization of guest MTRRs:

  commit 606decd67049217684e3cb5a54104d51ddd4ef35
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
    
    This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5.
    It was reported to cause Machine Check Exceptions (bug 104091).

...

  commit fd717f11015f673487ffc826e59b2bad69d20fe5
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages
    
    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
    However, the guest could prefer a different page type than UC for
    such pages. A good example is that pass-throughed VGA frame buffer is
    not always UC as host expected.
    
    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior from before "Map device MMIO as UC in EPT"
and got the same result: machine checks, likely due to the guest MTRRs not being
trustworthy/sane at all times.

And FWIW, Paolo also tried to enable MTRR virtualization on NPT, but that too got
reverted.  I read through the threads, and AFAICT no one ever found a smoking gun;
exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times
was never definitively root-caused.

  commit fc07e76ac7ffa3afd621a1c3858a503386a14281
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"
    
    This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
    Initializing the mapping from MTRR to PAT values was reported to
    fail nondeterministically, and it also caused extremely slow boot
    (due to caching getting disabled---bug 103321) with assigned devices.

...

  commit 3c2e7f7de3240216042b61073803b61b9b3cfb22
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes
    
    Right now, NPT page attributes are not used, and the final page
    attribute depends solely on gPAT (which however is not synced
    correctly), the guest MTRRs and the guest page attributes.
    
    However, we can do better by mimicking what is done for VMX.
    In the absence of PCI passthrough, the guest PAT can be ignored
    and the page attributes can be just WB.  If passthrough is being
    used, instead, keep respecting the guest PAT, and emulate the guest
    MTRRs through the PAT field of the nested page tables.
    
    The only snag is that WP memory cannot be emulated correctly,
    because Linux's default PAT setting only includes the other types.

In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX
was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not
forcing UC for host MMIO.  And while there *are* known cases where honoring guest
MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for performance,
not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging
proves that while KVM theoretically can do better, it's by no means necessary for
correctness.

Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup,
and the host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards, i.e. it would be far
better for host userspace to communicate its desires directly to KVM (or perhaps
indirectly via VMAs in the host kernel, just not through guest MTRRs).


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
       [not found]       ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>
@ 2023-10-30 21:52         ` Sean Christopherson
  2023-11-01  3:07           ` Yibo Huang
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-10-30 21:52 UTC (permalink / raw)
  To: Yibo Huang; +Cc: Yan Zhao, kvm

On Mon, Oct 30, 2023, Yibo Huang wrote:
> Well, I agree with Sean’s opinion that SVM+NPT obviates the need to emulate
> guest MTRRs for real-world guest workloads.  However, from my own experience,
> I think KVM does emulate the effect of guest MTRRs on AMD platforms.
> 
> Here's the reason:
> 2 months ago, I was trying to attach a QEMU ivshmem device to my VMs running
> on Intel and AMD machines.  Since ivshmem is an emulated memory-backed
> device, it should be cacheable to get the best performance.
> Interestingly, I found that the memory region associated with ivshmem (PCIe
> BAR 2 region) was cacheable on Intel machines, but not cacheable on AMD
> machines.
> After some digging, I found that this was because of the guest MTRRs - on AMD
> machines, the BIOS or guest OS (not sure which) set the memory region of
> ivshmem as non-cacheable in the guest MTRRs (but cacheable in the guest PAT).
> This was supported by the fact that ivshmem became cacheable after removing the
> corresponding guest MTRR (reg02) on AMD machines (using "echo -n disable=2 >
> /proc/mtrr")
> Additionally, the reason ivshmem was cacheable on Intel machines was that the
> BIOS or guest OS didn't set ivshmem as uncacheable in the guest MTRRs on Intel
> machines (not sure why though).

What test(s) did you run to determine whether or not the memory was truly cacheable?
KVM emulates the MTRR MSRs themselves, e.g. the guest can read and write MTRRs,
and the guest will _think_ memory has a certain memtype, but that doesn't necessarily
have any impact on the memtype used by the CPU.

> Below is the output of “cat /proc/mtrr” on my VMs running on AMD machines. By
> removing reg02, ivshmem BAR 2 region became cacheable.
> 
> 
> So in my opinion, the above phenomenon suggests that KVM does honor guest
> MTRRs on AMD platforms.

Heh, this isn't opinion.  Unless you're running a very specific 10-year old kernel,
or a custom KVM build, KVM simply doesn't propagate guest MTRRs into NPT.

And unless your setup also has non-coherent DMA attached to the device, KVM doesn't
honor guest MTRRs for EPT either (AFAICT, QEMU ivshmem doesn't require VFIO).

It's definitely possible that disabling a guest MTRR resulted in memory becoming
cacheable, but unless there's some very, very magical code hiding, it's not because
KVM actually fully virtualizes guest MTRRs on AMD.

E.g. before commit 9a3768191d95 ("KVM: x86/mmu: Zap SPTEs on MTRR update iff guest
MTRRs are honored"), which hasn't even made its way to Linus (or Paolo's) tree yet,
KVM unnecessarily zapped all NPT entries on MTRR changes.  Zapping NPT entries
could have cleared some weird TLB state, or perhaps even wiped out buggy KVM NPT
entries.

And on AMD, hardware virtualizes gCR0.CD, i.e. puts the caches into no-fill mode
when guest CR0.CD=1.  But Intel CPUs completely ignore guest CR0.CD, i.e. punt it
to software, and under QEMU, for all intents and purposes KVM never honors guest
CR0.CD for VMX.  It seems highly unlikely that something in the guest left
CR0.CD=1, but it's possible.  And then the guest kernel's process of toggling
CR0.CD when doing MTRR updates would end up clearing CR0.CD and thus re-enabling
caching.

> The thing was that I could not find any KVM code related to emulating guest
> MTRRs on AMD platforms, which was the reason why I decided to send the
> initial email asking about it.
> 
> I found this in the AMD64 Architecture Programmer’s Manual Volumes 1–5 (page
> 553): 
> 
> "Table 15-19 shows how guest and host PAT types are combined into an
> effective PAT type. When interpreting this table, recall (a) that guest and
> host PAT types are not combined when nested paging is disabled and (b) that
> the intent is for the VMM to use its PAT type to simulate guest MTRRs.”
> 
> Does this mean that AMD expects the VMM to emulate the effect of guest MTRRs
> by altering the host PAT types?

Yes.  Which is exactly what KVM did in commit 3c2e7f7de324 ("KVM: SVM: use NPT
page attributes"), which was reverted a few months after it was introduced.

> I am not sure if I misunderstood something. But I can reproduce the example I
> mentioned above if you would like to look into it.

Yes, it would be helpful to confirm what's going on.  


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-30 19:24     ` Sean Christopherson
       [not found]       ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>
@ 2023-10-31 10:01       ` Yan Zhao
  2023-10-31 15:14         ` Sean Christopherson
  1 sibling, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2023-10-31 10:01 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Yan Zhao wrote:
> > On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote:
> > > E.g. I have a very hard time believing a real world guest kernel mucks with the
> > > MTRRs to set up DMA.  And again, this is supported by the absence of bug reports
> > > on AMD.
> > > 
> > > 
> > > Yan,
> > > 
> > > You've been digging into this code recently, am I forgetting something because
> > > it's late on a Friday?  Or have we been making the very bad assumption that KVM
> > > code from 10+ years ago actually makes sense?  I.e. for non-coherent DMA, can we
> > > delete all of the MTRR insanity and simply clear IPAT?
> > Not sure if there are guest drivers that program the PAT as WB but treat the
> > memory type as UC.
> > In theory, honoring guest MTRRs is the safest way.
> > Do you think a complete analysis of all corner cases is warranted?
> 
> I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach
> makes no sense, at all.  From a hardware virtualization perspective of guest MTRRs,
> there is _nothing_ special about EPT.  Legacy shadow paging doesn't magically
> account for guest MTRRs, nor does NPT.
> 
> The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into
> no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that
> behavior by forcing UC.
> 
> TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT,
> I think we should delete that code entirely and take a KVM errata to formally
> document that KVM doesn't virtualize guest MTRRs.  In addition to solving the
> performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+
> lines of complex code (the overlay shenanigans used for fixed MTRRs are downright
> mean).
> 
>  arch/x86/include/asm/kvm_host.h |  15 +---
>  arch/x86/kvm/mmu/mmu.c          |  16 ----
>  arch/x86/kvm/mtrr.c             | 644 ++++++-------------------------------------------------------------------------------------------------------------------------------
>  arch/x86/kvm/vmx/vmx.c          |  12 +--
>  arch/x86/kvm/x86.c              |   1 -
>  arch/x86/kvm/x86.h              |   4 -
>  6 files changed, 36 insertions(+), 656 deletions(-)
> 
> Digging deeper through the history, this *mostly* appears to be the result of coming
> to the completely wrong conclusion for handling memtypes during EPT and VT-d enabling.
> 
> The zapping GFNs logic came from
> 
>   commit efdfe536d8c643391e19d5726b072f82964bfbdb
>   Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>   Date:   Wed May 13 14:42:27 2015 +0800
> 
>     KVM: MMU: fix MTRR update
>     
>     Currently, whenever guest MTRR registers are changed
>     kvm_mmu_reset_context is called to switch to the new root shadow page
>     table, however, it's useless since:
>     1) the cache type is not cached into shadow page's attribute so that
>        the original root shadow page will be reused
>     
>     2) the cache type is set on the last spte, that means we should sync
>        the last sptes when MTRR is changed
>     
>     This patch fixs this issue by drop all the spte in the gfn range which
>     is being updated by MTRR
> 
> which was a fix for 
> 
>   commit 0bed3b568b68e5835ef5da888a372b9beabf7544
>   Author:     Sheng Yang <sheng@linux.intel.com>
>   AuthorDate: Thu Oct 9 16:01:54 2008 +0800
>   Commit:     Avi Kivity <avi@redhat.com>
>   CommitDate: Wed Dec 31 16:51:44 2008 +0200
>   
>       KVM: Improve MTRR structure
>       
>       As well as reset mmu context when set MTRR.
> 
> (side topic, if anyone wonders why I am so particular about changelogs, the above
> is exactly why.)
> 
> Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added
> 
> +	if (mt_mask) {
> +		mt_mask = get_memory_type(vcpu, gfn) <<
> +			  kvm_x86_ops->get_mt_mask_shift();
> +		spte |= mt_mask;
> +	}
> 
> where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype
> for a given gfn.  And *very* subtly, at the time of that change, KVM *always*
> set VMX_EPT_IGMT_BIT,
> 
>         kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
>                 VMX_EPT_WRITABLE_MASK |
>                 VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
>                 VMX_EPT_IGMT_BIT);
> 
> which came in via
> 
>   commit 928d4bf747e9c290b690ff515d8f81e8ee226d97
>   Author:     Sheng Yang <sheng@linux.intel.com>
>   AuthorDate: Thu Nov 6 14:55:45 2008 +0800
>   Commit:     Avi Kivity <avi@redhat.com>
>   CommitDate: Tue Nov 11 21:00:37 2008 +0200
>   
>       KVM: VMX: Set IGMT bit in EPT entry
>       
>       There is a potential issue that, when guest using pagetable without vmexit when
>       EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
>       which would be inconsistent with host side and would cause host MCE due to
>       inconsistent cache attribute.
>       
>       The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
>       memory type to protect host (notice that all memory mapped by KVM should be WB).
> 
> Note the CommitDates!  The AuthorDates strongly suggest Sheng Yang added the whole
> IGMT thing as a bug fix for issues that were detected during EPT + VT-d + passthrough
> enabling, but Avi applied it earlier because it was a generic fix.
>
My feeling is that current memtype handling for non-coherent DMA is a compromise
between
(a) security ("qemu mappings will use writeback and guest mapping will use guest
specified memory types")
(b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.

So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
because other values like WB or WC are not preferred --
guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
"PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
And VFIO maps the MMIOs to UC in host.
(With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
 I can observe host hang if I program its EPT type to
 - WB+IPAT or
 - WC
 )

For guest RAM, it looks like honoring guest MTRRs just mitigates the page aliasing
problem.
E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
"guest PAT=UC && EPT=WB", which is not recommended in SDM.
But it still breaks (a) if guest PAT is UC.
Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
PAT. I guess :)
But I agree, in common cases, honoring guest MTRRs or not makes no big difference.
(And I'm not lucky enough to have reproduced a page-aliasing-caused MCE yet in my
environment).

For CR0.CD=1,
- w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
- w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
                                   with IPAT=0, it may break (a), but meets (b)


> Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant
> code, or rather lack thereof, is the handling of *host* MMIO.  That fix came in a
> bit later, but given the author and timing, I think it's safe to say it was all
> part of the same EPT+VT-d enabling mess.
> 
>   commit 2aaf69dcee864f4fb6402638dd2f263324ac839f
>   Author:     Sheng Yang <sheng@linux.intel.com>
>   AuthorDate: Wed Jan 21 16:52:16 2009 +0800
>   Commit:     Avi Kivity <avi@redhat.com>
>   CommitDate: Sun Feb 15 02:47:37 2009 +0200
> 
>     KVM: MMU: Map device MMIO as UC in EPT
>     
>     Software are not allow to access device MMIO using cacheable memory type, the
>     patch limit MMIO region with UC and WC(guest can select WC using PAT and
>     PCD/PWT).
> 
> In addition to the host MMIO and IGMT issues, this code was obviously never tested
> on NPT until much later, which lends further credence to my theory/argument that
> this was all the result of misdiagnosed issues.
> 
> Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> Yang was trying to resolve issues with passthrough MMIO.
> 
>  * Sheng Yang 
>   : Do you mean host(qemu) would access this memory and if we set it to guest 
>   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
>   : patch, for we encountered this in video ram when doing some experiment with 
>   : VGA assignment. 
> 
> And in the same thread, there's also what appears to be confirmation of Intel
> running into issues with Windows XP related to a guest device driver mapping
> DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> the fact that EPT and NPT both honor guest PAT by default.  /facepalm

My interpretation is that since guest PATs are in guest page tables, and with
EPT/NPT guest page tables are not shadowed, it's not easy to check guest PATs
to disallow host QEMU access to non-WB guest RAM.

This reading is supported by Avi's following words:
"Looks like a conflict between the requirements of a hypervisor 
supporting device assignment, and the memory type constraints of mapping 
everything with the same memory type.  As far as I can see, the only 
solution is not to map guest memory in the hypervisor, and do all 
accesses via dma.  This is easy for virtual disk, somewhat harder for 
virtual networking (need a dma engine or a multiqueue device).

Since qemu will only access memory on demand, we don't actually have to 
unmap guest memory, only to ensure that qemu doesn't touch it.  Things 
like live migration and page sharing won't work, but they aren't 
expected to with device assignment anyway."


>  
>  * Avi Kavity
>   : Sheng Yang wrote:
>   : > Yes... But it's easy to do with assigned devices' mmio, but what if guest 
>   : > specific some non-mmio memory's memory type? E.g. we have met one issue in 
>   : > Xen, that a assigned-device's XP driver specific one memory region as buffer, 
>   : > and modify the memory type then do DMA.
>   : >
>   : > Only map MMIO space can be first step, but I guess we can modify assigned 
>   : > memory region memory type follow guest's? 
>   : >   
>   : 
>   : With ept/npt, we can't, since the memory type is in the guest's 
>   : pagetable entries, and these are not accessible
> 
> [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com
> 
> So, for the most part, what I think happened is that 15 years ago, a few engineers
> (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough
> device MMIO by emulating *guest* MTRRs.  Except for the below case, everything since
> then has been a result of those two intertwined changes.
> 
> The one exception, which is actually yet more confirmation of all of the above,
> is the revert of Paolo's attempt at "full" virtualization of guest MTRRs:
> 
>   commit 606decd67049217684e3cb5a54104d51ddd4ef35
>   Author: Paolo Bonzini <pbonzini@redhat.com>
>   Date:   Thu Oct 1 13:12:47 2015 +0200
> 
>     Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
>     
>     This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5.
>     It was reported to cause Machine Check Exceptions (bug 104091).
> 
> ...
> 
>   commit fd717f11015f673487ffc826e59b2bad69d20fe5
>   Author: Paolo Bonzini <pbonzini@redhat.com>
>   Date:   Tue Jul 7 14:38:13 2015 +0200
> 
>     KVM: x86: apply guest MTRR virtualization on host reserved pages
>     
>     Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
>     However, the guest could prefer a different page type than UC for
>     such pages. A good example is that pass-throughed VGA frame buffer is
>     not always UC as host expected.
>     
>     This patch enables full use of virtual guest MTRRs.
> 
> I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT"
> and got the same result: machine checks, likely due to the guest MTRRs not being
> trustworthy/sane at all times.
> 
> And FWIW, Paolo also tried to enable MTRR virtualization on NPT, but that too got
> reverted.  I read through the threads, and AFAICT no one ever found a smoking gun,
> i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times
> was never definitively root-caused.
> 
>   commit fc07e76ac7ffa3afd621a1c3858a503386a14281
>   Author: Paolo Bonzini <pbonzini@redhat.com>
>   Date:   Thu Oct 1 13:20:22 2015 +0200
> 
>     Revert "KVM: SVM: use NPT page attributes"
>     
>     This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
>     Initializing the mapping from MTRR to PAT values was reported to
>     fail nondeterministically, and it also caused extremely slow boot
>     (due to caching getting disabled---bug 103321) with assigned devices.
>
> ...
> 
>   commit 3c2e7f7de3240216042b61073803b61b9b3cfb22
>   Author: Paolo Bonzini <pbonzini@redhat.com>
>   Date:   Tue Jul 7 14:32:17 2015 +0200
> 
>     KVM: SVM: use NPT page attributes
>     
>     Right now, NPT page attributes are not used, and the final page
>     attribute depends solely on gPAT (which however is not synced
>     correctly), the guest MTRRs and the guest page attributes.
>     
>     However, we can do better by mimicking what is done for VMX.
>     In the absence of PCI passthrough, the guest PAT can be ignored
>     and the page attributes can be just WB.  If passthrough is being
>     used, instead, keep respecting the guest PAT, and emulate the guest
>     MTRRs through the PAT field of the nested page tables.
>     
>     The only snag is that WP memory cannot be emulated correctly,
>     because Linux's default PAT setting only includes the other types.
> 
> In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX
> was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not
> forcing UC for host MMIO.  And while there *are* known cases where honoring guest
> MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in
> that case is to get WC instead of UC, i.e. at this point it's for performance,
> not correctness.
> 
> Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging
> proves that while KVM theoretically can do better, it's by no means necessary for
> correctness.
> 
> Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup,
> and the host typically provides guest firmware, honoring guest MTRRs is effectively
> honoring *host* userspace memtypes, which is also backwards, i.e. it would be far
> better for host userspace to communicate its desired memtypes directly to KVM (or perhaps
> indirectly via VMAs in the host kernel, just not through guest MTRRs).


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-31 10:01       ` Yan Zhao
@ 2023-10-31 15:14         ` Sean Christopherson
  2023-11-01  3:53           ` Huang, Kai
  2023-11-01  9:08           ` Yan Zhao
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2023-10-31 15:14 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Yibo Huang, kvm

On Tue, Oct 31, 2023, Yan Zhao wrote:
> On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> > On Mon, Oct 30, 2023, Yan Zhao wrote:
> > Digging deeper through the history, this *mostly* appears to be the result of coming
> > to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling.

...

> > Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added the whole
> > IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough
> > enabling, but Avi applied it earlier because it was a generic fix.
> >
> My feeling is that
> Current memtype handling for non-coherent DMA is a compromise between
> (a) security ("qemu mappings will use writeback and guest mapping will use guest
> specified memory types")
> (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.

And correctness.  E.g. accessing memory with conflicting memtypes could cause guest
data corruption, which isn't strictly the same as (a).

> So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
> because other value like WB or WC is not preferred --
> guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
> "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
> And VFIO maps the MMIOs to UC in host.
> (With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
>  I can observe host hang if I program its EPT type to
>  - WB+IPAT or
>  - WC
>  )

Yes, but all of that simply confirms that it's KVM's responsibility to map host
MMIO as UC.  The hangs you observe likely have nothing to do with memory aliasing,
and everything to do with accessing real MMIO with incompatible memtypes.

> For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing
> problem.
> E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
> "guest PAT=UC && EPT=WB", which is not recommended in SDM.
> But it still breaks (a) if guest PAT is UC.
> Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
> PAT. I guess :)

LOL, no way.  The PAT can't be disabled, and the default PAT combinations are
backwards compatible with legacy PCD+PWT.  The only way for this to provide value
is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough,
and *only* doing so on hardware with EPT.

> But I agree, in common cases, honoring guest MTRRs or not looks no big difference.
> (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my
> environment).

FWIW, I don't think that page aliasing with WC/UC actually causes machine checks.
What does result in #MC (assuming things haven't changed in the last few years)
is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC
with WB and you should see #MCs.  I suspect this is what people encountered years
ago when KVM attempted to honor guest MTRRs at all times.  E.g. the "full" MTRR
virtualization patch that got reverted deliberately allowed the guest to control
the memtype for host MMIO.

The SDM makes aliasing sound super scary, but then has footnotes where it explicitly
requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the
effective memtype is UC, then the CPU is *required* to snoop caches:

  2. The UC attribute came from the page-table or page-directory entry and
     processors are required to check their caches because the data may be cached
     due to page aliasing, which is not recommended.

Lack of snooping can effectively cause data corruption and ordering issues, but
at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing.

> For CR0_CD=1,
> - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
>                                    with IPAT=0, it may breaks (a), but meets (b)

CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
because according to the SDM, Atom CPUs don't support no-fill mode:

  3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
     caching is disabled.

Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
to emulate no-fill mode.

> > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > Yang was trying to resolve issues with passthrough MMIO.
> > 
> >  * Sheng Yang 
> >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> >   : patch, for we encountered this in video ram when doing some experiment with 
> >   : VGA assignment. 
> > 
> > And in the same thread, there's also what appears to be confirmation of Intel
> > running into issues with Windows XP related to a guest device driver mapping
> > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> 
> My interpretation is that the since guest PATs are in guest page tables,
> while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> check guest PATs  to disallow host QEMU access to non-WB guest RAM.

Ah, yeah, your interpretation makes sense.

The best idea I can think of to support things like this is to have KVM grab the
effective PAT memtype from the host userspace page tables, shove that into the
EPT/NPT memtype, and then ignore guest PAT.  I don't know if that would actually work
though.

> The credence is with Avi's following word:
> "Looks like a conflict between the requirements of a hypervisor 
> supporting device assignment, and the memory type constraints of mapping 
> everything with the same memory type.  As far as I can see, the only 
> solution is not to map guest memory in the hypervisor, and do all 
> accesses via dma.  This is easy for virtual disk, somewhat harder for 
> virtual networking (need a dma engine or a multiqueue device).
> 
> Since qemu will only access memory on demand, we don't actually have to 
> unmap guest memory, only to ensure that qemu doesn't touch it.  Things 
> like live migration and page sharing won't work, but they aren't 
> expected to with device assignment anyway."


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-30 21:52         ` Sean Christopherson
@ 2023-11-01  3:07           ` Yibo Huang
  0 siblings, 0 replies; 16+ messages in thread
From: Yibo Huang @ 2023-11-01  3:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yan Zhao, kvm

> Yes, it would be helpful to confirm what's going on.  


Sean was right. It turns out that the actual cause in my example was that when doing ioremap,
the guest OS configures the PAT based on the value of the guest MTRRs.
The key function is #pat_x_mtrr_type().

In my example, the ivshmem driver tried to ioremap the PCI BAR 2 region as WB.
However, for some reason during the VM boot process, OVMF (the BIOS I was using) 
set the corresponding guest MTRRs as UC (Interestingly, SeaBIOS doesn't do this).
Therefore,  #pat_x_mtrr_type() determined that the actual memory type was UC.
As a result, the guest OS set the corresponding PAT as UC.
This was why ivshmem was not cacheable before removing the guest MTRR entry.

After removing the guest MTRR entry, #pat_x_mtrr_type() would return WB. 
So this was why ivshmem became cacheable after removing the guest MTRR entry.

> What test(s) did you run to determine whether or not the memory was truly cacheable?
> KVM emulates the MTRR MSRs themselves, e.g. the guest can read and write MTRRs,
> and the guest will _think_ memory has a certain memtype, but that doesn't necessarily
> have any impact on the memtype used by the CPU.

Thanks for the clarification. I used a memcpy benchmark (size 500M) to determine whether or not the memory was cacheable.
When the memory was not cacheable,  the benchmark took several seconds to finish.
When the memory was cacheable, the benchmark took several milliseconds to finish.


> Heh, this isn't opinion.  Unless you're running a very specific 10-year old kernel,
> or a custom KVM build, KVM simply doesn't propagate guest MTRRs into NPT.
> 
> And unless your setup also has non-coherent DMA attached to the device, KVM doesn't
> honor guest MTRRs for EPT either (AFAICT, QEMU ivshmem doesn't require VFIO).
> 
> It's definitely possible that disabling a guest MTRR resulted in memory becoming
> cacheable, but unless there's some very, very magical code hiding, it's not because
> KVM actually fully virtualizes guest MTRRs on AMD.
> 
> E.g. before commit 9a3768191d95 ("KVM: x86/mmu: Zap SPTEs on MTRR update iff guest
> MTRRs are honored"), which hasn't even made its way to Linus (or Paolo's) tree yet,
> KVM unnecessarily zapped all NPT entries on MTRR changes.  Zapping NPT entries
> could have cleared some weird TLB state, or perhaps even wiped out buggy KVM NPT
> entries.
> 
> And on AMD, hardware virtualizes gCR0.CD, i.e. puts the caches into no-fill mode
> when guest CR0.CD=1.  But Intel CPUs completely ignore guest CR0.CD, i.e. punt it
> to software, and under QEMU, for all intents and purposes KVM never honors guest
> CR0.CD for VMX.  It seems highly unlikely that something in the guest left
> CR0.CD=1, but it's possible.  And then the guest kernel's process of toggling
> CR0.CD when doing MTRR updates would end up clearing CR0.CD and thus re-enabling
> caching.
> 
>> The thing was that I could not find any KVM code related to emulating guest
>> MTRRs on AMD platforms, which was the reason why I decided to send the
>> initial email asking about it.
>> 
>> I found this in the AMD64 Architecture Programmer’s Manual Volumes 1–5 (page
>> 553): 
>> 
>> "Table 15-19 shows how guest and host PAT types are combined into an
>> effective PAT type. When interpreting this table, recall (a) that guest and
>> host PAT types are not combined when nested paging is disabled and (b) that
>> the intent is for the VMM to use its PAT type to simulate guest MTRRs.”
>> 
>> Does this mean that AMD expects the VMM to emulate the effect of guest MTRRs
>> by altering the host PAT types?
> 
> Yes.  Which is exactly what KVM did in commit 3c2e7f7de324 ("KVM: SVM: use NPT
> page attributes"), which was reverted a few months after it was introduced.

Again, thanks for the clarification!



* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-31 15:14         ` Sean Christopherson
@ 2023-11-01  3:53           ` Huang, Kai
  2023-11-01  9:08           ` Yan Zhao
  1 sibling, 0 replies; 16+ messages in thread
From: Huang, Kai @ 2023-11-01  3:53 UTC (permalink / raw)
  To: Christopherson,, Sean, Zhao, Yan Y
  Cc: ybhuang@cs.utexas.edu, kvm@vger.kernel.org

On Tue, 2023-10-31 at 08:14 -0700, Sean Christopherson wrote:
> > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > Yang was trying to resolve issues with passthrough MMIO.
> > > 
> > >   * Sheng Yang 
> > >    : Do you mean host(qemu) would access this memory and if we set it to guest 
> > >    : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > >    : patch, for we encountered this in video ram when doing some experiment with 
> > >    : VGA assignment. 

Not sure what the "VGA assignment" means, but I doubt it was about passthrough
MMIO.  Theoretically, if it was passthrough MMIO, the host shouldn't need to
access it.  But the above text suggests the issue was the host and guest
accessing memory together.

If we are talking about the video ram here (e.g., for the framebuffer), IIUC it
isn't passthrough MMIO, but just some memory used by the guest as video ram.  KVM
needs to periodically write-protect it (and clear dirty bits) so that Qemu can
know exactly which parts of the video ram have been updated and correctly emulate
it, i.e., show it on the console of the VM.

So I guess the issue was both the host and guest accessing the video ram, while
the guest sets its memory type to WC or UC.

But IIUC the host only *reads* from the video ram and never *writes*, thus I
don't see any real problem if the host is accessing via WB and the guest is
accessing via WC or UC.

AMD APM:
	
	VMRUN and #VMEXIT flush the write combiners. This ensures that all 
	writes to WC memory by the guest are visible to the host (or
	vice-versa) regardless of memory type. (It does not ensure that
	cacheable writes by one agent are properly observed by WC reads or 
	writes by the other agent.)

> > > 
> > > And in the same thread, there's also what appears to be confirmation of Intel
> > > running into issues with Windows XP related to a guest device driver mapping
> > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm

I think Avi was not talking about guest PAT but guest MTRR, which is not honored
by NPT/EPT at all?

> > 
> > My interpretation is that the since guest PATs are in guest page tables,
> > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> 
> Ah, yeah, your interpretation makes sense.
> 
> The best idea I can think of to support things like this is to have KVM grab the
> effective PAT memtype from the host userspace page tables, shove that into the
> EPT/NPT memtype, and then ignore guest PAT.  I don't if that would actually work
> though.

I think you are assuming the "host userspace page tables" will always have the
same memory type as the guest's MTRRs?

I am not sure whether that will always be the case.  I haven't checked the Qemu
code, but theoretically, for things like video ram, the guest can have its
memory as WC/UC in its MTRRs while the host maps it as WB perfectly well, because
the host only needs to read from it.

I think we can just get rid of the guest MTRR stuff completely, i.e. have KVM
expose 0 fixed and dynamic MTRRs.  Then we don't need to "look at the memory type
from host userspace page tables", but can simply set WB in the NPT/EPT.

The reason is, as you said, NPT/EPT honor guest PAT by default.  If the guest
wants WC then it sets WC in its PAT and accesses the memory using WC.  On the
host side, passthrough MMIO should never be accessed by the host anyway, and
things like video ram the host will only read from, thus it should be safe to
map them WB in the host.

Or do we need to consider the host writing some memory using WB while it is
accessed as WC/UC in the guest?

And is the kernel direct mapping worth consideration?

Hmm.. But it's possible I am talking non-sense.. :-)


* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-10-31 15:14         ` Sean Christopherson
  2023-11-01  3:53           ` Huang, Kai
@ 2023-11-01  9:08           ` Yan Zhao
  2023-11-06 22:34             ` Sean Christopherson
  1 sibling, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2023-11-01  9:08 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Yan Zhao wrote:
> > On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 30, 2023, Yan Zhao wrote:
> > > Digging deeper through the history, this *mostly* appears to be the result of coming
> > > to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling.
> 
> ...
> 
> > > Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added the whole
> > > IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough
> > > enabling, but Avi applied it earlier because it was a generic fix.
> > >
> > My feeling is that
> > Current memtype handling for non-coherent DMA is a compromise between
> > (a) security ("qemu mappings will use writeback and guest mapping will use guest
> > specified memory types")
> > (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.
> 
> And correctness.  E.g. accessing memory with conficting memtypes could cause guest
> data corruption, which isn't strictly the same as (a).
> 
> > So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
> > because other value like WB or WC is not preferred --
> > guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
> > "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
> > And VFIO maps the MMIOs to UC in host.
> > (With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
> >  I can observe host hang if I program its EPT type to
> >  - WB+IPAT or
> >  - WC
> >  )
> 
> Yes, but all of that simply confirms that it's KVM's responsibility to map host
> MMIO as UC.  The hangs you observe likely have nothing to do with memory aliasing,
> and everything to do with accessing real MMIO with incompatible memtypes.
Yes, you are right.
For EPT type = WC, the hang case is actually because pci_iomap() maps PAT
as UC- by default, so the effective memory type becomes WC, which is wrong.
If I force the driver to map with PAT=UC, then the driver works normally even
with EPT type = WC.

> 
> > For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing
> > problem.
> > E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
> > "guest PAT=UC && EPT=WB", which is not recommended in SDM.
> > But it still breaks (a) if guest PAT is UC.
> > Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
> > PAT. I guess :)
> 
> LOL, no way.  The PAT can't be disabled, and the default PAT combinations are
> backwards compatible with legacy PCD+PWT.  The only way for this to provide value
> is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough,
> and *only* doing so on hardware with EPT.
> 
> > But I agree, in common cases, honoring guest MTRRs or not looks no big difference.
> > (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my
> > environment).
> 
> FWIW, I don't think that page aliasing with WC/UC actually causes machine checks.
> What does result in #MC (assuming things haven't changed in the last few years)
> is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC
> with WB and you should see #MCs.  I suspect this is what people encountered years
> ago when KVM attempted to honored guest MTRRs at all times.  E.g. the "full" MTRR
> virtualization patch that got reverted deliberately allowed the guest to control
> the memtype for host MMIO.
> 
> The SDM makes aliasing sound super scary, but then has footnotes where it explicitly
> requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the
> effective memtype is UC, then the CPU is *required* to snoop caches:
>
Yes, I tried the combinations below; none of them can trigger a #MC.
- effective memory type for guest access is WC, and that for host access is UC
- effective memory type for guest access is UC, and that for host access is WC
- effective memory type for guest access is UC, and that for host access is WB


>   2. The UC attribute came from the page-table or page-directory entry and
>      processors are required to check their caches because the data may be cached
>      due to page aliasing, which is not recommended.
> 
> Lack of snooping can effectively cause data corruption and ordering issues, but
> at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing.
> 
Is there no #MC even on guest RAM?
E.g. what if the guest effective memory type is UC/WC, and the host effective
memory type is WB?
(I tried on my machines with guest PAT=WC + host PAT=WB, and saw no #MC, but I'm
not sure whether I'm missing something or it's specific to my environment.)

If there's no #MC, could the EPT type of guest RAM also be set to WB (without
IPAT) even without non-coherent DMA?

> > For CR0_CD=1,
> > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> >                                    with IPAT=0, it may breaks (a), but meets (b)
> 
> CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> because according to the SDM, Atom CPUs don't support no-fill mode:
> 
>   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
>      caching is disabled.
> 
> Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> to emulate no-fill mode.
> 
> > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > Yang was trying to resolve issues with passthrough MMIO.
> > > 
> > >  * Sheng Yang 
> > >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> > >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > >   : patch, for we encountered this in video ram when doing some experiment with 
> > >   : VGA assignment. 
> > > 
> > > And in the same thread, there's also what appears to be confirmation of Intel
> > > running into issues with Windows XP related to a guest device driver mapping
> > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> > 
> > My interpretation is that the since guest PATs are in guest page tables,
> > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> 
> Ah, yeah, your interpretation makes sense.
> 
> The best idea I can think of to support things like this is to have KVM grab the
> effective PAT memtype from the host userspace page tables, shove that into the
> EPT/NPT memtype, and then ignore guest PAT.  I don't if that would actually work
> though.
Hmm, it might not work. E.g. for a GPU, some MMIO ranges are mapped as UC- while
others are mapped as WC, even though they belong to the same BAR.
I don't think the host can know in advance which one to choose.
I think the same is true for RAM ranges: the guest can memremap to a memory
type that the host doesn't know beforehand.

> 
> > The credence is with Avi's following word:
> > "Looks like a conflict between the requirements of a hypervisor 
> > supporting device assignment, and the memory type constraints of mapping 
> > everything with the same memory type.  As far as I can see, the only 
> > solution is not to map guest memory in the hypervisor, and do all 
> > accesses via dma.  This is easy for virtual disk, somewhat harder for 
> > virtual networking (need a dma engine or a multiqueue device).
> > 
> > Since qemu will only access memory on demand, we don't actually have to 
> > unmap guest memory, only to ensure that qemu doesn't touch it.  Things 
> > like live migration and page sharing won't work, but they aren't 
> > expected to with device assignment anyway."

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-01  9:08           ` Yan Zhao
@ 2023-11-06 22:34             ` Sean Christopherson
  2023-11-07  9:26               ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-11-06 22:34 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Yibo Huang, kvm

On Wed, Nov 01, 2023, Yan Zhao wrote:
> On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> > FWIW, I don't think that page aliasing with WC/UC actually causes machine checks.
> > What does result in #MC (assuming things haven't changed in the last few years)
> > is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC
> > with WB and you should see #MCs.  I suspect this is what people encountered years
> > ago when KVM attempted to honored guest MTRRs at all times.  E.g. the "full" MTRR
> > virtualization patch that got reverted deliberately allowed the guest to control
> > the memtype for host MMIO.
> > 
> > The SDM makes aliasing sound super scary, but then has footnotes where it explicitly
> > requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the
> > effective memtype is UC, then the CPU is *required* to snoop caches:
> >
> Yes, I tried below combinations, none of them can trigger #MC.
> - effective memory type for guest access is WC, and that for host access is UC
> - effective memory type for guest access is UC, and that for host access is WC
> - effective memory type for guest access is UC, and that for host access is WB
> 
> >   2. The UC attribute came from the page-table or page-directory entry and
> >      processors are required to check their caches because the data may be cached
> >      due to page aliasing, which is not recommended.
> > 
> > Lack of snooping can effectively cause data corruption and ordering issues, but
> > at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing.
> > 
> Even no #MC on guest RAM?
> E.g. what if guest effective memory type is UC/WC, and host effective memory type
> is WB?
> (I tried in my machines with guest PAT=WC + host PAT=WB, looks no #MC, but I'm not sure
> if anything I'm missing and it's only in my specific environment.)
> 
> If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> without non-coherent DMA?

No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
WC+ solves the most straightforward cases, e.g. WC+ snoops caches, and VMRUN and
#VMEXIT flush the WC buffers to ensure that guest writes are visible to the host
(and vice versa).  That may or may not be sufficient for multi-threaded use cases,
but I've no idea if there is actually anything to worry about on that front.  I
think there's also a flaw with the guest using UC, which IIUC doesn't snoop caches,
i.e. the guest could get stale data.

AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
something similar to safely let the guest control memtypes.  Arguably, KVM should
have such mechanisms anyways, e.g. to make non-coherent DMA VMs more robust.

But even then, there's still the question of why, i.e. what would be the benefit
of letting the guest control memtypes when it's not required for functional
correctness, and would that benefit outweigh the cost.

> > > For CR0_CD=1,
> > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> > >                                    with IPAT=0, it may breaks (a), but meets (b)
> > 
> > CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> > because according to the SDM, Atom CPUs don't support no-fill mode:
> > 
> >   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
> >      caching is disabled.
> > 
> > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> > to emulate no-fill mode.
> > 
> > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > > Yang was trying to resolve issues with passthrough MMIO.
> > > > 
> > > >  * Sheng Yang 
> > > >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> > > >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > > >   : patch, for we encountered this in video ram when doing some experiment with 
> > > >   : VGA assignment. 
> > > > 
> > > > And in the same thread, there's also what appears to be confirmation of Intel
> > > > running into issues with Windows XP related to a guest device driver mapping
> > > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> > > 
> > > My interpretation is that the since guest PATs are in guest page tables,
> > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> > 
> > Ah, yeah, your interpretation makes sense.
> > 
> > The best idea I can think of to support things like this is to have KVM grab the
> > effective PAT memtype from the host userspace page tables, shove that into the
> > EPT/NPT memtype, and then ignore guest PAT.  I don't if that would actually work
> > though.
> Hmm, it might not work. E.g. in GPU, some MMIOs are mapped as UC-, while some
> others as WC, even they belong to the same BAR.
> I don't think host can know which one to choose in advance.
> I think it should be also true to RAM range, guest can do memremap to a memory
> type that host doesn't know beforehand.

The goal wouldn't be to honor guest memtype; it would be to ensure correctness.
E.g. the guest can do memremap all it wants, and KVM will always ignore the guest's
memtype.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-06 22:34             ` Sean Christopherson
@ 2023-11-07  9:26               ` Yan Zhao
  2023-11-07 18:06                 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2023-11-07  9:26 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote:
> On Wed, Nov 01, 2023, Yan Zhao wrote:
> > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:

> > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> > without non-coherent DMA?
> 
> No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
> WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and
> #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT
> (and vice versa).  That may or may not be sufficient for multi-threaded use cases,
> but I've no idea if there is actually anything to worry about on that front.  I
> think there's also a flaw with guest using UC, which IIUC doesn't snoop caches,
> i.e. the guest could get stale data.
> 
> AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
> something similar to safely let the guest control memtypes.  Arguably, KVM should
> have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust.
> 
> But even then, there's still the question of why, i.e. what would be the benefit
> of letting the guest control memtypes when it's not required for functional
> correctness, and would that benefit outweight the cost.

OK, so for a coherent device, if it's assigned together with a non-coherent
device, and there's a page with host PAT=WB and guest PAT=UC, we need to
ensure host writes are flushed before guest reads/writes and guest DMA, though
there's no need to worry about #MC, right?

> 
> > > > For CR0_CD=1,
> > > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > > > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> > > >                                    with IPAT=0, it may breaks (a), but meets (b)
> > > 
> > > CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> > > because according to the SDM, Atom CPUs don't support no-fill mode:
> > > 
> > >   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
> > >      caching is disabled.
> > > 
> > > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> > > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> > > to emulate no-fill mode.
> > > 
> > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > > > Yang was trying to resolve issues with passthrough MMIO.
> > > > > 
> > > > >  * Sheng Yang 
> > > > >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> > > > >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > > > >   : patch, for we encountered this in video ram when doing some experiment with 
> > > > >   : VGA assignment. 
> > > > > 
> > > > > And in the same thread, there's also what appears to be confirmation of Intel
> > > > > running into issues with Windows XP related to a guest device driver mapping
> > > > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> > > > 
> > > > My interpretation is that the since guest PATs are in guest page tables,
> > > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > > > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> > > 
> > > Ah, yeah, your interpretation makes sense.
> > > 
> > > The best idea I can think of to support things like this is to have KVM grab the
> > > effective PAT memtype from the host userspace page tables, shove that into the
> > > EPT/NPT memtype, and then ignore guest PAT.  I don't if that would actually work
> > > though.
> > Hmm, it might not work. E.g. in GPU, some MMIOs are mapped as UC-, while some
> > others as WC, even they belong to the same BAR.
> > I don't think host can know which one to choose in advance.
> > I think it should be also true to RAM range, guest can do memremap to a memory
> > type that host doesn't know beforehand.
> 
> The goal wouldn't be to honor guest memtype, it would be to ensure correctness.
> E.g. guest can do memremap all it wants, and KVM will always ignore the guest's
> memtype.
AFAIK, some GPUs with a TTM driver may call set_pages_array_uc() to convert pages
to PAT=UC- (e.g. for doorbells). Intel i915 can also vmap a page with PAT=WC
(e.g. for some command buffers; see i915_gem_object_map_page()).
It's not easy for the host to know which guest pages the guest driver allocates
for such UC/WC conversion, and mapping such pages as "WB + ignore guest PAT"
would be a problem if the device is non-coherent.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-07  9:26               ` Yan Zhao
@ 2023-11-07 18:06                 ` Sean Christopherson
  2023-11-08  4:32                   ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-11-07 18:06 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Yibo Huang, kvm

On Tue, Nov 07, 2023, Yan Zhao wrote:
> On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Yan Zhao wrote:
> > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> 
> > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> > > without non-coherent DMA?
> > 
> > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
> > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and
> > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT
> > (and vice versa).  That may or may not be sufficient for multi-threaded use cases,
> > but I've no idea if there is actually anything to worry about on that front.  I
> > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches,
> > i.e. the guest could get stale data.
> > 
> > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
> > something similar to safely let the guest control memtypes.  Arguably, KVM should
> > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust.
> > 
> > But even then, there's still the question of why, i.e. what would be the benefit
> > of letting the guest control memtypes when it's not required for functional
> > correctness, and would that benefit outweight the cost.
> 
> Ok, so for a coherent device , if it's assigned together with a non-coherent
> device, and if there's a page with host PAT = WB and guest PAT=UC, we need to
> ensure the host write is flushed before guest read/write and guest DMA though no
> need to worry about #MC, right?

It's not even about devices; it applies to all non-MMIO memory, i.e. unless the
host forces UC for a given page, there's potential for WB vs. WC/UC issues.

> > > > > For CR0_CD=1,
> > > > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > > > > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> > > > >                                    with IPAT=0, it may breaks (a), but meets (b)
> > > > 
> > > > CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> > > > because according to the SDM, Atom CPUs don't support no-fill mode:
> > > > 
> > > >   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
> > > >      caching is disabled.
> > > > 
> > > > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> > > > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> > > > to emulate no-fill mode.
> > > > 
> > > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > > > > Yang was trying to resolve issues with passthrough MMIO.
> > > > > > 
> > > > > >  * Sheng Yang 
> > > > > >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> > > > > >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > > > > >   : patch, for we encountered this in video ram when doing some experiment with 
> > > > > >   : VGA assignment. 
> > > > > > 
> > > > > > And in the same thread, there's also what appears to be confirmation of Intel
> > > > > > running into issues with Windows XP related to a guest device driver mapping
> > > > > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > > > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> > > > > 
> > > > > My interpretation is that the since guest PATs are in guest page tables,
> > > > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > > > > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> > > > 
> > > > Ah, yeah, your interpretation makes sense.
> > > > 
> > > > The best idea I can think of to support things like this is to have KVM grab the
> > > > effective PAT memtype from the host userspace page tables, shove that into the
> > > > EPT/NPT memtype, and then ignore guest PAT.  I don't if that would actually work
> > > > though.
> > > Hmm, it might not work. E.g. in GPU, some MMIOs are mapped as UC-, while some
> > > others as WC, even they belong to the same BAR.
> > > I don't think host can know which one to choose in advance.
> > > I think it should be also true to RAM range, guest can do memremap to a memory
> > > type that host doesn't know beforehand.
> > 
> > The goal wouldn't be to honor guest memtype, it would be to ensure correctness.
> > E.g. guest can do memremap all it wants, and KVM will always ignore the guest's
> > memtype.
> AFAIK, some GPUs with TTM driver may call set_pages_array_uc() to convert pages
> to PAT=UC-(e.g. for doorbell). Intel i915 also could vmap a page with PAT=WC
> (e.g. for some command buffer, see i915_gem_object_map_page()).
> It's not easy for host to know which guest pages are allocated by guest driver
> for such UC/WC conversion, and it should have problem to map such pages as "WB +
> ignore guest PAT" if the device is non-coherent.

Ah, right, I was thinking specifically of virtio-gpu, where there is more explicit
coordination between guest and host regarding the buffers.  Drat.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-07 18:06                 ` Sean Christopherson
@ 2023-11-08  4:32                   ` Yan Zhao
  2023-11-10 17:09                     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2023-11-08  4:32 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote:
> On Tue, Nov 07, 2023, Yan Zhao wrote:
> > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote:
> > > On Wed, Nov 01, 2023, Yan Zhao wrote:
> > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> > 
> > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> > > > without non-coherent DMA?
> > > 
> > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
> > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and
> > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT
> > > (and vice versa).  That may or may not be sufficient for multi-threaded use cases,
> > > but I've no idea if there is actually anything to worry about on that front.  I
> > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches,
> > > i.e. the guest could get stale data.
> > > 
> > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
> > > something similar to safely let the guest control memtypes.  Arguably, KVM should
> > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust.
> > > 
> > > But even then, there's still the question of why, i.e. what would be the benefit
> > > of letting the guest control memtypes when it's not required for functional
> > > correctness, and would that benefit outweight the cost.
> > 
> > Ok, so for a coherent device , if it's assigned together with a non-coherent
> > device, and if there's a page with host PAT = WB and guest PAT=UC, we need to
> > ensure the host write is flushed before guest read/write and guest DMA though no
> > need to worry about #MC, right?
> 
> It's not even about devices, it applies to all non-MMIO memory, i.e. unless the
> host forces UC for a given page, there's potential for WB vs. WC/UC issues.
Do you think KVM could expose an ioctl for QEMU to call in QEMU's
invalidate_and_set_dirty() or cpu_physical_memory_set_dirty_range()?

The ioctl would do nothing if no non-coherent DMA is attached, and
call CLFLUSH otherwise.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-08  4:32                   ` Yan Zhao
@ 2023-11-10 17:09                     ` Sean Christopherson
  2023-11-13  8:07                       ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-11-10 17:09 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Yibo Huang, kvm

On Wed, Nov 08, 2023, Yan Zhao wrote:
> On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote:
> > On Tue, Nov 07, 2023, Yan Zhao wrote:
> > > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote:
> > > > On Wed, Nov 01, 2023, Yan Zhao wrote:
> > > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> > > 
> > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> > > > > without non-coherent DMA?
> > > > 
> > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
> > > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and
> > > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT
> > > > (and vice versa).  That may or may not be sufficient for multi-threaded use cases,
> > > > but I've no idea if there is actually anything to worry about on that front.  I
> > > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches,
> > > > i.e. the guest could get stale data.
> > > > 
> > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
> > > > something similar to safely let the guest control memtypes.  Arguably, KVM should
> > > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust.
> > > > 
> > > > But even then, there's still the question of why, i.e. what would be the benefit
> > > > of letting the guest control memtypes when it's not required for functional
> > > > correctness, and would that benefit outweight the cost.
> > > 
> > > Ok, so for a coherent device , if it's assigned together with a non-coherent
> > > device, and if there's a page with host PAT = WB and guest PAT=UC, we need to
> > > ensure the host write is flushed before guest read/write and guest DMA though no
> > > need to worry about #MC, right?
> > 
> > It's not even about devices, it applies to all non-MMIO memory, i.e. unless the
> > host forces UC for a given page, there's potential for WB vs. WC/UC issues.
> Do you think we can have KVM to expose an ioctl for QEMU to call in QEMU's
> invalidate_and_set_dirty() or in cpu_physical_memory_set_dirty_range()?
> 
> In this ioctl, it can do nothing if non-coherent DMA is not attached and
> call clflush otherwise.

Why add an ioctl()?  Userspace can do CLFLUSH{OPT} directly.  If it would fix a
real problem, then adding some way for userspace to query whether or not there
is non-coherent DMA would be reasonable, though that seems like something that
should be in VFIO (if it's not already there).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms
  2023-11-10 17:09                     ` Sean Christopherson
@ 2023-11-13  8:07                       ` Yan Zhao
  0 siblings, 0 replies; 16+ messages in thread
From: Yan Zhao @ 2023-11-13  8:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yibo Huang, kvm

On Fri, Nov 10, 2023 at 09:09:33AM -0800, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Yan Zhao wrote:
> > On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote:
> > > On Tue, Nov 07, 2023, Yan Zhao wrote:
> > > > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote:
> > > > > On Wed, Nov 01, 2023, Yan Zhao wrote:
> > > > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> > > > 
> > > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even
> > > > > > without non-coherent DMA?
> > > > > 
> > > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD.  AMD's
> > > > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and
> > > > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT
> > > > > (and vice versa).  That may or may not be sufficient for multi-threaded use cases,
> > > > > but I've no idea if there is actually anything to worry about on that front.  I
> > > > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches,
> > > > > i.e. the guest could get stale data.
> > > > > 
> > > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide
> > > > > something similar to safely let the guest control memtypes.  Arguably, KVM should
> > > > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust.
> > > > > 
> > > > > But even then, there's still the question of why, i.e. what would be the benefit
> > > > > of letting the guest control memtypes when it's not required for functional
> > > > > correctness, and would that benefit outweight the cost.
> > > > 
> > > > Ok, so for a coherent device , if it's assigned together with a non-coherent
> > > > device, and if there's a page with host PAT = WB and guest PAT=UC, we need to
> > > > ensure the host write is flushed before guest read/write and guest DMA though no
> > > > need to worry about #MC, right?
> > > 
> > > It's not even about devices, it applies to all non-MMIO memory, i.e. unless the
> > > host forces UC for a given page, there's potential for WB vs. WC/UC issues.
> > Do you think we can have KVM to expose an ioctl for QEMU to call in QEMU's
> > invalidate_and_set_dirty() or in cpu_physical_memory_set_dirty_range()?
> > 
> > In this ioctl, it can do nothing if non-coherent DMA is not attached and
> > call clflush otherwise.
> 
> Why add an ioctl()?  Userspace can do CLFLUSH{OPT} directly.  If it would fix a
> real problem, then adding some way for userspace to query whether or not there
> is non-coherent DMA would be reasonable, though that seems like something that
> should be in VFIO (if it's not already there).
Ah, right. I previously thought that, with an ioctl(), KVM could additionally skip
the CLFLUSH when TDP is not enabled.

But it's not a real problem so far, as I didn't manage to devise a case that
demonstrates the WB vs. WC/UC issues. (I.e. in my devised cases, even with guest
memory mapped as WC, the host still gets the latest data with WB...)

I may come back to this later if it proves to be a real issue in the future. :)

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-11-13  8:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-12 23:04 A question about how the KVM emulates the effect of guest MTRRs on AMD platforms Yibo Huang
2023-10-27 23:13 ` Sean Christopherson
2023-10-30 12:16   ` Yan Zhao
2023-10-30 19:24     ` Sean Christopherson
     [not found]       ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>
2023-10-30 21:52         ` Sean Christopherson
2023-11-01  3:07           ` Yibo Huang
2023-10-31 10:01       ` Yan Zhao
2023-10-31 15:14         ` Sean Christopherson
2023-11-01  3:53           ` Huang, Kai
2023-11-01  9:08           ` Yan Zhao
2023-11-06 22:34             ` Sean Christopherson
2023-11-07  9:26               ` Yan Zhao
2023-11-07 18:06                 ` Sean Christopherson
2023-11-08  4:32                   ` Yan Zhao
2023-11-10 17:09                     ` Sean Christopherson
2023-11-13  8:07                       ` Yan Zhao
