* A question about how the KVM emulates the effect of guest MTRRs on AMD platforms @ 2023-08-12 23:04 Yibo Huang 2023-10-27 23:13 ` Sean Christopherson 0 siblings, 1 reply; 16+ messages in thread From: Yibo Huang @ 2023-08-12 23:04 UTC (permalink / raw) To: kvm Hi KVM community, I am writing to ask how KVM emulates the effect of guest MTRRs on AMD platforms. Since there is no hardware support for guest MTRRs, the VMM can simulate their effect by altering the memory types in the EPT/NPT. From my understanding, this is exactly what KVM does on Intel platforms. More specifically, in arch/x86/kvm/mmu/spte.c #make_spte(), KVM tries to respect the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the memory type indicated by the guest MTRRs and applying it to the EPT. On Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is #vmx_get_mt_mask(), which calls #kvm_mtrr_get_guest_memory_type() to get the memory type indicated by the guest MTRRs. However, on AMD platforms, KVM does not implement #kvm_x86_ops.get_mt_mask() at all, so it just returns zero. Does this mean that KVM does not use the NPT to emulate the effect of guest MTRRs on AMD platforms? I tried but failed to find out how KVM handles this on AMD platforms. Can someone help me understand how KVM emulates the effect of guest MTRRs on AMD platforms? Thanks a lot! Best, Yibo ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-08-12 23:04 A question about how the KVM emulates the effect of guest MTRRs on AMD platforms Yibo Huang @ 2023-10-27 23:13 ` Sean Christopherson 2023-10-30 12:16 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Sean Christopherson @ 2023-10-27 23:13 UTC (permalink / raw) To: Yibo Huang; +Cc: kvm, Yan Zhao +Yan On Sat, Aug 12, 2023, Yibo Huang wrote: > Hi the KVM community, > > I am sending this email to ask about how the KVM emulates the effect of guest > MTRRs on AMD platforms. > > Since there is no hardware support for guest MTRRs, the VMM can simulate > their effect by altering the memory types in the EPT/NPT. From my > understanding, this is exactly what the KVM does for Intel platforms. More > specifically, in arch/x86/kvm/mmu/spte.c #make_spte(), the KVM tries to > respect the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the > memory types indicated by the guest MTRRs and applying that to the EPT. For > Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is > #vmx_get_mt_mask(), which calls the #kvm_mtrr_get_guest_memory_type() to get > the memory types indicated by the guest MTRRs. KVM doesn't always honor guest MTRRs, KVM only does all of this if there is a passthrough device with non-coherent DMA attached to the VM. There's actually an outstanding issue with virtio-gpu where non-coherent GPUs are flaky due to KVM not stuffing the EPT memtype because KVM isn't aware of the non-coherent DMA. > However, on AMD platforms, the KVM does not implement > #kvm_x86_ops.get_mt_mask() at all, so it just returns zero. Does it mean that > the KVM does not use the NPT to emulate the effect of guest MTRRs on AMD > platforms? I tried but failed to find out how the KVM does for AMD platforms. Correct. The short answer is that SVM+NPT obviates the need to emulate guest MTRRs for real world guest workloads. 
The shortcomings of VMX+EPT are that (a) guest CR0.CD isn't virtualized by hardware and (b) AFAIK, if the guest accesses memory with PAT=WC to memory that the host has accessed with PAT=WB (and MTRR=WB), the CPU will *not* snoop caches on the guest access. SVM on the other hand fully virtualizes CR0.CD, and NPT is quite clever in how it handles guest WC: A new memory type WC+ is introduced. WC+ is an uncacheable memory type, and combines writes in write-combining buffers like WC. Unlike WC (but like the CD memory type), accesses to WC+ memory also snoop the caches on all processors (including self-snooping the caches of the processor issuing the request) to maintain coherency. This ensures that cacheable writes are observed by WC+ accesses. And VMRUN (and #VMEXIT) flush the WC buffers, e.g. if the guest is using WB and the host is using WC, things will still work as expected (well, maybe not for cases where the host is writing and the guest is reading from different CPUs). Anyways, evidenced by the lack of bug reports over the last decade, for practical purposes snooping the caches on guest WC accesses is sufficient. Hrm, but typing all that out, I have absolutely no idea why VMX+EPT cares about guest MTRRs. Honoring guest PAT I totally get, but the guest MTRRs make no sense. E.g. I have a very hard time believing a real world guest kernel mucks with the MTRRs to setup DMA. And again, this is supported by the absence of bug reports on AMD. Yan, You've been digging into this code recently, am I forgetting something because it's late on a Friday? Or have we been making the very bad assumption that KVM code from 10+ years ago actually makes sense? I.e. for non-coherent DMA, can we delete all of the MTRR insanity and simply clear IPAT? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-27 23:13 ` Sean Christopherson @ 2023-10-30 12:16 ` Yan Zhao 2023-10-30 19:24 ` Sean Christopherson 0 siblings, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-10-30 12:16 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote: > +Yan > > On Sat, Aug 12, 2023, Yibo Huang wrote: > > Hi the KVM community, > > > > I am sending this email to ask about how the KVM emulates the effect of guest > > MTRRs on AMD platforms. > > > > Since there is no hardware support for guest MTRRs, the VMM can simulate > > their effect by altering the memory types in the EPT/NPT. From my > > understanding, this is exactly what the KVM does for Intel platforms. More > > specifically, in arch/x86/kvm/mmu/spte.c #make_spte(), the KVM tries to > > respect the guest MTRRs by calling #kvm_x86_ops.get_mt_mask() to get the > > memory types indicated by the guest MTRRs and applying that to the EPT. For > > Intel platforms, the implementation of #kvm_x86_ops.get_mt_mask() is > > #vmx_get_mt_mask(), which calls the #kvm_mtrr_get_guest_memory_type() to get > > the memory types indicated by the guest MTRRs. > > KVM doesn't always honor guest MTTRs, KVM only does all of this if there is a > passhtrough device with non-coherent DMA attached to the VM. There's actually > an outstanding issue with virtio-gpu where non-coherent GPUs are flaky due to > KVM not stuffing the EPT memtype because KVM isn't aware of the non-coherent DMA. > > > However, on AMD platforms, the KVM does not implement > > #kvm_x86_ops.get_mt_mask() at all, so it just returns zero. Does it mean that > > the KVM does not use the NPT to emulate the effect of guest MTRRs on AMD > > platforms? I tried but failed to find out how the KVM does for AMD platforms. > > Correct. 
The short answer is that SVM+NPT obviates the need to emulate guest > MTRRs for real world guest workloads. > > The shortcomings of VMX+EPT are that (a) guest CR0.CD isn't virtualized by > hardware and (b) AFAIK, if the guest accesses memory with PAT=WC to memory that > the host has accessed with PAT=WB (and MTRR=WB), the CPU will *not* snoop caches > on the guest access. > > SVM on the other hand fully virtualizes CR0.CD, and NPT is quite clever in how > it handles guest WC: > > A new memory type WC+ is introduced. WC+ is an uncacheable memory type, and > combines writes in write-combining buffers like WC. Unlike WC (but like the CD > memory type), accesses to WC+ memory also snoop the caches on all processors > (including self-snooping the caches of the processor issuing the request) to > maintain coherency. This ensures that cacheable writes are observed by WC+ accesses. > > And VMRUN (and #VMEXIT) flush the WC buffers, e.g. if the guest is using WB and > the host is using WC, things will still work as expected (well, maybe not for > cases where the host is writing and the guest is reading from different CPUs). > Anyways, evidenced by the lack of bug reports over the last decade, for practical > purposes snooping the caches on guest WC accesses is sufficient. > > Hrm, but typing all that out, I have absolutely no idea why VMX+EPT cares about > guest MTRRs. Honoring guest PAT I totally get, but the guest MTRRs make no sense. I think honoring guest MTRRs is because VMX+EPT relies on the guest to issue clflush or wbinvd in cases like EPT being WC/UC + guest PAT being WB for non-coherent DMA devices. So, in order to keep the guest driver's view of the memory type consistent with the host's effective memory type, current KVM programs the EPT with the value of the guest MTRRs. If KVM honored only guest PAT and set the EPT to WB while the guest MTRR is WC or UC, then a guest driver that thinks the effective memory type is WC or UC would not do the cache flushes correctly. 
But I don't see Linux guest drivers checking the combination of guest MTRR + guest PAT directly. Instead, when a Linux guest driver wants to program a PAT type, it checks the guest MTRRs to see if it's feasible: remap_pfn_range reserve_pfn_range memtype_reserve pat_x_mtrr_type So, before the guest programs PAT to WB, it should find that the guest MTRR is WC/UC and either get WC/UC back as the PAT type or just fail. In this regard, I think honoring only guest PAT also makes sense. > E.g. I have a very hard time believing a real world guest kernel mucks with the > MTRRs to setup DMA. And again, this is supported by the absense of bug reports > on AMD. > > > Yan, > > You've been digging into this code recently, am I forgetting something because > it's late on a Friday? Or have we been making the very bad assumption that KVM > code from 10+ years ago actually makes sense? I.e. for non-coherent DMA, can we > delete all of the MTRR insanity and simply clear IPAT? Not sure if there are guest drivers that can program PAT as WB but treat the memory type as UC. In theory, honoring guest MTRRs is the safest way. Do you think a complete analysis of all the corner cases is warranted? I'm happy if we can remove all the MTRR stuff from VMX :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-30 12:16 ` Yan Zhao @ 2023-10-30 19:24 ` Sean Christopherson [not found] ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu> 2023-10-31 10:01 ` Yan Zhao 0 siblings, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-10-30 19:24 UTC (permalink / raw) To: Yan Zhao; +Cc: Yibo Huang, kvm On Mon, Oct 30, 2023, Yan Zhao wrote: > On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote: > > E.g. I have a very hard time believing a real world guest kernel mucks with the > > MTRRs to setup DMA. And again, this is supported by the absense of bug reports > > on AMD. > > > > > > Yan, > > > > You've been digging into this code recently, am I forgetting something because > > it's late on a Friday? Or have we been making the very bad assumption that KVM > > code from 10+ years ago actually makes sense? I.e. for non-coherent DMA, can we > > delete all of the MTRR insanity and simply clear IPAT? > Not sure if there are guest drivers can program PAT as WB but treat memory type > as UC. > In theory, honoring guest MTRRs is the most safe way. > Do you think a complete analyse of all corner cases are deserved? I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach makes no sense, at all. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadow paging doesn't magically account for guest MTRRs, nor does NPT. The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that behavior by forcing UC. TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT, I think we should delete that code entirely and take a KVM errata to formally document that KVM doesn't virtualize guest MTRRs. 
In addition to solving the performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+ lines of complex code (the overlay shenanigans used for fixed MTRRs are downright mean). arch/x86/include/asm/kvm_host.h | 15 +--- arch/x86/kvm/mmu/mmu.c | 16 ---- arch/x86/kvm/mtrr.c | 644 ++++++------------------------------------------------------------------------------------------------------------------------------- arch/x86/kvm/vmx/vmx.c | 12 +-- arch/x86/kvm/x86.c | 1 - arch/x86/kvm/x86.h | 4 - 6 files changed, 36 insertions(+), 656 deletions(-) Digging deeper through the history, this *mostly* appears to be the result of coming to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling. The zapping GFNs logic came from commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. 
(side topic, if anyone wonders why I am so particular about changelogs, the above is exactly why.) Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte |= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given gfn. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | VMX_EPT_IGMT_BIT); which came in via commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggest Sheng Yang added the whole IGMT thing as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but Avi applied it earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, I think it's safe to say it was all part of the same EPT+VT-d enabling mess. 
commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, this code was obviously never tested on NPT until much later, which lends further credence to my theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. * Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks the fact that EPT and NPT both honor guest PAT by default. /facepalm * Avi Kivity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. 
[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what I think happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. And FWIW, Paolo also tried to enable MTRR virtualization on NPT, but that too got reverted. I read through the threads, and AFAICT no one ever found a smoking gun; exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. 
commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that while KVM theoretically can do better, it's by no means necessary for correctness. 
Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards, i.e. it would be far better for host userspace to communicate its desired memtypes directly to KVM (or perhaps indirectly via VMAs in the host kernel, just not through guest MTRRs). ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>]
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms [not found] ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu> @ 2023-10-30 21:52 ` Sean Christopherson 2023-11-01 3:07 ` Yibo Huang 0 siblings, 1 reply; 16+ messages in thread From: Sean Christopherson @ 2023-10-30 21:52 UTC (permalink / raw) To: Yibo Huang; +Cc: Yan Zhao, kvm On Mon, Oct 30, 2023, Yibo Huang wrote: > Well, I agree with Sean’s opinion that SVM+NPT obviates the need to emulate > guest MTRRs for real-world guest workloads. However, from my own experience, > I think KVM does emulate the effect of guest MTRRs on AMD platforms. > > Here's the reason: > 2 months ago, I was trying to attach a QEMU ivshmem device to my VMs running > on Intel and AMD machines. Since ivshmem is an emulated memory-backed > device, it should be cacheable to get the best performance. > Interestingly, I found that the memory region associated with ivshmem (PCIe > BAR 2 region) was cacheable on Intel machines, but not cacheable on AMD > machines. > After some digging, I found that this was because of the guest MTRRs - on AMD > machines, BIOS or guest OS (not sure who did this) set the memory region of > ivshmem as non-cacheable in guest MTRRs (but cacheable in guest PAT). This > was supported by the fact that ivshmem became cacheable after removing the > corresponding guest MTRRs (reg02) on AMD machines (using "echo -n disable=2 > > /proc/mtrr") > Additionally, the reason why ivshmem was cacheable on Intel machines was that > BIOS or guest OS didn’t set ivshmem as uncacheable in guest MTRRs on Intel > machines (not sure why though). What test(s) did you run to determine whether or not the memory was truly cacheable? KVM emulates the MTRR MSRs themselves, e.g. the guest can read and write MTRRs, and the guest will _think_ memory has a certain memtype, but that doesn't necessarily have any impact on the memtype used by the CPU. 
> Below is the output of “cat /proc/mtrr” on my VMs running on AMD machines. By > removing reg02, ivshmem BAR 2 region became cacheable. > > > So in my opinion, the above phenomenon suggests that KVM does honor guest > MTRRs on AMD platforms. Heh, this isn't opinion. Unless you're running a very specific 10-year old kernel, or a custom KVM build, KVM simply doesn't propagate guest MTRRs into NPT. And unless your setup also has non-coherent DMA attached to the device, KVM doesn't honor guest MTRRs for EPT either (AFAICT, QEMU ivshmem doesn't require VFIO). It's definitely possible that disabling a guest MTRR resulted in memory becoming cacheable, but unless there's some very, very magical code hiding, it's not because KVM actually fully virtualizes guest MTRRs on AMD. E.g. before commit 9a3768191d95 ("KVM: x86/mmu: Zap SPTEs on MTRR update iff guest MTRRs are honored"), which hasn't even made its way to Linus (or Paolo's) tree yet, KVM unnecessarily zapped all NPT entries on MTRR changes. Zapping NPT entries could have cleared some weird TLB state, or perhaps even wiped out buggy KVM NPT entries. And on AMD, hardware virtualizes gCR0.CD, i.e. puts the caches into no-fill mode when guest CR0.CD=1. But Intel CPUs completely ignore guest CR0.CD, i.e. punt it to software, and under QEMU, for all intents and purposes KVM never honors guest CR0.CD for VMX. It seems highly unlikely that something in the guest left CR0.CD=1, but it's possible. And then the guest kernel's process of toggling CR0.CD when doing MTRR updates would end up clearing CR0.CD and thus re-enable caching. > The thing was that I could not find any KVM code related to emulating guest > MTRRs on AMD platforms, which was the reason why I decided to send the > initial email asking about it. > > I found this in the AMD64 Architecture Programmer’s Manual Volumes 1–5 (page > 553): > > "Table 15-19 shows how guest and host PAT types are combined into an > effective PAT type. 
When interpreting this table, recall (a) that guest and > host PAT types are not combined when nested paging is disabled and (b) that > the intent is for the VMM to use its PAT type to simulate guest MTRRs.” > > Does this mean that AMD expects the VMM to emulate the effect of guest MTRRs > by altering the host PAT types? Yes. Which is exactly what KVM did in commit 3c2e7f7de324 ("KVM: SVM: use NPT page attributes"), which was reverted a few months after it was introduced. > I am not sure if I misunderstood something. But I can reproduce the example I > mentioned above if you would like to look into it. Yes, it would be helpful to confirm what's going on. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-30 21:52 ` Sean Christopherson @ 2023-11-01 3:07 ` Yibo Huang 0 siblings, 0 replies; 16+ messages in thread From: Yibo Huang @ 2023-11-01 3:07 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yan Zhao, kvm > Yes, it would be helpful to confirm what's going on. Sean was right. It turns out that the actual cause of my example was that when doing ioremap, the guest OS will configure the PAT based on the value of guest MTRRs. The key function is #pat_x_mtrr_type(). In my example, the ivshmem driver tried to ioremap the PCI BAR 2 region as WB. However, for some reason during the VM boot process, OVMF (the BIOS I was using) set the corresponding guest MTRRs as UC (Interestingly, SeaBIOS doesn't do this). Therefore, #pat_x_mtrr_type() determined that the actual memory type was UC. As a result, the guest OS set the corresponding PAT as UC. This was why ivshmem was not cacheable before removing the guest MTRR entry. After removing the guest MTRR entry, #pat_x_mtrr_type() would return WB. So this was why ivshmem became cacheable after removing the guest MTRR entry. > What test(s) did you run to determine whether or not the memory was truly cacheable? > KVM emulates the MTRR MSRs themselves, e.g. the guest can read and write MTRRs, > and the guest will _think_ memory has a certain memtype, but that doesn't necessarily > have any impact on the memtype used by the CPU. Thanks for the clarification. I used a memcpy benchmark (size 500M) to determine whether or not the memory was cacheable. When the memory was not cacheable, the benchmark took several seconds to finish. When the memory was cacheable, the benchmark took several milliseconds to finish. > Heh, this isn't opinion. Unless you're running a very specific 10-year old kernel, > or a custom KVM build, KVM simply out doesn't propagate guest MTRRs into NPT. 
> > And unless your setup also has non-coherent DMA attached to the device, KVM doesn't > honor guest MTRRs for EPT either (AFAICT, QEMU ivshmem doesn't require VFIO). > > It's definitely possible that disabling a guest MTRR resulted in memory becoming > cacheable, but unless there's some very, very magical code hiding, it's not because > KVM actually fully virtualizes guest MTRRs on AMD. > > E.g. before commit 9a3768191d95 ("KVM: x86/mmu: Zap SPTEs on MTRR update iff guest > MTRRs are honored"), which hasn't even made its way to Linus (or Paolo's) tree yet, > KVM unnecessarily zapped all NPT entries on MTRR changes. Zapping NPT entries > could have cleared some weird TLB state, or perhaps even wiped out buggy KVM NPT > entries. > > And on AMD, hardware virtualizes gCR0.CD, i.e. puts the caches into no-fill mode > when guest CR0.CD=1. But Intel CPUs completely ignore guest CR0.CD, i.e. punt it > to software, and under QEMU, for all intents and purposes KVM never honors guest > CR0.CD for VMX. It's seems highly quite unlikely that something in the guest left > CR0.CD=1, but it's possible. And then the guest kernel's process of toggling > CR0.CD when doing MTRR updates would end up clearing CR0.CD and thus re-enable > caching. > >> The thing was that I could not find any KVM code related to emulating guest >> MTRRs on AMD platforms, which was the reason why I decided to send the >> initial email asking about it. >> >> I found this in the AMD64 Architecture Programmer’s Manual Volumes 1–5 (page >> 553): >> >> "Table 15-19 shows how guest and host PAT types are combined into an >> effective PAT type. When interpreting this table, recall (a) that guest and >> host PAT types are not combined when nested paging is disabled and (b) that >> the intent is for the VMM to use its PAT type to simulate guest MTRRs.” >> >> Does this mean that AMD expects the VMM to emulate the effect of guest MTRRs >> by altering the host PAT types? > > Yes. 
Which is exactly what KVM did in commit 3c2e7f7de324 ("KVM: SVM: use NPT > page attributes"), which was a reverted a few months after it was introduced. Again, thanks for the clarification! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-30 19:24 ` Sean Christopherson [not found] ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu> @ 2023-10-31 10:01 ` Yan Zhao 2023-10-31 15:14 ` Sean Christopherson 1 sibling, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-10-31 10:01 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote: > On Mon, Oct 30, 2023, Yan Zhao wrote: > > On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote: > > > E.g. I have a very hard time believing a real world guest kernel mucks with the > > > MTRRs to setup DMA. And again, this is supported by the absense of bug reports > > > on AMD. > > > > > > > > > Yan, > > > > > > You've been digging into this code recently, am I forgetting something because > > > it's late on a Friday? Or have we been making the very bad assumption that KVM > > > code from 10+ years ago actually makes sense? I.e. for non-coherent DMA, can we > > > delete all of the MTRR insanity and simply clear IPAT? > > Not sure if there are guest drivers can program PAT as WB but treat memory type > > as UC. > > In theory, honoring guest MTRRs is the most safe way. > > Do you think a complete analyse of all corner cases are deserved? > > I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach > make no sense, at all. From a hardware virtualization perspective of guest MTRRs, > there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically > account for guest MTRRs, nor does NPT. > > The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into > no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that > behavior by forcing UC. 
> > TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT, > I think we should delete that code entirely and take a KVM errata to formally > document that KVM doesn't virtualize guest MTRRs. In addition to solving the > performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+ > lines of complex code (the overlay shenanigans used for fixed MTRRs are downright > mean). > > arch/x86/include/asm/kvm_host.h | 15 +--- > arch/x86/kvm/mmu/mmu.c | 16 ---- > arch/x86/kvm/mtrr.c | 644 ++++++------------------------------------------------------------------------------------------------------------------------------- > arch/x86/kvm/vmx/vmx.c | 12 +-- > arch/x86/kvm/x86.c | 1 - > arch/x86/kvm/x86.h | 4 - > 6 files changed, 36 insertions(+), 656 deletions(-) > > Digging deeper through the history, this *mostly* appears to be the result of coming > to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling. > > The zapping GFNs logic came from > > commit efdfe536d8c643391e19d5726b072f82964bfbdb > Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> > Date: Wed May 13 14:42:27 2015 +0800 > > KVM: MMU: fix MTRR update > > Currently, whenever guest MTRR registers are changed > kvm_mmu_reset_context is called to switch to the new root shadow page > table, however, it's useless since: > 1) the cache type is not cached into shadow page's attribute so that > the original root shadow page will be reused > > 2) the cache type is set on the last spte, that means we should sync > the last sptes when MTRR is changed > > This patch fixs this issue by drop all the spte in the gfn range which > is being updated by MTRR > > which was a fix for > > commit 0bed3b568b68e5835ef5da888a372b9beabf7544 > Author: Sheng Yang <sheng@linux.intel.com> > AuthorDate: Thu Oct 9 16:01:54 2008 +0800 > Commit: Avi Kivity <avi@redhat.com> > CommitDate: Wed Dec 31 16:51:44 2008 +0200 > > KVM: Improve MTRR structure > > As well as reset mmu 
context when set MTRR. > > (side topic, if anyone wonders why I am so particular about changelogs, the above > is exactly why.) > > Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added > > + if (mt_mask) { > + mt_mask = get_memory_type(vcpu, gfn) << > + kvm_x86_ops->get_mt_mask_shift(); > + spte |= mt_mask; > + } > > where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype > for a given gfn. And *very* subtly, at the time of that change, KVM *always* > set VMX_EPT_IGMT_BIT, > > kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | > VMX_EPT_WRITABLE_MASK | > VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | > VMX_EPT_IGMT_BIT); > > which came in via > > commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 > Author: Sheng Yang <sheng@linux.intel.com> > AuthorDate: Thu Nov 6 14:55:45 2008 +0800 > Commit: Avi Kivity <avi@redhat.com> > CommitDate: Tue Nov 11 21:00:37 2008 +0200 > > KVM: VMX: Set IGMT bit in EPT entry > > There is a potential issue that, when guest using pagetable without vmexit when > EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, > which would be inconsistent with host side and would cause host MCE due to > inconsistent cache attribute. > > The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default > memory type to protect host (notice that all memory mapped by KVM should be WB). > > Note the CommitDates! The AuthorDates strongly suggest Sheng Yang added the whole > IGMT thing as a bug fix for issues that were detected during EPT + VT-d + passthrough > enabling, but Avi applied it earlier because it was a generic fix. > My feeling is that the current memtype handling for non-coherent DMA is a compromise between (a) security ("qemu mappings will use writeback and guest mapping will use guest specified memory types") (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable. 
So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable, because other values like WB or WC are not preferred -- guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or "PAT=UC && EPT=WC" are not preferred according to the SDM due to page aliasing. And VFIO maps the MMIOs to UC in host. (With pass-through GPU in my env, the MMIOs' guest MTRR is UC, I can observe host hang if I program its EPT type to - WB+IPAT or - WC ) For guest RAM, it looks like honoring guest MTRRs just mitigates the page aliasing problem. E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid "guest PAT=UC && EPT=WB", which is not recommended in the SDM. But it still breaks (a) if guest PAT is UC. Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable PAT. I guess :) But I agree, in common cases, honoring guest MTRRs or not makes no big difference. (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my environment). For CR0_CD=1, - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); with IPAT=0, it may break (a), but meets (b) > Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant > code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a > bit later, but given the author and timing, I think it's safe to say it was all > part of the same EPT+VT-d enabling mess. > > commit 2aaf69dcee864f4fb6402638dd2f263324ac839f > Author: Sheng Yang <sheng@linux.intel.com> > AuthorDate: Wed Jan 21 16:52:16 2009 +0800 > Commit: Avi Kivity <avi@redhat.com> > CommitDate: Sun Feb 15 02:47:37 2009 +0200 > > KVM: MMU: Map device MMIO as UC in EPT > > Software are not allow to access device MMIO using cacheable memory type, the > patch limit MMIO region with UC and WC(guest can select WC using PAT and > PCD/PWT). 
> > In addition to the host MMIO and IGMT issues, this code was obviously never tested > on NPT until much later, which lends further credence to my theory/argument that > this was all the result of misdiagnosed issues. > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > Yang was trying to resolve issues with passthrough MMIO. > > * Sheng Yang > : Do you mean host(qemu) would access this memory and if we set it to guest > : MTRR, host access would be broken? We would cover this in our shadow MTRR > : patch, for we encountered this in video ram when doing some experiment with > : VGA assignment. > > And in the same thread, there's also what appears to be confirmation of Intel > running into issues with Windows XP related to a guest device driver mapping > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > the fact that EPT and NPT both honor guest PAT by default. /facepalm My interpretation is that since guest PATs are in guest page tables, while with EPT/NPT, guest page tables are not shadowed, it's not easy to check guest PATs to disallow host QEMU access to non-WB guest RAM. The credence comes from Avi's following words: "Looks like a conflict between the requirements of a hypervisor supporting device assignment, and the memory type constraints of mapping everything with the same memory type. As far as I can see, the only solution is not to map guest memory in the hypervisor, and do all accesses via dma. This is easy for virtual disk, somewhat harder for virtual networking (need a dma engine or a multiqueue device). Since qemu will only access memory on demand, we don't actually have to unmap guest memory, only to ensure that qemu doesn't touch it. Things like live migration and page sharing won't work, but they aren't expected to with device assignment anyway." > > * Avi Kivity > : Sheng Yang wrote: > : > Yes... 
But it's easy to do with assigned devices' mmio, but what if guest > : > specific some non-mmio memory's memory type? E.g. we have met one issue in > : > Xen, that a assigned-device's XP driver specific one memory region as buffer, > : > and modify the memory type then do DMA. > : > > : > Only map MMIO space can be first step, but I guess we can modify assigned > : > memory region memory type follow guest's? > : > > : > : With ept/npt, we can't, since the memory type is in the guest's > : pagetable entries, and these are not accessible > > [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com > > So, for the most part, what I think happened is that 15 years ago, a few engineers > (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough > device MMIO by emulating *guest* MTRRs. Except for the below case, everything since > then has been a result of those two intertwined changes. > > The one exception, which is actually yet more confirmation of all of the above, > is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: > > commit 606decd67049217684e3cb5a54104d51ddd4ef35 > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Thu Oct 1 13:12:47 2015 +0200 > > Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" > > This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. > It was reported to cause Machine Check Exceptions (bug 104091). > > ... > > commit fd717f11015f673487ffc826e59b2bad69d20fe5 > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Tue Jul 7 14:38:13 2015 +0200 > > KVM: x86: apply guest MTRR virtualization on host reserved pages > > Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. > However, the guest could prefer a different page type than UC for > such pages. A good example is that pass-throughed VGA frame buffer is > not always UC as host expected. > > This patch enables full use of virtual guest MTRRs. > > I.e. 
Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" > and got the same result: machine checks, likely due to the guest MTRRs not being > trustworthy/sane at all times. > > And FWIW, Paolo also tried to enable MTRR virtualization on NPT, but that too got > reverted. I read through the threads, and AFAICT no one ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times > doesn't appear to have a definitive root cause. > > commit fc07e76ac7ffa3afd621a1c3858a503386a14281 > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Thu Oct 1 13:20:22 2015 +0200 > > Revert "KVM: SVM: use NPT page attributes" > > This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. > Initializing the mapping from MTRR to PAT values was reported to > fail nondeterministically, and it also caused extremely slow boot > (due to caching getting disabled---bug 103321) with assigned devices. > > ... > > commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Tue Jul 7 14:32:17 2015 +0200 > > KVM: SVM: use NPT page attributes > > Right now, NPT page attributes are not used, and the final page > attribute depends solely on gPAT (which however is not synced > correctly), the guest MTRRs and the guest page attributes. > > However, we can do better by mimicking what is done for VMX. > In the absence of PCI passthrough, the guest PAT can be ignored > and the page attributes can be just WB. If passthrough is being > used, instead, keep respecting the guest PAT, and emulate the guest > MTRRs through the PAT field of the nested page tables. > > The only snag is that WP memory cannot be emulated correctly, > because Linux's default PAT setting only includes the other types. > > In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX > was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not > forcing UC for host MMIO. 
And while there *are* known cases where honoring guest > MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in > that case is to get WC instead of UC, i.e. at this point it's for performance, > not correctness. > > Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging > proves that while KVM theoretically can do better, it's by no means necessary for > correctness. > > Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup, > and the host typically provides guest firmware, honoring guest MTRRs is effectively > honoring *host* userspace memtypes, which is also backwards, i.e. it would be far > better for host userspace to communicate its desires directly to KVM (or perhaps > indirectly via VMAs in the host kernel, just not through guest MTRRs). ^ permalink raw reply [flat|nested] 16+ messages in thread
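The guest MTRR lookup referenced throughout this thread (kvm_mtrr_get_guest_memory_type(), and the old get_memory_type() quoted above) conceptually reduces to a walk over the variable-range MTRR MSR pairs. A minimal standalone sketch under those assumptions -- it ignores the fixed ranges below 1MiB, the overlap-resolution rules, and the MTRRdefType enable bits, and the struct and helper names are hypothetical, not the actual KVM ones:

```c
#include <stdint.h>

enum { MTRR_UC = 0, MTRR_WC = 1, MTRR_WT = 4, MTRR_WP = 5, MTRR_WB = 6 };

struct var_mtrr { uint64_t base, mask; };  /* IA32_MTRR_PHYSBASEn / PHYSMASKn pair */

/*
 * Conceptual variable-range MTRR lookup: a range matches an address when
 * (addr & mask) == (base & mask). Bit 11 of the mask MSR is the valid bit,
 * and bits 7:0 of the base MSR hold the memory type. Returns the default
 * type (from MTRRdefType) when no range hits.
 */
static int var_mtrr_type(const struct var_mtrr *m, int n, uint64_t addr,
                         int def_type)
{
    for (int i = 0; i < n; i++) {
        uint64_t mask = m[i].mask & ~0xfffULL;

        if (!(m[i].mask & (1ULL << 11)))    /* valid bit clear: range disabled */
            continue;
        if ((addr & mask) == ((m[i].base & ~0xfffULL) & mask))
            return (int)(m[i].base & 0xff); /* memtype lives in base[7:0] */
    }
    return def_type;
}
```

The "overlay shenanigans" for fixed MTRRs that Sean mentions deleting are the part deliberately left out here; they add most of the complexity in the real code.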
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-31 10:01 ` Yan Zhao @ 2023-10-31 15:14 ` Sean Christopherson 2023-11-01 3:53 ` Huang, Kai 2023-11-01 9:08 ` Yan Zhao 0 siblings, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-10-31 15:14 UTC (permalink / raw) To: Yan Zhao; +Cc: Yibo Huang, kvm On Tue, Oct 31, 2023, Yan Zhao wrote: > On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote: > > On Mon, Oct 30, 2023, Yan Zhao wrote: > > Digging deeper through the history, this *mostly* appears to be the result of coming > > to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling. ... > > Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole > > IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough > > enabling, but Avi applied it earlier because it was a generic fix. > > > My feeling is that > Current memtype handling for non-coherent DMA is a compromise between > (a) security ("qemu mappings will use writeback and guest mapping will use guest > specified memory types") > (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable. And correctness. E.g. accessing memory with conflicting memtypes could cause guest data corruption, which isn't strictly the same as (a). > So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable, > because other value like WB or WC is not preferred -- > guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or > "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing. > And VFIO maps the MMIOs to UC in host. > (With pass-through GPU in my env, the MMIOs' guest MTRR is UC, > I can observe host hang if I program its EPT type to > - WB+IPAT or > - WC > ) Yes, but all of that simply confirms that it's KVM's responsibility to map host MMIO as UC. 
The hangs you observe likely have nothing to do with memory aliasing, and everything to do with accessing real MMIO with incompatible memtypes. > For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing > problem. > E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid > "guest PAT=UC && EPT=WB", which is not recommended in SDM. > But it still breaks (a) if guest PAT is UC. > Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable > PAT. I guess :) LOL, no way. The PAT can't be disabled, and the default PAT combinations are backwards compatible with legacy PCD+PWT. The only way for this to provide value is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough, and *only* doing so on hardware with EPT. > But I agree, in common cases, honoring guest MTRRs or not looks no big difference. > (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my > environment). FWIW, I don't think that page aliasing with WC/UC actually causes machine checks. What does result in #MC (assuming things haven't changed in the last few years) is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC with WB and you should see #MCs. I suspect this is what people encountered years ago when KVM attempted to honor guest MTRRs at all times. E.g. the "full" MTRR virtualization patch that got reverted deliberately allowed the guest to control the memtype for host MMIO. The SDM makes aliasing sound super scary, but then has footnotes where it explicitly requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the effective memtype is UC, then the CPU is *required* to snoop caches: 2. The UC attribute came from the page-table or page-directory entry and processors are required to check their caches because the data may be cached due to page aliasing, which is not recommended. 
Lack of snooping can effectively cause data corruption and ordering issues, but at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing. > For CR0_CD=1, > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). > - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); > with IPAT=0, it may breaks (a), but meets (b) CR0.CD=1 is a mess above and beyond memtypes. Huh. It's even worse than I thought, because according to the SDM, Atom CPUs don't support no-fill mode: 3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor, caching is disabled. Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK, it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way to emulate no-fill mode. > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > Yang was trying to resolve issues with passthrough MMIO. > > > > * Sheng Yang > > : Do you mean host(qemu) would access this memory and if we set it to guest > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > : patch, for we encountered this in video ram when doing some experiment with > > : VGA assignment. > > > > And in the same thread, there's also what appears to be confirmation of Intel > > running into issues with Windows XP related to a guest device driver mapping > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > the fact that EPT and NPT both honor guest PAT by default. /facepalm > > My interpretation is that the since guest PATs are in guest page tables, > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > check guest PATs to disallow host QEMU access to non-WB guest RAM. Ah, yeah, your interpretation makes sense. 
The best idea I can think of to support things like this is to have KVM grab the effective PAT memtype from the host userspace page tables, shove that into the EPT/NPT memtype, and then ignore guest PAT. I don't know if that would actually work though. > The credence is with Avi's following word: > "Looks like a conflict between the requirements of a hypervisor > supporting device assignment, and the memory type constraints of mapping > everything with the same memory type. As far as I can see, the only > solution is not to map guest memory in the hypervisor, and do all > accesses via dma. This is easy for virtual disk, somewhat harder for > virtual networking (need a dma engine or a multiqueue device). > > Since qemu will only access memory on demand, we don't actually have to > unmap guest memory, only to ensure that qemu doesn't touch it. Things > like live migration and page sharing won't work, but they aren't > expected to with device assignment anyway." ^ permalink raw reply [flat|nested] 16+ messages in thread
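For reference, the aliasing combinations being debated in this exchange follow from the SDM's combining rules: when the EPT "ignore PAT" bit is clear, the EPT memtype plays the MTRR's role and is combined with the guest PAT type (SDM Vol. 3, Table 11-7). A sketch of the three rows relevant to this thread, using the standard PAT encodings; the SDM's "UC*" entries are simplified to plain UC here:

```c
/* Standard x86 memory-type encodings; 7 is the PAT-only UC- ("UC minus"). */
enum { UC = 0, WC = 1, WT = 4, WP = 5, WB = 6, UC_MINUS = 7 };

/*
 * Effective memtype with the EPT "ignore PAT" bit clear: the EPT memtype
 * stands in for the MTRR type in the SDM's MTRR x PAT combining table.
 * Only the UC/WC/WB rows are modeled; -1 means "not modeled here".
 */
static int effective_memtype(int ept_mt, int guest_pat)
{
    switch (ept_mt) {
    case UC:
        return guest_pat == WC ? WC : UC;  /* PAT=WC overrides UC: the VGA frame-buffer trick */
    case WC:
        if (guest_pat == UC || guest_pat == WT || guest_pat == WP)
            return UC;                     /* SDM lists WT/WP here as "UC*" */
        return WC;                         /* UC-, WC, WB all end up WC */
    case WB:
        return guest_pat == UC_MINUS ? UC : guest_pat;  /* guest PAT wins under WB */
    default:
        return -1;
    }
}
```

Note how PAT=WC punches through even a UC entry, and how a WC entry turns the kernel's default UC- ioremap type into effective WC -- which bears on the pci_iomap() observation later in the thread.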
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-31 15:14 ` Sean Christopherson @ 2023-11-01 3:53 ` Huang, Kai 2023-11-01 9:08 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Huang, Kai @ 2023-11-01 3:53 UTC (permalink / raw) To: Christopherson,, Sean, Zhao, Yan Y Cc: ybhuang@cs.utexas.edu, kvm@vger.kernel.org On Tue, 2023-10-31 at 08:14 -0700, Sean Christopherson wrote: > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > > Yang was trying to resolve issues with passthrough MMIO. > > > > > > * Sheng Yang > > > : Do you mean host(qemu) would access this memory and if we set it to guest > > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > > : patch, for we encountered this in video ram when doing some experiment with > > > : VGA assignment. Not sure what the "VGA assignment" means, but I doubt whether it was about passthrough MMIO. Theoretically if it was passthrough MMIO, the host shouldn't need to access it. But the above text suggests the issue was host/guest accessing memory together. If we are talking about the video ram here (e.g., for framebuffer) IIUC it isn't passthrough MMIO, but just some memory that is used by the guest as video ram. KVM needs to periodically write protect it (and clear dirty) so that Qemu can be aware of exactly what video ram has been updated to correctly emulate the video ram, i.e., showing on the console of the VM. So I guess the issue was both host and guest access of video ram, while the guest sets its memory type to WC or UC. But IIUC the host only *reads* from video ram, but never *writes*, thus I don't see there's any real problem if the host is accessing via WB and the guest is accessing via WC or UC. AMD APM: VMRUN and #VMEXIT flush the write combiners. This ensures that all writes to WC memory by the guest are visible to the host (or vice-versa) regardless of memory type. 
(It does not ensure that cacheable writes by one agent are properly observed by WC reads or writes by the other agent.) > > > > > > And in the same thread, there's also what appears to be confirmation of Intel > > > running into issues with Windows XP related to a guest device driver mapping > > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > > the fact that EPT and NPT both honor guest PAT by default. /facepalm I think Avi was not talking about guest PAT but guest MTRR, which is not honored by NPT/EPT at all? > > > > My interpretation is that the since guest PATs are in guest page tables, > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > > check guest PATs to disallow host QEMU access to non-WB guest RAM. > > Ah, yeah, your interpretation makes sense. > > The best idea I can think of to support things like this is to have KVM grab the > effective PAT memtype from the host userspace page tables, shove that into the > EPT/NPT memtype, and then ignore guest PAT. I don't if that would actually work > though. I think you are assuming "host userspace page tables" will always have the same memory type as guest's MTRR? I am not sure that will always be the case. I haven't checked the Qemu code, but theoretically, for things like video ram, the guest can have its memory as WC/UC in MTRR but the host can map it as WB perfectly well, because the host only needs to read from it. I think we can just get rid of the guest MTRR stuff completely, i.e. force KVM to expose 0 fixed and dynamic MTRRs. Then we don't need to "look at memory type from host userspace page tables", but simply set WB in the NPT/EPT. The reason is, as you said, that NPT/EPT honor guest PAT by default. If the guest wants WC then it sets WC in its PAT and will access the memory using WC. 
On the host side, passthrough MMIO should never be accessed by the host anyway, and for things like video ram the host will only read from it, thus it should be safe to map it WB in the host. Or do we need to consider the host writing some memory using WB while it is accessed as WC/UC in the guest? And is the kernel direct mapping worth consideration? Hmm.. But it's possible I am talking nonsense.. :-) ^ permalink raw reply [flat|nested] 16+ messages in thread
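Kai's "expose 0 MTRRs" suggestion amounts to reporting an IA32_MTRRCAP value with VCNT=0 and FIX=0. A sketch of what that would look like, using the architectural bit layout of the MSR; the helper name is hypothetical:

```c
#include <stdint.h>

/*
 * IA32_MTRRCAP bit layout (architectural): VCNT (number of variable
 * ranges) in bits 7:0, FIX (fixed ranges supported) in bit 8, WC
 * (write-combining supported) in bit 10. "Exposing zero MTRRs" means
 * reporting VCNT=0 and FIX=0.
 */
#define MTRRCAP_VCNT(cap)  ((cap) & 0xffULL)
#define MTRRCAP_FIX        (1ULL << 8)
#define MTRRCAP_WC         (1ULL << 10)

/* Hypothetical helper: the MTRRCAP value a "no MTRRs" vCPU would report.
 * WC stays advertised, since the guest would still get write-combining
 * through its PAT rather than through MTRRs. */
static uint64_t mtrrcap_no_mtrrs(void)
{
    return MTRRCAP_WC;
}
```

Whether existing guests tolerate a CPU that advertises MTRR support in CPUID but reports no usable ranges is a separate compatibility question the thread does not settle.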
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-10-31 15:14 ` Sean Christopherson 2023-11-01 3:53 ` Huang, Kai @ 2023-11-01 9:08 ` Yan Zhao 2023-11-06 22:34 ` Sean Christopherson 1 sibling, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-11-01 9:08 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > On Tue, Oct 31, 2023, Yan Zhao wrote: > > On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote: > > > On Mon, Oct 30, 2023, Yan Zhao wrote: > > > Digging deeper through the history, this *mostly* appears to be the result of coming > > > to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling. > > ... > > > > Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole > > > IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough > > > enabling, but Avi applied it earlier because it was a generic fix. > > > > > My feeling is that > > Current memtype handling for non-coherent DMA is a compromise between > > (a) security ("qemu mappings will use writeback and guest mapping will use guest > > specified memory types") > > (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable. > > And correctness. E.g. accessing memory with conficting memtypes could cause guest > data corruption, which isn't strictly the same as (a). > > > So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable, > > because other value like WB or WC is not preferred -- > > guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or > > "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing. > > And VFIO maps the MMIOs to UC in host. 
> > (With pass-through GPU in my env, the MMIOs' guest MTRR is UC, > > I can observe host hang if I program its EPT type to > > - WB+IPAT or > > - WC > > ) > > Yes, but all of that simply confirms that it's KVM's responsibility to map host > MMIO as UC. The hangs you observe likely have nothing to do with memory aliasing, > and everything to do with accessing real MMIO with incompatible memtypes. Yes, you are right. For EPT type = WC, the hang case is actually because pci_iomap() maps PAT as UC- by default, then the effective memory type is WC, which is wrong. If I force the driver to map with PAT=UC, then the driver works normally even with EPT type = WC. > > > For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing > > problem. > > E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid > > "guest PAT=UC && EPT=WB", which is not recommended in SDM. > > But it still breaks (a) if guest PAT is UC. > > Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable > > PAT. I guess :) > > LOL, no way. The PAT can't be disabled, and the default PAT combinations are > backwards compatible with legacy PCD+PWT. The only way for this to provide value > is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough, > and *only* doing so on hardware with EPT. > > > But I agree, in common cases, honoring guest MTRRs or not looks no big difference. > > (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my > > environment). > > FWIW, I don't think that page aliasing with WC/UC actually causes machine checks. > What does result in #MC (assuming things haven't changed in the last few years) > is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC > with WB and you should see #MCs. I suspect this is what people encountered years > ago when KVM attempted to honored guest MTRRs at all times. E.g. 
the "full" MTRR > virtualization patch that got reverted deliberately allowed the guest to control > the memtype for host MMIO. > > The SDM makes aliasing sound super scary, but then has footnotes where it explicitly > requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the > effective memtype is UC, then the CPU is *required* to snoop caches: > Yes, I tried below combinations, none of them can trigger #MC. - effective memory type for guest access is WC, and that for host access is UC - effective memory type for guest access is UC, and that for host access is WC - effective memory type for guest access is UC, and that for host access is WB > 2. The UC attribute came from the page-table or page-directory entry and > processors are required to check their caches because the data may be cached > due to page aliasing, which is not recommended. > > Lack of snooping can effectively cause data corruption and ordering issues, but > at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing. > Even no #MC on guest RAM? E.g. what if guest effective memory type is UC/WC, and host effective memory type is WB? (I tried in my machines with guest PAT=WC + host PAT=WB, looks no #MC, but I'm not sure if anything I'm missing and it's only in my specific environment.) If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even without non-coherent DMA? > > For CR0_CD=1, > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). > > - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); > > with IPAT=0, it may breaks (a), but meets (b) > > CR0.CD=1 is a mess above and beyond memtypes. Huh. It's even worse than I thought, > because according to the SDM, Atom CPUs don't support no-fill mode: > > 3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor, > caching is disabled. 
> > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK, > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way > to emulate no-fill mode. > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > > Yang was trying to resolve issues with passthrough MMIO. > > > > > > * Sheng Yang > > > : Do you mean host(qemu) would access this memory and if we set it to guest > > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > > : patch, for we encountered this in video ram when doing some experiment with > > > : VGA assignment. > > > > > > And in the same thread, there's also what appears to be confirmation of Intel > > > running into issues with Windows XP related to a guest device driver mapping > > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > > the fact that EPT and NPT both honor guest PAT by default. /facepalm > > > > My interpretation is that the since guest PATs are in guest page tables, > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > > check guest PATs to disallow host QEMU access to non-WB guest RAM. > > Ah, yeah, your interpretation makes sense. > > The best idea I can think of to support things like this is to have KVM grab the > effective PAT memtype from the host userspace page tables, shove that into the > EPT/NPT memtype, and then ignore guest PAT. I don't if that would actually work > though. Hmm, it might not work. E.g. in a GPU, some MMIOs are mapped as UC-, while some others as WC, even though they belong to the same BAR. I don't think the host can know which one to choose in advance. I think the same is also true for RAM ranges: the guest can do memremap to a memory type that the host doesn't know beforehand. 
> > > The credence is with Avi's following word: > > "Looks like a conflict between the requirements of a hypervisor > > supporting device assignment, and the memory type constraints of mapping > > everything with the same memory type. As far as I can see, the only > > solution is not to map guest memory in the hypervisor, and do all > > accesses via dma. This is easy for virtual disk, somewhat harder for > > virtual networking (need a dma engine or a multiqueue device). > > > > Since qemu will only access memory on demand, we don't actually have to > > unmap guest memory, only to ensure that qemu doesn't touch it. Things > > like live migration and page sharing won't work, but they aren't > > expected to with device assignment anyway." ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-01 9:08 ` Yan Zhao @ 2023-11-06 22:34 ` Sean Christopherson 2023-11-07 9:26 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Sean Christopherson @ 2023-11-06 22:34 UTC (permalink / raw) To: Yan Zhao; +Cc: Yibo Huang, kvm On Wed, Nov 01, 2023, Yan Zhao wrote: > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > FWIW, I don't think that page aliasing with WC/UC actually causes machine checks. > > What does result in #MC (assuming things haven't changed in the last few years) > > is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC > > with WB and you should see #MCs. I suspect this is what people encountered years > > ago when KVM attempted to honor guest MTRRs at all times. E.g. the "full" MTRR > > virtualization patch that got reverted deliberately allowed the guest to control > > the memtype for host MMIO. > > > > The SDM makes aliasing sound super scary, but then has footnotes where it explicitly > > requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the > > effective memtype is UC, then the CPU is *required* to snoop caches: > > > Yes, I tried the below combinations; none of them can trigger #MC. > - effective memory type for guest access is WC, and that for host access is UC > - effective memory type for guest access is UC, and that for host access is WC > - effective memory type for guest access is UC, and that for host access is WB > > > 2. The UC attribute came from the page-table or page-directory entry and > > processors are required to check their caches because the data may be cached > > due to page aliasing, which is not recommended. > > > > Lack of snooping can effectively cause data corruption and ordering issues, but > > at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing. > > > Even no #MC on guest RAM? > E.g. 
what if the guest effective memory type is UC/WC, and the host effective memory type > is WB? > (I tried on my machines with guest PAT=WC + host PAT=WB; it looks like no #MC, but I'm not sure > if there's anything I'm missing and it's only in my specific environment.) > > If no #MC, could the EPT type of guest RAM also be set to WB (without IPAT) even > without non-coherent DMA? No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's WC+ solves the most straightforward cases, e.g. WC+ snoops caches, and VMRUN and #VMEXIT flush the WC buffers to ensure that guest writes are visible to the host (and vice versa). That may or may not be sufficient for multi-threaded use cases, but I've no idea if there is actually anything to worry about on that front. I think there's also a flaw with the guest using UC, which IIUC doesn't snoop caches, i.e. the guest could get stale data. AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide something similar to safely let the guest control memtypes. Arguably, KVM should have such mechanisms anyways, e.g. to make non-coherent DMA VMs more robust. But even then, there's still the question of why, i.e. what would be the benefit of letting the guest control memtypes when it's not required for functional correctness, and would that benefit outweigh the cost. > > > For CR0_CD=1, > > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). > > > - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); > > > with IPAT=0, it may break (a), but meets (b) > > CR0.CD=1 is a mess above and beyond memtypes. Huh. It's even worse than I thought, > > because according to the SDM, Atom CPUs don't support no-fill mode: > > > > 3. Not supported in Intel Atom processors. If CD = 1 in an Intel Atom processor, > > caching is disabled. 
> > > > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK, > > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way > > to emulate no-fill mode. > > > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > > > Yang was trying to resolve issues with passthrough MMIO. > > > > > > > > * Sheng Yang > > > > : Do you mean host(qemu) would access this memory and if we set it to guest > > > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > > > : patch, for we encountered this in video ram when doing some experiment with > > > > : VGA assignment. > > > > > > > > And in the same thread, there's also what appears to be confirmation of Intel > > > > running into issues with Windows XP related to a guest device driver mapping > > > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > > > the fact that EPT and NPT both honor guest PAT by default. /facepalm > > > > > > My interpretation is that the since guest PATs are in guest page tables, > > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > > > check guest PATs to disallow host QEMU access to non-WB guest RAM. > > > > Ah, yeah, your interpretation makes sense. > > > > The best idea I can think of to support things like this is to have KVM grab the > > effective PAT memtype from the host userspace page tables, shove that into the > > EPT/NPT memtype, and then ignore guest PAT. I don't if that would actually work > > though. > Hmm, it might not work. E.g. in GPU, some MMIOs are mapped as UC-, while some > others as WC, even they belong to the same BAR. > I don't think host can know which one to choose in advance. > I think it should be also true to RAM range, guest can do memremap to a memory > type that host doesn't know beforehand. 
The goal wouldn't be to honor the guest's memtype; it would be to ensure correctness. E.g. the guest can do memremap all it wants, and KVM will always ignore the guest's memtype. ^ permalink raw reply [flat|nested] 16+ messages in thread
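The "force WB and ignore guest PAT" policy discussed in this exchange corresponds to how vmx_get_mt_mask() picks the EPT leaf memtype on the Intel side. A rough sketch follows, assuming the EPT leaf format (memtype in bits 5:3, ignore-PAT in bit 6) and eliding the CR0.CD handling and the actual MTRR lookup; the function name and signature are simplified for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMX_EPT_MT_EPTE_SHIFT 3           /* EPT memtype lives in bits 5:3 */
#define VMX_EPT_IPAT_BIT      (1ull << 6) /* "ignore guest PAT" bit        */

#define MTRR_TYPE_UNCACHABLE 0
#define MTRR_TYPE_WRBACK     6

/*
 * Simplified sketch of the EPT memtype decision for one guest page.
 * guest_mtrr_type stands in for what kvm_mtrr_get_guest_memory_type()
 * would return.
 */
static uint64_t ept_mt_mask(bool is_mmio, bool has_noncoherent_dma,
                            uint8_t guest_mtrr_type)
{
    /* Host-controlled MMIO must never be mapped cacheable. */
    if (is_mmio)
        return (uint64_t)MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

    /*
     * No non-coherent DMA attached: force WB and set IPAT so both
     * guest PAT and guest MTRRs are ignored entirely.
     */
    if (!has_noncoherent_dma)
        return ((uint64_t)MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
               VMX_EPT_IPAT_BIT;

    /* Otherwise honor the guest MTRR type (guest PAT still applies). */
    return (uint64_t)guest_mtrr_type << VMX_EPT_MT_EPTE_SHIFT;
}
```

The middle branch is the common case the thread keeps returning to: without non-coherent DMA, the guest's memtype is simply overridden with WB.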
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-06 22:34 ` Sean Christopherson @ 2023-11-07 9:26 ` Yan Zhao 2023-11-07 18:06 ` Sean Christopherson 0 siblings, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-11-07 9:26 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote: > On Wed, Nov 01, 2023, Yan Zhao wrote: > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even > > without non-coherent DMA? > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's > WC+ solves the most straightforward cases, e.g. WC+ snoops caches, and VMRUN and > #VMEXIT flush the WC buffers to ensure that guest writes are visible to the host > (and vice versa). That may or may not be sufficient for multi-threaded use cases, > but I've no idea if there is actually anything to worry about on that front. I > think there's also a flaw with the guest using UC, which IIUC doesn't snoop caches, > i.e. the guest could get stale data. > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide > something similar to safely let the guest control memtypes. Arguably, KVM should > have such mechanisms anyways, e.g. to make non-coherent DMA VMs more robust. > > But even then, there's still the question of why, i.e. what would be the benefit > of letting the guest control memtypes when it's not required for functional > correctness, and would that benefit outweigh the cost. Ok, so for a coherent device, if it's assigned together with a non-coherent device, and if there's a page with host PAT=WB and guest PAT=UC, we need to ensure host writes are flushed before guest reads/writes and guest DMA, though there's no need to worry about #MC, right? 
> > > > > For CR0_CD=1, > > > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). > > > > - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); > > > > with IPAT=0, it may breaks (a), but meets (b) > > > > > > CR0.CD=1 is a mess above and beyond memtypes. Huh. It's even worse than I thought, > > > because according to the SDM, Atom CPUs don't support no-fill mode: > > > > > > 3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor, > > > caching is disabled. > > > > > > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK, > > > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way > > > to emulate no-fill mode. > > > > > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > > > > Yang was trying to resolve issues with passthrough MMIO. > > > > > > > > > > * Sheng Yang > > > > > : Do you mean host(qemu) would access this memory and if we set it to guest > > > > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > > > > : patch, for we encountered this in video ram when doing some experiment with > > > > > : VGA assignment. > > > > > > > > > > And in the same thread, there's also what appears to be confirmation of Intel > > > > > running into issues with Windows XP related to a guest device driver mapping > > > > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > > > > the fact that EPT and NPT both honor guest PAT by default. /facepalm > > > > > > > > My interpretation is that the since guest PATs are in guest page tables, > > > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > > > > check guest PATs to disallow host QEMU access to non-WB guest RAM. > > > > > > Ah, yeah, your interpretation makes sense. 
> > > > > > The best idea I can think of to support things like this is to have KVM grab the > > > effective PAT memtype from the host userspace page tables, shove that into the > > > EPT/NPT memtype, and then ignore guest PAT. I don't know if that would actually work > > > though. > > Hmm, it might not work. E.g. in a GPU, some MMIOs are mapped as UC-, while some > > others as WC, even though they belong to the same BAR. > > I don't think the host can know which one to choose in advance. > > I think the same is true for RAM ranges: the guest can memremap to a memory > > type that the host doesn't know about beforehand. > > The goal wouldn't be to honor the guest's memtype; it would be to ensure correctness. > E.g. the guest can do memremap all it wants, and KVM will always ignore the guest's > memtype. AFAIK, some GPUs with TTM-based drivers may call set_pages_array_uc() to convert pages to PAT=UC- (e.g. for doorbells). Intel i915 can also vmap a page with PAT=WC (e.g. for some command buffers, see i915_gem_object_map_page()). It's not easy for the host to know which guest pages are allocated by the guest driver for such UC/WC conversions, and it would be a problem to map such pages as "WB + ignore guest PAT" if the device is non-coherent. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-07 9:26 ` Yan Zhao @ 2023-11-07 18:06 ` Sean Christopherson 2023-11-08 4:32 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Sean Christopherson @ 2023-11-07 18:06 UTC (permalink / raw) To: Yan Zhao; +Cc: Yibo Huang, kvm On Tue, Nov 07, 2023, Yan Zhao wrote: > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote: > > On Wed, Nov 01, 2023, Yan Zhao wrote: > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even > > > without non-coherent DMA? > > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT > > (and vice versa). That may or may not be sufficient for multi-threaded use cases, > > but I've no idea if there is actually anything to worry about on that front. I > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches, > > i.e. the guest could get stale data. > > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide > > something similar to safely let the guest control memtypes. Arguably, KVM should > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust. > > > > But even then, there's still the question of why, i.e. what would be the benefit > > of letting the guest control memtypes when it's not required for functional > > correctness, and would that benefit outweight the cost. > > Ok, so for a coherent device , if it's assigned together with a non-coherent > device, and if there's a page with host PAT = WB and guest PAT=UC, we need to > ensure the host write is flushed before guest read/write and guest DMA though no > need to worry about #MC, right? 
It's not even about devices, it applies to all non-MMIO memory, i.e. unless the host forces UC for a given page, there's potential for WB vs. WC/UC issues. > > > > > For CR0_CD=1, > > > > > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a). > > > > > - w/ KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b); > > > > > with IPAT=0, it may breaks (a), but meets (b) > > > > > > > > CR0.CD=1 is a mess above and beyond memtypes. Huh. It's even worse than I thought, > > > > because according to the SDM, Atom CPUs don't support no-fill mode: > > > > > > > > 3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor, > > > > caching is disabled. > > > > > > > > Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK, > > > > it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way > > > > to emulate no-fill mode. > > > > > > > > > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng > > > > > > Yang was trying to resolve issues with passthrough MMIO. > > > > > > > > > > > > * Sheng Yang > > > > > > : Do you mean host(qemu) would access this memory and if we set it to guest > > > > > > : MTRR, host access would be broken? We would cover this in our shadow MTRR > > > > > > : patch, for we encountered this in video ram when doing some experiment with > > > > > > : VGA assignment. > > > > > > > > > > > > And in the same thread, there's also what appears to be confirmation of Intel > > > > > > running into issues with Windows XP related to a guest device driver mapping > > > > > > DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the > > > > > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks > > > > > > the fact that EPT and NPT both honor guest PAT by default. 
/facepalm > > > > > > > > > > My interpretation is that the since guest PATs are in guest page tables, > > > > > while with EPT/NPT, guest page tables are not shadowed, it's not easy to > > > > > check guest PATs to disallow host QEMU access to non-WB guest RAM. > > > > > > > > Ah, yeah, your interpretation makes sense. > > > > > > > > The best idea I can think of to support things like this is to have KVM grab the > > > > effective PAT memtype from the host userspace page tables, shove that into the > > > > EPT/NPT memtype, and then ignore guest PAT. I don't if that would actually work > > > > though. > > > Hmm, it might not work. E.g. in GPU, some MMIOs are mapped as UC-, while some > > > others as WC, even they belong to the same BAR. > > > I don't think host can know which one to choose in advance. > > > I think it should be also true to RAM range, guest can do memremap to a memory > > > type that host doesn't know beforehand. > > > > The goal wouldn't be to honor guest memtype, it would be to ensure correctness. > > E.g. guest can do memremap all it wants, and KVM will always ignore the guest's > > memtype. > AFAIK, some GPUs with TTM driver may call set_pages_array_uc() to convert pages > to PAT=UC-(e.g. for doorbell). Intel i915 also could vmap a page with PAT=WC > (e.g. for some command buffer, see i915_gem_object_map_page()). > It's not easy for host to know which guest pages are allocated by guest driver > for such UC/WC conversion, and it should have problem to map such pages as "WB + > ignore guest PAT" if the device is non-coherent. Ah, right, I was thinking specifically of virtio-gpu, where there is more explicit coordination between guest and host regarding the buffers. Drat. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-07 18:06 ` Sean Christopherson @ 2023-11-08 4:32 ` Yan Zhao 2023-11-10 17:09 ` Sean Christopherson 0 siblings, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-11-08 4:32 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote: > On Tue, Nov 07, 2023, Yan Zhao wrote: > > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote: > > > On Wed, Nov 01, 2023, Yan Zhao wrote: > > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even > > > > without non-coherent DMA? > > > > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's > > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and > > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT > > > (and vice versa). That may or may not be sufficient for multi-threaded use cases, > > > but I've no idea if there is actually anything to worry about on that front. I > > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches, > > > i.e. the guest could get stale data. > > > > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide > > > something similar to safely let the guest control memtypes. Arguably, KVM should > > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust. > > > > > > But even then, there's still the question of why, i.e. what would be the benefit > > > of letting the guest control memtypes when it's not required for functional > > > correctness, and would that benefit outweight the cost. 
> > > > Ok, so for a coherent device, if it's assigned together with a non-coherent > > device, and if there's a page with host PAT=WB and guest PAT=UC, we need to > > ensure host writes are flushed before guest reads/writes and guest DMA, though there's no > > need to worry about #MC, right? > > It's not even about devices; it applies to all non-MMIO memory, i.e. unless the > host forces UC for a given page, there's potential for WB vs. WC/UC issues. Do you think we could have KVM expose an ioctl for QEMU to call in QEMU's invalidate_and_set_dirty() or in cpu_physical_memory_set_dirty_range()? The ioctl would do nothing if non-coherent DMA is not attached, and call clflush otherwise. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-08 4:32 ` Yan Zhao @ 2023-11-10 17:09 ` Sean Christopherson 2023-11-13 8:07 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Sean Christopherson @ 2023-11-10 17:09 UTC (permalink / raw) To: Yan Zhao; +Cc: Yibo Huang, kvm On Wed, Nov 08, 2023, Yan Zhao wrote: > On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote: > > On Tue, Nov 07, 2023, Yan Zhao wrote: > > > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote: > > > > On Wed, Nov 01, 2023, Yan Zhao wrote: > > > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > > > > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even > > > > > without non-coherent DMA? > > > > > > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's > > > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and > > > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT > > > > (and vice versa). That may or may not be sufficient for multi-threaded use cases, > > > > but I've no idea if there is actually anything to worry about on that front. I > > > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches, > > > > i.e. the guest could get stale data. > > > > > > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide > > > > something similar to safely let the guest control memtypes. Arguably, KVM should > > > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust. > > > > > > > > But even then, there's still the question of why, i.e. what would be the benefit > > > > of letting the guest control memtypes when it's not required for functional > > > > correctness, and would that benefit outweight the cost. 
> > > > > > Ok, so for a coherent device , if it's assigned together with a non-coherent > > > device, and if there's a page with host PAT = WB and guest PAT=UC, we need to > > > ensure the host write is flushed before guest read/write and guest DMA though no > > > need to worry about #MC, right? > > > > It's not even about devices, it applies to all non-MMIO memory, i.e. unless the > > host forces UC for a given page, there's potential for WB vs. WC/UC issues. > Do you think we can have KVM to expose an ioctl for QEMU to call in QEMU's > invalidate_and_set_dirty() or in cpu_physical_memory_set_dirty_range()? > > In this ioctl, it can do nothing if non-coherent DMA is not attached and > call clflush otherwise. Why add an ioctl()? Userspace can do CLFLUSH{OPT} directly. If it would fix a real problem, then adding some way for userspace to query whether or not there is non-coherent DMA would be reasonable, though that seems like something that should be in VFIO (if it's not already there). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms 2023-11-10 17:09 ` Sean Christopherson @ 2023-11-13 8:07 ` Yan Zhao 0 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-11-13 8:07 UTC (permalink / raw) To: Sean Christopherson; +Cc: Yibo Huang, kvm On Fri, Nov 10, 2023 at 09:09:33AM -0800, Sean Christopherson wrote: > On Wed, Nov 08, 2023, Yan Zhao wrote: > > On Tue, Nov 07, 2023 at 10:06:02AM -0800, Sean Christopherson wrote: > > > On Tue, Nov 07, 2023, Yan Zhao wrote: > > > > On Mon, Nov 06, 2023 at 02:34:08PM -0800, Sean Christopherson wrote: > > > > > On Wed, Nov 01, 2023, Yan Zhao wrote: > > > > > > On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote: > > > > > > > > > > If no #MC, could EPT type of guest RAM also be set to WB (without IPAT) even > > > > > > without non-coherent DMA? > > > > > > > > > > No, there are snooping/ordering issues on Intel, and to a lesser extent AMD. AMD's > > > > > WC+ solves the most straightfoward cases, e.g. WC+ snoops caches, and VMRUN and > > > > > #VMEXIT flush the WC buffers to ensure that guest writes are visible and #VMEXIT > > > > > (and vice versa). That may or may not be sufficient for multi-threaded use cases, > > > > > but I've no idea if there is actually anything to worry about on that front. I > > > > > think there's also a flaw with guest using UC, which IIUC doesn't snoop caches, > > > > > i.e. the guest could get stale data. > > > > > > > > > > AFAIK, Intel CPUs don't provide anything like WC+, so KVM would have to provide > > > > > something similar to safely let the guest control memtypes. Arguably, KVM should > > > > > have such mechansisms anyways, e.g. to make non-coherent DMA VMs more robust. > > > > > > > > > > But even then, there's still the question of why, i.e. 
what would be the benefit > > > > > of letting the guest control memtypes when it's not required for functional > > > > > correctness, and would that benefit outweigh the cost. > > > > > > > > Ok, so for a coherent device, if it's assigned together with a non-coherent > > > > device, and if there's a page with host PAT=WB and guest PAT=UC, we need to > > > > ensure host writes are flushed before guest reads/writes and guest DMA, though there's no > > > > need to worry about #MC, right? > > > > > > It's not even about devices; it applies to all non-MMIO memory, i.e. unless the > > > host forces UC for a given page, there's potential for WB vs. WC/UC issues. > > Do you think we could have KVM expose an ioctl for QEMU to call in QEMU's > > invalidate_and_set_dirty() or in cpu_physical_memory_set_dirty_range()? > > > > The ioctl would do nothing if non-coherent DMA is not attached and > > call clflush otherwise. > > Why add an ioctl()? Userspace can do CLFLUSH{OPT} directly. If it would fix a > real problem, then adding some way for userspace to query whether or not there > is non-coherent DMA would be reasonable, though that seems like something that > should be in VFIO (if it's not already there). Ah, right. I previously thought that, with an ioctl(), KVM could further skip the clflush when TDP is not enabled. But it's not a real problem so far, as I didn't manage to devise a case that proves the WB vs. WC/UC issues (i.e. in my devised cases, even with guest memory mapped as WC, the host can still get the latest data with WB...). I may come back to this later if it proves to be a real issue in the future. :) ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-11-13 8:35 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-12 23:04 A question about how the KVM emulates the effect of guest MTRRs on AMD platforms Yibo Huang
2023-10-27 23:13 ` Sean Christopherson
2023-10-30 12:16 ` Yan Zhao
2023-10-30 19:24 ` Sean Christopherson
[not found] ` <3E43ADC6-E817-411A-9EBF-B16142B9B478@cs.utexas.edu>
2023-10-30 21:52 ` Sean Christopherson
2023-11-01 3:07 ` Yibo Huang
2023-10-31 10:01 ` Yan Zhao
2023-10-31 15:14 ` Sean Christopherson
2023-11-01 3:53 ` Huang, Kai
2023-11-01 9:08 ` Yan Zhao
2023-11-06 22:34 ` Sean Christopherson
2023-11-07 9:26 ` Yan Zhao
2023-11-07 18:06 ` Sean Christopherson
2023-11-08 4:32 ` Yan Zhao
2023-11-10 17:09 ` Sean Christopherson
2023-11-13 8:07 ` Yan Zhao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox