public inbox for kvm@vger.kernel.org
From: Sean Christopherson <seanjc@google.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>,
	iommu@lists.linux.dev, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, alex.williamson@redhat.com,
	pbonzini@redhat.com,  joro@8bytes.org, will@kernel.org,
	robin.murphy@arm.com, kevin.tian@intel.com,
	 baolu.lu@linux.intel.com, dwmw2@infradead.org,
	yi.l.liu@intel.com
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU
Date: Mon, 4 Dec 2023 11:22:49 -0800	[thread overview]
Message-ID: <ZW4nCUS9VDk0DycG@google.com> (raw)
In-Reply-To: <20231204173028.GJ1493156@nvidia.com>

On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> 
> > There are more approaches beyond having IOMMUFD and KVM be
> > completely separate entities.  E.g. extract the bulk of KVM's "TDP
> > MMU" implementation to common code so that IOMMUFD doesn't need to
> > reinvent the wheel.
> 
> We've pretty much done this already, it is called "hmm" and it is what
> the IO world uses. Merging/splitting huge page is just something that
> needs some coding in the page table code, that people want for other
> reasons anyhow.

Not really.  HMM is a wildly different implementation than KVM's TDP MMU.  At a
glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
while walking the "secondary" HMM page tables.

KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
MMU.  The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
the primary MMU are largely orthogonal.  E.g. getting a PFN from guest_memfd
instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
instead of __gfn_to_pfn_memslot(); the MMU proper doesn't care how the PFN was
resolved.  I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.
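To make that separation concrete, here's a userspace toy sketch (every name,
signature, and offset below is made up for illustration; the real helpers take
far more arguments): the fault path only needs a pluggable resolver, and the
"MMU proper" never sees where the PFN came from.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef uint64_t kvm_pfn_t;

/* Hypothetical backend signature: how a GFN gets resolved to a PFN. */
typedef kvm_pfn_t (*pfn_resolver_t)(gfn_t gfn);

/* Toy stand-ins for the primary-MMU path (__gfn_to_pfn_memslot()) and the
 * guest_memfd path (kvm_gmem_get_pfn()); the offsets are fake. */
static kvm_pfn_t resolve_via_primary_mmu(gfn_t gfn)  { return gfn + 0x1000; }
static kvm_pfn_t resolve_via_guest_memfd(gfn_t gfn)  { return gfn + 0x2000; }

/* The "MMU proper": installs a mapping without caring how the PFN was found. */
static kvm_pfn_t fault_in_gfn(gfn_t gfn, pfn_resolver_t resolve)
{
    kvm_pfn_t pfn = resolve(gfn);
    /* ...build SPTEs mapping gfn -> pfn here... */
    return pfn;
}
```

Swapping the resolver is the whole difference between the two backends; the
page-table-building code is untouched.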

> > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
> >   mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
> >   hugepage mitigation, etc.
> 
> Does it? I think that just remains isolated in kvm. The output from
> KVM is only a radix table top pointer, it is up to KVM how to manage
> it still.

Oh, I didn't mean from a code perspective, I meant from a behavioral perspective.
E.g. there's no reason to disallow huge mappings in the IOMMU just because the
CPU is vulnerable to iTLB multi-hit.

> > I'm not convinced that memory consumption is all that interesting.  If a VM is
> > mapping the majority of memory into a device, then odds are good that the guest
> > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> > overhead for pages tables is quite small, especially relative to the total amount
> > of memory overheads for such systems.
> 
> AFAIK the main argument is performance. It is similar to why we want
> to do IOMMU SVA with MM page table sharing.
> 
> If IOMMU mirrors/shadows/copies a page table using something like HMM
> techniques then the invalidations will mark ranges of IOVA as
> non-present and faults will occur to trigger hmm_range_fault to do the
> shadowing.
>
> This means that pretty much all IO will always encounter a non-present
> fault, certainly at the start and maybe worse while ongoing.
> 
> On the other hand, if we share the exact page table then natural CPU
> touches will usually make the page present before an IO happens in
> almost all cases and we don't have to take the horribly expensive IO
> page fault at all.

I'm not advocating mirroring/copying/shadowing page tables between KVM and the
IOMMU.  I'm suggesting managing IOMMU page tables mostly independently, but reusing
KVM code to do so.

I wouldn't even be opposed to KVM outright managing the IOMMU's page tables.  E.g.
add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
rather similar to this series.
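A rough sketch of what that role-bit approach might look like (the real
"union kvm_mmu_page_role" in arch/x86/include/asm/kvm_host.h has many more
fields; the "iommu" bit here is purely hypothetical): because pages with
different roles are never shared, one extra bit is enough to give the IOMMU
its own tree of shadow pages with divergent behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Loose, illustrative sketch of "union kvm_mmu_page_role"; field layout
 * and names do not match the real definition. */
union toy_mmu_page_role {
    uint32_t word;
    struct {
        uint32_t level:4;
        uint32_t has_4_byte_gpte:1;
        uint32_t ad_disabled:1;
        uint32_t guest_mode:1;
        uint32_t iommu:1;      /* hypothetical: page belongs to an IOMMU tree */
        uint32_t reserved:24;
    };
};

/* Shadow pages are only reused when the full role matches, so setting the
 * iommu bit automatically forks a separate set of pages. */
static int role_matches(union toy_mmu_page_role a, union toy_mmu_page_role b)
{
    return a.word == b.word;
}
```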

What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.

Yes, sharing page tables will Just Work for faulting in memory, but the downside
is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
will also impact the IO path.  My understanding is that IO page faults are at least
an order of magnitude more expensive than CPU page faults.  That means that what's
optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
tables.

E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
logging is not a viable option for the IOMMU because the latency of the resulting
IOPF is too high.  Forcing KVM to use D-bit dirty logging for CPUs just because
the VM has passthrough (mediated?) devices would likely be a non-starter.
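A toy model of the two dirty-logging schemes (the bit positions are
illustrative, not real EPT encodings) shows why the difference matters for the
IO path: write-protection turns the *first write* to every logged page into a
fault, which is tolerable for CPUs but would be an expensive IOPF for devices,
whereas D-bit logging takes no fault at all and harvests dirty state lazily.

```c
#include <assert.h>
#include <stdint.h>

#define SPTE_W  (1u << 1)   /* writable (toy bit position) */
#define SPTE_D  (1u << 9)   /* hardware dirty bit (toy bit position) */

/* Write-protect logging: W is cleared up front, so the first write faults;
 * the fault handler logs the page dirty and restores write access. */
static void wp_write(uint32_t *spte, int *faults)
{
    if (!(*spte & SPTE_W)) {
        (*faults)++;        /* for a device, this would be an IOPF */
        *spte |= SPTE_W;    /* handler re-enables writes, marks page dirty */
    }
}

/* D-bit logging: the write proceeds immediately and hardware sets D;
 * dirty state is collected later by scanning and clearing D bits. */
static void dbit_write(uint32_t *spte, int *faults)
{
    (void)faults;           /* no fault is ever taken */
    *spte |= SPTE_D;
}
```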

One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
we will end up having to revert/reject changes that benefit KVM's usage due to
regressing the IOMMU usage.

If instead KVM treats IOMMU page tables as their own thing, then we can have
divergent behavior as needed, e.g. different dirty logging algorithms, different
software-available bits, etc.  It would also allow us to define new ABI instead
of trying to reconcile the many incompatibilities and warts in KVM's existing ABI.
E.g. off the top of my head:

 - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
   memory.

 - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
   doesn't support A/D bits or because the admin turned them off via KVM's
   enable_ept_ad_bits module param.

 - Write-protecting GFNs for shadow paging when L1 is running nested VMs.  KVM's
   ABI can be that device writes to L1's page tables are exempt.

 - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
   any memslot is deleted" ABI.

> We were not able to make bi-dir notifiers with the CPU mm, I'm
> not sure that is "relatively easy" :(

I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
same".

It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
to manage IOMMU page tables, then KVM could simply install mappings for multiple
sets of page tables as appropriate.
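A minimal sketch of such a one-way notifier (all names here are hypothetical,
not a proposed API): KVM fires the callback after faulting in a GFN and never
waits on or synchronizes with the importer, which is what keeps it far simpler
than a bi-directional mirror.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t gfn_t;

/* Hypothetical fire-and-forget notifier: purely advisory, no return value,
 * no completion the producer has to wait on. */
struct tdp_prefault_notifier {
    void (*gfn_faulted)(struct tdp_prefault_notifier *n, gfn_t gfn);
};

static struct tdp_prefault_notifier *registered;

/* Called by KVM after it installs a mapping for @gfn. */
static void kvm_notify_gfn_faulted(gfn_t gfn)
{
    if (registered && registered->gfn_faulted)
        registered->gfn_faulted(registered, gfn);   /* fire and forget */
}

/* Toy importer standing in for IOMMUFD: it may pre-fault the same GFN into
 * its own tables, or ignore the hint entirely. */
static int prefaults;

static void iommufd_prefault(struct tdp_prefault_notifier *n, gfn_t gfn)
{
    (void)n;
    (void)gfn;
    prefaults++;    /* would map gfn into the IOMMU page tables here */
}

static struct tdp_prefault_notifier iommufd_notifier = {
    .gfn_faulted = iommufd_prefault,
};
```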

Thread overview: 73+ messages
2023-12-02  9:12 [RFC PATCH 00/42] Sharing KVM TDP to IOMMU Yan Zhao
2023-12-02  9:13 ` [RFC PATCH 01/42] KVM: Public header for KVM to export TDP Yan Zhao
2023-12-02  9:15 ` [RFC PATCH 02/42] KVM: x86: Arch header for kvm to export TDP for Intel Yan Zhao
2023-12-02  9:15 ` [RFC PATCH 03/42] KVM: Introduce VM ioctl KVM_CREATE_TDP_FD Yan Zhao
2023-12-02  9:16 ` [RFC PATCH 04/42] KVM: Skeleton of KVM TDP FD object Yan Zhao
2023-12-02  9:16 ` [RFC PATCH 05/42] KVM: Embed "arch" object and call arch init/destroy in TDP FD Yan Zhao
2023-12-02  9:17 ` [RFC PATCH 06/42] KVM: Register/Unregister importers to KVM exported TDP Yan Zhao
2023-12-02  9:18 ` [RFC PATCH 07/42] KVM: Forward page fault requests to arch specific code for " Yan Zhao
2023-12-02  9:18 ` [RFC PATCH 08/42] KVM: Add a helper to notify importers that KVM exported TDP is flushed Yan Zhao
2023-12-02  9:19 ` [RFC PATCH 09/42] iommu: Add IOMMU_DOMAIN_KVM Yan Zhao
2023-12-02  9:20 ` [RFC PATCH 10/42] iommu: Add new iommu op to create domains managed by KVM Yan Zhao
2023-12-04 15:09   ` Jason Gunthorpe
2023-12-02  9:20 ` [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm Yan Zhao
2023-12-04 15:09   ` Jason Gunthorpe
2023-12-05  6:40     ` Yan Zhao
2023-12-05 14:52       ` Jason Gunthorpe
2023-12-06  1:00         ` Yan Zhao
2023-12-02  9:21 ` [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT Yan Zhao
2023-12-04 18:29   ` Jason Gunthorpe
2023-12-05  7:08     ` Yan Zhao
2023-12-05 14:53       ` Jason Gunthorpe
2023-12-06  0:58         ` Yan Zhao
2023-12-02  9:21 ` [RFC PATCH 13/42] iommufd: Add a KVM HW pagetable object Yan Zhao
2023-12-02  9:22 ` [RFC PATCH 14/42] iommufd: Enable KVM HW page table object to be proxy between KVM and IOMMU Yan Zhao
2023-12-04 18:34   ` Jason Gunthorpe
2023-12-05  7:09     ` Yan Zhao
2023-12-02  9:22 ` [RFC PATCH 15/42] iommufd: Add iopf handler to KVM hw pagetable Yan Zhao
2023-12-02  9:23 ` [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT Yan Zhao
2023-12-04 18:36   ` Jason Gunthorpe
2023-12-05  7:14     ` Yan Zhao
2023-12-05 14:53       ` Jason Gunthorpe
2023-12-06  0:55         ` Yan Zhao
2023-12-02  9:23 ` [RFC PATCH 17/42] iommu/vt-d: Make some macros and helpers to be extern Yan Zhao
2023-12-02  9:24 ` [RFC PATCH 18/42] iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU Yan Zhao
2023-12-02  9:24 ` [RFC PATCH 19/42] iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is enforced Yan Zhao
2023-12-02  9:25 ` [RFC PATCH 20/42] iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain Yan Zhao
2023-12-02  9:26 ` [RFC PATCH 21/42] iommu/vt-d: Check reserved bits for " Yan Zhao
2023-12-02  9:26 ` [RFC PATCH 22/42] iommu/vt-d: Support cache invalidate of " Yan Zhao
2023-12-02  9:26 ` [RFC PATCH 23/42] iommu/vt-d: Allow pasid 0 in IOPF Yan Zhao
2023-12-02  9:27 ` [RFC PATCH 24/42] KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59 Yan Zhao
2023-12-02  9:27 ` [RFC PATCH 25/42] KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu" Yan Zhao
2023-12-02  9:28 ` [RFC PATCH 26/42] KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops Yan Zhao
2023-12-02  9:28 ` [RFC PATCH 27/42] KVM: x86/mmu: change param "vcpu" to "kvm" in kvm_mmu_hugepage_adjust() Yan Zhao
2023-12-02  9:29 ` [RFC PATCH 28/42] KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track() Yan Zhao
2023-12-02  9:29 ` [RFC PATCH 29/42] KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level() Yan Zhao
2023-12-02  9:30 ` [RFC PATCH 30/42] KVM: x86/mmu: remove param "vcpu" from kvm_calc_tdp_mmu_root_page_role() Yan Zhao
2023-12-02  9:30 ` [RFC PATCH 31/42] KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn() Yan Zhao
2023-12-02  9:31 ` [RFC PATCH 32/42] KVM: x86/mmu: add extra param "kvm" to make_mmio_spte() Yan Zhao
2023-12-02  9:31 ` [RFC PATCH 33/42] KVM: x86/mmu: add extra param "kvm" to make_spte() Yan Zhao
2023-12-02  9:32 ` [RFC PATCH 34/42] KVM: x86/mmu: add extra param "kvm" to tdp_mmu_map_handle_target_level() Yan Zhao
2023-12-02  9:32 ` [RFC PATCH 35/42] KVM: x86/mmu: Get/Put TDP root page to be exported Yan Zhao
2023-12-02  9:33 ` [RFC PATCH 36/42] KVM: x86/mmu: Keep exported TDP root valid Yan Zhao
2023-12-02  9:33 ` [RFC PATCH 37/42] KVM: x86: Implement KVM exported TDP fault handler on x86 Yan Zhao
2023-12-02  9:35 ` [RFC PATCH 38/42] KVM: x86: "compose" and "get" interface for meta data of exported TDP Yan Zhao
2023-12-02  9:35 ` [RFC PATCH 39/42] KVM: VMX: add config KVM_INTEL_EXPORTED_EPT Yan Zhao
2023-12-02  9:36 ` [RFC PATCH 40/42] KVM: VMX: Compose VMX specific meta data for KVM exported TDP Yan Zhao
2023-12-02  9:36 ` [RFC PATCH 41/42] KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on Yan Zhao
2023-12-02  9:37 ` [RFC PATCH 42/42] KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM flushes EPT Yan Zhao
2023-12-04 15:08 ` [RFC PATCH 00/42] Sharing KVM TDP to IOMMU Jason Gunthorpe
2023-12-04 16:38   ` Sean Christopherson
2023-12-05  1:31     ` Yan Zhao
2023-12-05  6:45       ` Tian, Kevin
2023-12-05  1:52   ` Yan Zhao
2023-12-05  6:30   ` Tian, Kevin
2023-12-04 17:00 ` Sean Christopherson
2023-12-04 17:30   ` Jason Gunthorpe
2023-12-04 19:22     ` Sean Christopherson [this message]
2023-12-04 19:50       ` Jason Gunthorpe
2023-12-04 20:11         ` Sean Christopherson
2023-12-04 23:49           ` Jason Gunthorpe
2023-12-05  7:17         ` Tian, Kevin
2023-12-05  5:53       ` Yan Zhao
2023-12-05  3:51   ` Yan Zhao
