From: "Huang, Kai" <kai.huang@intel.com>
To: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"seanjc@google.com" <seanjc@google.com>,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "sagis@google.com" <sagis@google.com>,
"dmatlack@google.com" <dmatlack@google.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"isaku.yamahata@gmail.com" <isaku.yamahata@gmail.com>,
"Zhao, Yan Y" <yan.y.zhao@intel.com>,
"Aktas, Erdem" <erdemaktas@google.com>
Subject: Re: [PATCH 02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion
Date: Wed, 15 May 2024 13:24:45 +0000 [thread overview]
Message-ID: <b89385e5c7f4c3e5bc97045ec909455c33652fb1.camel@intel.com> (raw)
In-Reply-To: <20240515005952.3410568-3-rick.p.edgecombe@intel.com>
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
>
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
> SPTEs when deleting a memslot.
>
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and more importantly, it's due to a bug [1] which caused VM instability
> when a VM is with Nvidia Geforce GPU assigned.
>
> There's a previous attempt [2] to introduce a per-VM flag to workaround bug
> [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
> However, [2] was not merged due to lacking of a clear explanation of
> exactly what is broken [3] and it's not wise to "have a bug that is known
> to happen when you enable the capability".
>
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for reasons:
> - TDX requires root page of private page table remains unaltered throughout
> the TD life cycle.
> - TDX mandates that leaf entries in private page table must be zapped prior
> to non-leaf entries.
>
> So, Sean re-considered about introducing a per-VM flag or per-memslot flag
> again for VMs like TDX. [4]
>
> This patch is an implementation of per-memslot flag.
> Compared to per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior in fine-grained
> granularity, optimizations for specific use cases can be developed
> without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions by
> changing KVM behavior, as seen previously.
>
> Cons:
> (1) Users need to ensure all necessary memslots are with flag
> KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD
> memslot is with ZAP_LEAFS_ONLY flag for TDX VM.
> (2) Opens up the possibility that userspace could configure memslots for
> normal VM in such a way that the bug [1] is seen.
I don't quite follow the logic why userspace should be involved.
TDX cannot use "page table fast zap", and need to use a different way to
zap, a.k.a, zap-leaf-only while holding MMU write lock, but this doesn't
necessarily mean such thing should be exposed to userspace?
It's weird that userspace needs to control how does KVM zap page table for
memslot delete/move.
The [2] mentions there are performance improvement for certain VMs if KVM
does zap-leaf-only, but AFAICT it doesn't provide concrete argument why it
needs to be exposed to userspace so it can be done as per-vm (that whole
thread basically was talking about the bug in [1]). E.g., see:
https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m702b273057cc318465cb5a1677d94e923dce9832
"
Ya, a capability is a bad idea. I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense. But adding something to the ABI on pure speculation is
silly.
"
So to me looks it's overkill to expose this "zap-leaf-only" to userspace.
We can just set this flag for a TDX guest when memslot is created in KVM.
>
> However, one thing deserves noting for TDX, is that TDX may potentially
> meet bug [1] for either per-memslot flag or per-VM flag approach, since
> there's a usage in radar to assign an untrusted & passthrough GPU device
> in TDX. If that happens, it can be treated as a bug (not regression) and
> fixed accordingly.
>
> An alternative approach we can also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But
> downside is that this approach requires creating new very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
>
> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com [4]
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
> - New patch
> ---
> arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 17 +++++++++++++++++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 5 ++++-
> 4 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> + return;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> + };
> +
> + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> + kvm_mmu_zap_memslot_leafs(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> return -EINVAL;
>
> + /*
> + * Since TDX private pages requires re-accepting after zap,
> + * and TDX private root page should not be zapped, TDX requires
> + * memslots for private memory must have flag
> + * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> + * the deleting memslot will be zapped and SPTEs in other
> + * memslots would not be affected.
> + */
> + if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> + (new->flags & KVM_MEM_GUEST_MEMFD) &&
> + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> + return -EINVAL;
> +
> + /* zap-leafs-only works only when TDP MMU is enabled for now */
> + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> + return -EINVAL;
If this zap-leaf-only is supposed to be generic, I don't see why we want
to make it only for TDP MMU?
next prev parent reply other threads:[~2024-05-15 13:24 UTC|newest]
Thread overview: 152+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-15 0:59 [PATCH 00/16] TDX MMU prep series part 1 Rick Edgecombe
2024-05-15 0:59 ` [PATCH 01/16] KVM: x86: Add a VM type define for TDX Rick Edgecombe
2024-05-15 0:59 ` [PATCH 02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion Rick Edgecombe
2024-05-15 13:24 ` Huang, Kai [this message]
2024-05-15 19:09 ` Sean Christopherson
2024-05-15 19:23 ` Edgecombe, Rick P
2024-05-15 20:05 ` Sean Christopherson
2024-05-15 20:53 ` Edgecombe, Rick P
2024-05-15 22:47 ` Sean Christopherson
2024-05-15 23:06 ` Huang, Kai
2024-05-15 23:20 ` Sean Christopherson
2024-05-15 23:36 ` Huang, Kai
2024-05-16 1:12 ` Xiaoyao Li
2024-05-17 15:30 ` Paolo Bonzini
2024-05-22 1:29 ` Yan Zhao
2024-05-22 2:31 ` Sean Christopherson
2024-05-22 6:48 ` Yan Zhao
2024-05-22 15:45 ` Paolo Bonzini
2024-05-24 1:50 ` Yan Zhao
2024-05-15 23:56 ` Edgecombe, Rick P
2024-05-16 2:21 ` Edgecombe, Rick P
2024-05-16 3:56 ` Yan Zhao
2024-05-17 15:27 ` Paolo Bonzini
2024-05-17 15:25 ` Paolo Bonzini
2024-05-15 18:03 ` Isaku Yamahata
2024-05-15 0:59 ` [PATCH 03/16] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Rick Edgecombe
2024-05-17 7:44 ` Chao Gao
2024-05-17 9:08 ` Isaku Yamahata
2024-05-15 0:59 ` [PATCH 04/16] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA Rick Edgecombe
2024-05-15 22:34 ` Huang, Kai
2024-05-15 23:21 ` Edgecombe, Rick P
2024-05-15 23:31 ` Huang, Kai
2024-05-15 23:38 ` Edgecombe, Rick P
2024-05-15 23:44 ` Huang, Kai
2024-05-15 23:59 ` Edgecombe, Rick P
2024-05-16 0:12 ` Huang, Kai
2024-05-16 0:19 ` Edgecombe, Rick P
2024-05-16 0:25 ` Huang, Kai
2024-05-16 0:35 ` Edgecombe, Rick P
2024-05-16 1:04 ` Huang, Kai
2024-05-16 1:20 ` Edgecombe, Rick P
2024-05-16 1:40 ` Huang, Kai
2024-05-16 5:52 ` Yan Zhao
2024-05-18 0:25 ` Edgecombe, Rick P
2024-05-16 23:08 ` Edgecombe, Rick P
2024-05-17 0:37 ` Huang, Kai
2024-05-17 1:51 ` Edgecombe, Rick P
2024-05-17 4:26 ` Huang, Kai
2024-05-17 21:12 ` Edgecombe, Rick P
2024-05-15 0:59 ` [PATCH 05/16] KVM: Add member to struct kvm_gfn_range for target alias Rick Edgecombe
2024-05-17 20:58 ` Edgecombe, Rick P
2024-05-15 0:59 ` [PATCH 06/16] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role Rick Edgecombe
2024-05-15 0:59 ` [PATCH 07/16] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page Rick Edgecombe
2024-05-15 0:59 ` [PATCH 08/16] KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX Rick Edgecombe
2024-05-15 13:27 ` Huang, Kai
2024-05-15 15:22 ` Edgecombe, Rick P
2024-05-15 23:14 ` Huang, Kai
2024-05-15 15:34 ` Sean Christopherson
2024-05-15 15:49 ` Edgecombe, Rick P
2024-05-15 15:56 ` Edgecombe, Rick P
2024-05-15 16:02 ` Sean Christopherson
2024-05-15 16:12 ` Edgecombe, Rick P
2024-05-15 18:09 ` Sean Christopherson
2024-05-15 18:22 ` Edgecombe, Rick P
2024-05-15 19:48 ` Sean Christopherson
2024-05-15 20:32 ` Edgecombe, Rick P
2024-05-15 23:26 ` Sean Christopherson
2024-05-15 16:22 ` Isaku Yamahata
2024-05-15 22:17 ` Huang, Kai
2024-05-15 23:14 ` Edgecombe, Rick P
2024-05-15 23:38 ` Huang, Kai
2024-05-16 0:13 ` Edgecombe, Rick P
2024-05-16 0:27 ` Isaku Yamahata
2024-05-16 1:11 ` Huang, Kai
2024-05-16 0:15 ` Isaku Yamahata
2024-05-16 0:52 ` Edgecombe, Rick P
2024-05-16 1:21 ` Huang, Kai
2024-05-16 17:27 ` Isaku Yamahata
2024-05-16 21:46 ` Edgecombe, Rick P
2024-05-16 22:23 ` Huang, Kai
2024-05-16 22:38 ` Edgecombe, Rick P
2024-05-16 23:16 ` Huang, Kai
2024-05-15 0:59 ` [PATCH 09/16] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Rick Edgecombe
2024-05-15 0:59 ` [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU Rick Edgecombe
2024-05-15 17:35 ` Isaku Yamahata
2024-05-15 18:00 ` Edgecombe, Rick P
2024-05-16 0:52 ` Huang, Kai
2024-05-16 1:27 ` Edgecombe, Rick P
2024-05-16 2:07 ` Huang, Kai
2024-05-16 2:57 ` Edgecombe, Rick P
2024-05-16 13:04 ` Huang, Kai
2024-05-16 16:36 ` Edgecombe, Rick P
2024-05-16 19:42 ` Isaku Yamahata
2024-05-17 2:35 ` Edgecombe, Rick P
2024-05-17 9:03 ` Isaku Yamahata
2024-05-17 18:16 ` Edgecombe, Rick P
2024-05-17 19:16 ` Isaku Yamahata
2024-05-20 23:32 ` Isaku Yamahata
2024-05-21 15:07 ` Edgecombe, Rick P
2024-05-21 16:15 ` Isaku Yamahata
2024-05-22 22:34 ` Isaku Yamahata
2024-05-22 23:09 ` Edgecombe, Rick P
2024-05-22 23:47 ` Isaku Yamahata
2024-05-22 23:50 ` Edgecombe, Rick P
2024-05-23 0:01 ` Isaku Yamahata
2024-05-23 18:27 ` Edgecombe, Rick P
2024-05-24 7:55 ` Isaku Yamahata
2024-05-28 16:27 ` Edgecombe, Rick P
2024-05-28 17:47 ` Paolo Bonzini
2024-05-29 2:13 ` Edgecombe, Rick P
2024-05-29 7:25 ` Paolo Bonzini
2024-05-31 14:11 ` Isaku Yamahata
2024-05-28 17:43 ` Paolo Bonzini
2024-05-28 17:16 ` Paolo Bonzini
2024-05-28 18:29 ` Edgecombe, Rick P
2024-05-29 1:06 ` Isaku Yamahata
2024-05-29 1:51 ` Edgecombe, Rick P
2024-05-17 2:36 ` Huang, Kai
2024-05-17 8:14 ` Isaku Yamahata
2024-05-18 5:42 ` Huang, Kai
2024-05-18 15:41 ` Edgecombe, Rick P
2024-05-20 10:38 ` Huang, Kai
2024-05-20 18:58 ` Isaku Yamahata
2024-05-20 19:02 ` Edgecombe, Rick P
2024-05-20 23:39 ` Edgecombe, Rick P
2024-05-21 2:25 ` Isaku Yamahata
2024-05-21 2:57 ` Edgecombe, Rick P
2024-05-20 22:34 ` Huang, Kai
2024-05-16 1:48 ` Isaku Yamahata
2024-05-16 2:00 ` Edgecombe, Rick P
2024-05-16 2:10 ` Huang, Kai
2024-05-28 16:59 ` Paolo Bonzini
2024-05-16 17:10 ` Isaku Yamahata
2024-05-23 23:14 ` Edgecombe, Rick P
2024-05-24 8:20 ` Isaku Yamahata
2024-05-28 21:48 ` Edgecombe, Rick P
2024-05-29 1:16 ` Isaku Yamahata
2024-05-29 1:50 ` Edgecombe, Rick P
2024-05-29 2:20 ` Isaku Yamahata
2024-05-29 2:29 ` Edgecombe, Rick P
2024-05-28 20:54 ` Edgecombe, Rick P
2024-05-29 1:24 ` Isaku Yamahata
2024-05-28 23:06 ` Edgecombe, Rick P
2024-05-29 1:57 ` Isaku Yamahata
2024-05-29 2:13 ` Edgecombe, Rick P
2024-05-29 16:55 ` Isaku Yamahata
2024-05-15 0:59 ` [PATCH 11/16] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() Rick Edgecombe
2024-05-15 0:59 ` [PATCH 12/16] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type Rick Edgecombe
2024-05-15 0:59 ` [PATCH 13/16] KVM: x86/tdp_mmu: Introduce shared, private KVM MMU root types Rick Edgecombe
2024-05-15 0:59 ` [PATCH 14/16] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rick Edgecombe
2024-05-15 0:59 ` [PATCH 15/16] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process Rick Edgecombe
2024-05-15 0:59 ` [PATCH 16/16] KVM: x86/tdp_mmu: Invalidate correct roots Rick Edgecombe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b89385e5c7f4c3e5bc97045ec909455c33652fb1.camel@intel.com \
--to=kai.huang@intel.com \
--cc=dmatlack@google.com \
--cc=erdemaktas@google.com \
--cc=isaku.yamahata@gmail.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=rick.p.edgecombe@intel.com \
--cc=sagis@google.com \
--cc=seanjc@google.com \
--cc=yan.y.zhao@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox