All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Adrian Hunter <adrian.hunter@intel.com>
Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	rick.p.edgecombe@intel.com,  kirill.shutemov@linux.intel.com,
	kai.huang@intel.com,  reinette.chatre@intel.com,
	xiaoyao.li@intel.com,  tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com,  isaku.yamahata@intel.com,
	linux-kernel@vger.kernel.org, yan.y.zhao@intel.com,
	 chao.gao@intel.com
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time
Date: Thu, 27 Mar 2025 08:54:19 -0700	[thread overview]
Message-ID: <Z-V0qyTn2bXdrPF7@google.com> (raw)
In-Reply-To: <20250313181629.17764-1-adrian.hunter@intel.com>

On Thu, Mar 13, 2025, Adrian Hunter wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown.  This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming.  This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.

...

> == Options ==
> 
>   1. Start TD teardown earlier so that when pages are removed,
>   they can be reclaimed faster.
>   2. Defer page removal until after TD teardown has started.
>   3. A combination of 1 and 2.
> 
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.

Option 2 is just as gross, arguably even worse.  I NAK'd a flavor of option 1,
not the base concept of initiating teardown before all references to the VM are
put.

AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
The locking is a bit heinous, but I would prefer heavy locking to deferring
reclaim and pinning inodes.

Oh FFS.  This is also an opportunity to cleanup RISC-V's insidious copy-paste of
ARM.  Because extracting (un)lock_all_vcpus() to common code would have been sooo
hard.  *sigh*

Very roughly, something like the below (*completely* untested).

An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
think I prefer taking mmu_lock is it's easier to reason about the safety of freeing
the HKID.  Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
but that's a-ok.  The only part of PUNCH_HOLE that needs to be blocked is the call
to kvm_mmu_unmap_gfn_range().

---
 arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 87f188021cbd..6fb595c272ab 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
 		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
 }
 
-void tdx_mmu_release_hkid(struct kvm *kvm)
+static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
 {
 	bool packages_allocated, targets_allocated;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	if (!is_hkid_assigned(kvm_tdx))
 		return;
 
+	if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate))
+		return;
+
 	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
 	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
-	cpus_read_lock();
-
 	kvm_for_each_vcpu(j, vcpu, kvm)
 		tdx_flush_vp_on_cpu(vcpu);
 
@@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	 */
 	mutex_lock(&tdx_lock);
 
-	/*
-	 * Releasing HKID is in vm_destroy().
-	 * After the above flushing vps, there should be no more vCPU
-	 * associations, as all vCPU fds have been released at this stage.
-	 */
 	err = tdh_mng_vpflushdone(&kvm_tdx->td);
+	/* Uh, what's going on here? */
 	if (err == TDX_FLUSHVP_NOT_DONE)
 		goto out;
 	if (KVM_BUG_ON(err, kvm)) {
@@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 		goto out;
 	}
 
+	write_lock(&kvm->mmu_lock);
 	for_each_online_cpu(i) {
 		if (packages_allocated &&
 		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	} else {
 		tdx_hkid_free(kvm_tdx);
 	}
-
+	write_unlock(&kvm->mmu_lock);
 out:
 	mutex_unlock(&tdx_lock);
-	cpus_read_unlock();
 	free_cpumask_var(targets);
 	free_cpumask_var(packages);
 }
 
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+	cpus_read_lock();
+	__tdx_release_hkid(kvm, false);
+	cpus_read_unlock();
+}
+
+
 static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm)), kvm) {
+		WARN_ON_ONCE(!kvm->vm_dead);
+		return tdx_reclaim_page(pfn_to_page(pfn));
+	}
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
@@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+static int tdx_td_terminate(struct kvm *kvm)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	int bkt;
+
+	cpus_read_lock();
+	guard(mutex)(&kvm->lock);
+
+	r = kvm_lock_all_vcpus();
+	if (r)
+		goto out;
+
+	kvm_vm_dead(kvm);
+	kvm_unlock_all_vcpus();
+
+	__tdx_release_hkid(kvm);
+out:
+	cpus_read_unlock();
+	return r;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	if (tdx_cmd.hw_error)
 		return -EINVAL;
 
+	if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
+		return tdx_td_terminate(kvm);
+
 	mutex_lock(&kvm->lock);
 
 	switch (tdx_cmd.id) {

base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0
-- 

  parent reply	other threads:[~2025-03-27 15:54 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-13 18:16 [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time Adrian Hunter
2025-03-13 18:39 ` Paolo Bonzini
2025-03-13 19:07   ` Adrian Hunter
2025-03-17  8:13 ` Kirill A. Shutemov
2025-03-18 15:41   ` Adrian Hunter
2025-03-18 21:11     ` Vishal Annapurve
2025-03-20  0:42 ` Vishal Annapurve
2025-03-24 10:40   ` Adrian Hunter
2025-03-27  8:14 ` Vishal Annapurve
2025-03-27 10:10   ` Adrian Hunter
2025-03-27 15:54 ` Sean Christopherson [this message]
2025-03-28 10:26   ` Adrian Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z-V0qyTn2bXdrPF7@google.com \
    --to=seanjc@google.com \
    --cc=adrian.hunter@intel.com \
    --cc=binbin.wu@linux.intel.com \
    --cc=chao.gao@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=kai.huang@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=reinette.chatre@intel.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=tony.lindgren@linux.intel.com \
    --cc=xiaoyao.li@intel.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.