* [PATCH V2 0/1] KVM: TDX: Decrease TDX VM shutdown time
From: Adrian Hunter @ 2025-04-17 13:19 UTC
To: pbonzini, seanjc
Cc: mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
Hi
The version 1 RFC:
https://lore.kernel.org/all/20250313181629.17764-1-adrian.hunter@intel.com/
listed 3 options and implemented option 2. Sean replied with code for
option 1, which tested out OK, so here it is plus a commit log.
It depends upon kvm_trylock_all_vcpus(kvm), which is the assumed result
of Maxim's work-in-progress here:
https://lore.kernel.org/all/20250409014136.2816971-1-mlevitsk@redhat.com/
Note it is assumed that kvm_trylock_all_vcpus(kvm) follows the return value
semantics of mutex_trylock(), i.e. 1 means all locks have been successfully
acquired and 0 means they have not.
Sean Christopherson (1):
KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Documentation/virt/kvm/x86/intel-tdx.rst | 16 ++++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 63 ++++++++++++++++++++++----------
3 files changed, 61 insertions(+), 19 deletions(-)
Regards
Adrian
* [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Adrian Hunter @ 2025-04-17 13:19 UTC
To: pbonzini, seanjc
Cc: mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
From: Sean Christopherson <seanjc@google.com>
Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
which enables more efficient reclaim of private memory.
Private memory is removed from MMU/TDP when guest_memfds are closed. If
the HKID has not been released, the TDX VM is still in RUNNABLE state,
so pages must be removed using the "Dynamic Page Removal" procedure (refer
to the TDX Module Base spec), which involves a number of steps:
Block further address translation
Exit each VCPU
Clear Secure EPT entry
Flush/write-back/invalidate relevant caches
However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
where all TDX VM pages are effectively unmapped, so pages can be reclaimed
directly.
Reclaiming TD pages in the TD_TEARDOWN state was seen to decrease the total
reclaim time. For example:
VCPUs Size (GB) Before (secs) After (secs)
4 18 72 24
32 107 517 134
64 400 5539 467
[Adrian: wrote commit message, added KVM_TDX_TERMINATE_VM documentation,
and moved cpus_read_lock() inside kvm->lock for consistency as reported
by lockdep]
Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
Documentation/virt/kvm/x86/intel-tdx.rst | 16 ++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 63 +++++++++++++++++-------
3 files changed, 61 insertions(+), 19 deletions(-)
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index de41d4c01e5c..e5d4d9cf4cf2 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
@@ -214,6 +215,21 @@ struct kvm_cpuid2.
__u32 padding[3];
};
+KVM_TDX_TERMINATE_VM
+--------------------
+:Type: vm ioctl
+:Returns: 0 on success, <0 on error
+
+Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
+After this, the TD is no longer in a runnable state.
+
+Using KVM_TDX_TERMINATE_VM is optional.
+
+- id: KVM_TDX_TERMINATE_VM
+- flags: must be 0
+- data: must be 0
+- hw_error: must be 0
+
KVM TDX creation flow
=====================
In addition to the standard KVM flow, new TDX ioctls need to be called. The
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 225a12e0d5d6..a2f973e1d75d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -939,6 +939,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..5763a70d1ec6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
}
-void tdx_mmu_release_hkid(struct kvm *kvm)
+static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
{
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
if (!is_hkid_assigned(kvm_tdx))
return;
+ if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate, kvm))
+ return;
+
packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
- cpus_read_lock();
-
kvm_for_each_vcpu(j, vcpu, kvm)
tdx_flush_vp_on_cpu(vcpu);
@@ -500,14 +501,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
mutex_lock(&tdx_lock);
- /*
- * Releasing HKID is in vm_destroy().
- * After the above flushing vps, there should be no more vCPU
- * associations, as all vCPU fds have been released at this stage.
- */
err = tdh_mng_vpflushdone(&kvm_tdx->td);
- if (err == TDX_FLUSHVP_NOT_DONE)
- goto out;
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
@@ -515,6 +509,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
goto out;
}
+ write_lock(&kvm->mmu_lock);
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,14 +534,20 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
} else {
tdx_hkid_free(kvm_tdx);
}
-
+ write_unlock(&kvm->mmu_lock);
out:
mutex_unlock(&tdx_lock);
- cpus_read_unlock();
free_cpumask_var(targets);
free_cpumask_var(packages);
}
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ cpus_read_lock();
+ __tdx_release_hkid(kvm, false);
+ cpus_read_unlock();
+}
+
static void tdx_reclaim_td_control_pages(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1789,13 +1790,13 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
+ WARN_ON_ONCE(!kvm->vm_dead);
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+ return ret;
+ }
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -2790,6 +2791,27 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return 0;
}
+static int tdx_terminate_vm(struct kvm *kvm)
+{
+ int r = 0;
+
+ guard(mutex)(&kvm->lock);
+ cpus_read_lock();
+
+ if (!kvm_trylock_all_vcpus(kvm)) {
+ r = -EBUSY;
+ goto out;
+ }
+
+ kvm_vm_dead(kvm);
+ kvm_unlock_all_vcpus(kvm);
+
+ __tdx_release_hkid(kvm, true);
+out:
+ cpus_read_unlock();
+ return r;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -2805,6 +2827,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
if (tdx_cmd.hw_error)
return -EINVAL;
+ if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
+ return tdx_terminate_vm(kvm);
+
mutex_lock(&kvm->lock);
switch (tdx_cmd.id) {
--
2.43.0
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Vishal Annapurve @ 2025-04-19 0:34 UTC
To: Adrian Hunter
Cc: pbonzini, seanjc, mlevitsk, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
tony.lindgren, binbin.wu, isaku.yamahata, linux-kernel,
yan.y.zhao, chao.gao
On Thu, Apr 17, 2025 at 6:20 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> ...
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> + int r = 0;
> +
> + guard(mutex)(&kvm->lock);
> + cpus_read_lock();
> +
> + if (!kvm_trylock_all_vcpus(kvm)) {
Does this need to be a trylock variant? Is userspace expected to keep
retrying this operation indefinitely?
> + r = -EBUSY;
> + goto out;
> + }
> +
> + kvm_vm_dead(kvm);
> + kvm_unlock_all_vcpus(kvm);
> +
> + __tdx_release_hkid(kvm, true);
> +out:
> + cpus_read_unlock();
> + return r;
> +}
> +
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Sean Christopherson @ 2025-04-19 1:08 UTC
To: Vishal Annapurve
Cc: Adrian Hunter, pbonzini, mlevitsk, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
tony.lindgren, binbin.wu, isaku.yamahata, linux-kernel,
yan.y.zhao, chao.gao
On Fri, Apr 18, 2025, Vishal Annapurve wrote:
> On Thu, Apr 17, 2025 at 6:20 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >
> > ...
> > +static int tdx_terminate_vm(struct kvm *kvm)
> > +{
> > + int r = 0;
> > +
> > + guard(mutex)(&kvm->lock);
> > + cpus_read_lock();
> > +
> > + if (!kvm_trylock_all_vcpus(kvm)) {
>
> Does this need to be a trylock variant? Is userspace expected to keep
> retrying this operation indefinitely?
Userspace is expected to not be stupid, i.e. not be doing things with vCPUs when
terminating the VM. This is already rather unpleasant, I'd rather not have to
think hard about what could go wrong if KVM has to wait on all vCPU mutexes.
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Sean Christopherson @ 2025-04-19 1:12 UTC
To: Adrian Hunter
Cc: pbonzini, mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Thu, Apr 17, 2025, Adrian Hunter wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
> which enables more efficient reclaim of private memory.
>
> Private memory is removed from MMU/TDP when guest_memfds are closed. If
> the HKID has not been released, the TDX VM is still in RUNNABLE state,
> so pages must be removed using "Dynamic Page Removal" procedure (refer
> TDX Module Base spec) which involves a number of steps:
> Block further address translation
> Exit each VCPU
> Clear Secure EPT entry
> Flush/write-back/invalidate relevant caches
>
> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
> directly.
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs Size (GB) Before (secs) After (secs)
> 4 18 72 24
> 32 107 517 134
> 64 400 5539 467
>
> [Adrian: wrote commit message, added KVM_TDX_TERMINATE_VM documentation,
> and moved cpus_read_lock() inside kvm->lock for consistency as reported
> by lockdep]
/facepalm
I over-thought this. We've had a long-standing battle with kvm_lock vs.
cpus_read_lock(), but this is kvm->lock, not kvm_lock. /sigh
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> + int r = 0;
> +
> + guard(mutex)(&kvm->lock);
With kvm->lock taken outside cpus_read_lock(), just handle KVM_TDX_TERMINATE_VM
in the switch statement, i.e. let tdx_vm_ioctl() deal with kvm->lock.
> + cpus_read_lock();
> +
> + if (!kvm_trylock_all_vcpus(kvm)) {
> + r = -EBUSY;
> + goto out;
> + }
> +
> + kvm_vm_dead(kvm);
> + kvm_unlock_all_vcpus(kvm);
> +
> + __tdx_release_hkid(kvm, true);
> +out:
> + cpus_read_unlock();
> + return r;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -2805,6 +2827,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> if (tdx_cmd.hw_error)
> return -EINVAL;
>
> + if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
> + return tdx_terminate_vm(kvm);
> +
> mutex_lock(&kvm->lock);
>
> switch (tdx_cmd.id) {
> --
> 2.43.0
>
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Adrian Hunter @ 2025-04-22 7:30 UTC
To: Sean Christopherson, Vishal Annapurve
Cc: pbonzini, mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 19/04/25 04:08, Sean Christopherson wrote:
> On Fri, Apr 18, 2025, Vishal Annapurve wrote:
>> On Thu, Apr 17, 2025 at 6:20 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>
>>> ...
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> + int r = 0;
>>> +
>>> + guard(mutex)(&kvm->lock);
>>> + cpus_read_lock();
>>> +
>>> + if (!kvm_trylock_all_vcpus(kvm)) {
>>
>> Does this need to be a trylock variant? Is userspace expected to keep
>> retrying this operation indefinitely?
No issue was seen when testing with a QEMU hack that did not retry.
Presumably, if user space is not doing anything else with the TDX VM at
the same time, there should be no contention.
KVM_TDX_TERMINATE_VM is optional, so it is not necessary to wait
indefinitely.
> Userspace is expected to not be stupid, i.e. not be doing things with vCPUs when
> terminating the VM. This is already rather unpleasant, I'd rather not have to
> think hard about what could go wrong if KVM has to wait on all vCPU mutexes.
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Adrian Hunter @ 2025-04-22 8:13 UTC
To: Sean Christopherson
Cc: pbonzini, mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 19/04/25 04:12, Sean Christopherson wrote:
> On Thu, Apr 17, 2025, Adrian Hunter wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
>> which enables more efficient reclaim of private memory.
>>
>> Private memory is removed from MMU/TDP when guest_memfds are closed. If
>> the HKID has not been released, the TDX VM is still in RUNNABLE state,
>> so pages must be removed using "Dynamic Page Removal" procedure (refer
>> TDX Module Base spec) which involves a number of steps:
>> Block further address translation
>> Exit each VCPU
>> Clear Secure EPT entry
>> Flush/write-back/invalidate relevant caches
>>
>> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
>> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
>> directly.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>> VCPUs Size (GB) Before (secs) After (secs)
>> 4 18 72 24
>> 32 107 517 134
>> 64 400 5539 467
>>
>> [Adrian: wrote commit message, added KVM_TDX_TERMINATE_VM documentation,
>> and moved cpus_read_lock() inside kvm->lock for consistency as reported
>> by lockdep]
>
> /facepalm
>
> I over-thought this. We've had an long-standing battle with kvm_lock vs.
> cpus_read_lock(), but this is kvm->lock, not kvm_lock. /sigh
>
>> +static int tdx_terminate_vm(struct kvm *kvm)
>> +{
>> + int r = 0;
>> +
>> + guard(mutex)(&kvm->lock);
>
> With kvm->lock taken outside cpus_read_lock(), just handle KVM_TDX_TERMINATE_VM
> in the switch statement, i.e. let tdx_vm_ioctl() deal with kvm->lock.
Ok, also cpus_read_lock() can go back where it was in __tdx_release_hkid().
But also in __tdx_release_hkid(), there is
if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate, kvm))
return;
However, __tdx_td_init() calls tdx_mmu_release_hkid() on the
error path so that is not correct.
>
>> + cpus_read_lock();
>> +
>> + if (!kvm_trylock_all_vcpus(kvm)) {
>> + r = -EBUSY;
>> + goto out;
>> + }
>> +
>> + kvm_vm_dead(kvm);
>> + kvm_unlock_all_vcpus(kvm);
>> +
>> + __tdx_release_hkid(kvm, true);
>> +out:
>> + cpus_read_unlock();
>> + return r;
>> +}
>> +
>> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>> {
>> struct kvm_tdx_cmd tdx_cmd;
>> @@ -2805,6 +2827,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>> if (tdx_cmd.hw_error)
>> return -EINVAL;
>>
>> + if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
>> + return tdx_terminate_vm(kvm);
>> +
>> mutex_lock(&kvm->lock);
>>
>> switch (tdx_cmd.id) {
>> --
>> 2.43.0
>>
* Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Adrian Hunter @ 2025-04-22 9:37 UTC
To: Sean Christopherson
Cc: pbonzini, mlevitsk, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 22/04/25 11:13, Adrian Hunter wrote:
> On 19/04/25 04:12, Sean Christopherson wrote:
>> On Thu, Apr 17, 2025, Adrian Hunter wrote:
>>> From: Sean Christopherson <seanjc@google.com>
>>>
>>> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
>>> which enables more efficient reclaim of private memory.
>>>
>>> Private memory is removed from MMU/TDP when guest_memfds are closed. If
>>> the HKID has not been released, the TDX VM is still in RUNNABLE state,
>>> so pages must be removed using "Dynamic Page Removal" procedure (refer
>>> TDX Module Base spec) which involves a number of steps:
>>> Block further address translation
>>> Exit each VCPU
>>> Clear Secure EPT entry
>>> Flush/write-back/invalidate relevant caches
>>>
>>> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
>>> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
>>> directly.
>>>
>>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>>> reclaim time. For example:
>>>
>>> VCPUs Size (GB) Before (secs) After (secs)
>>> 4 18 72 24
>>> 32 107 517 134
>>> 64 400 5539 467
>>>
>>> [Adrian: wrote commit message, added KVM_TDX_TERMINATE_VM documentation,
>>> and moved cpus_read_lock() inside kvm->lock for consistency as reported
>>> by lockdep]
>>
>> /facepalm
>>
>> I over-thought this. We've had an long-standing battle with kvm_lock vs.
>> cpus_read_lock(), but this is kvm->lock, not kvm_lock. /sigh
>>
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> + int r = 0;
>>> +
>>> + guard(mutex)(&kvm->lock);
>>
>> With kvm->lock taken outside cpus_read_lock(), just handle KVM_TDX_TERMINATE_VM
>> in the switch statement, i.e. let tdx_vm_ioctl() deal with kvm->lock.
>
> Ok, also cpus_read_lock() can go back where it was in __tdx_release_hkid().
>
> But also in __tdx_release_hkid(), there is
>
> if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate, kvm))
> return;
>
> However, __tdx_td_init() calls tdx_mmu_release_hkid() on the
> error path so that is not correct.
>
So it ends up like this:
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index de41d4c01e5c..e5d4d9cf4cf2 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
@@ -214,6 +215,21 @@ struct kvm_cpuid2.
__u32 padding[3];
};
+KVM_TDX_TERMINATE_VM
+--------------------
+:Type: vm ioctl
+:Returns: 0 on success, <0 on error
+
+Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
+After this, the TD is no longer in a runnable state.
+
+Using KVM_TDX_TERMINATE_VM is optional.
+
+- id: KVM_TDX_TERMINATE_VM
+- flags: must be 0
+- data: must be 0
+- hw_error: must be 0
+
KVM TDX creation flow
=====================
In addition to the standard KVM flow, new TDX ioctls need to be called. The
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 225a12e0d5d6..a2f973e1d75d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -939,6 +939,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..5161f6f891d7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -500,14 +500,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
mutex_lock(&tdx_lock);
- /*
- * Releasing HKID is in vm_destroy().
- * After the above flushing vps, there should be no more vCPU
- * associations, as all vCPU fds have been released at this stage.
- */
err = tdh_mng_vpflushdone(&kvm_tdx->td);
- if (err == TDX_FLUSHVP_NOT_DONE)
- goto out;
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
@@ -515,6 +508,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
goto out;
}
+ write_lock(&kvm->mmu_lock);
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,7 +533,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
} else {
tdx_hkid_free(kvm_tdx);
}
-
+ write_unlock(&kvm->mmu_lock);
out:
mutex_unlock(&tdx_lock);
cpus_read_unlock();
@@ -1789,13 +1783,13 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
+ WARN_ON_ONCE(!kvm->vm_dead);
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+ return ret;
+ }
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -2790,6 +2784,20 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return 0;
}
+static int tdx_terminate_vm(struct kvm *kvm)
+{
+ if (!kvm_trylock_all_vcpus(kvm))
+ return -EBUSY;
+
+ kvm_vm_dead(kvm);
+
+ kvm_unlock_all_vcpus(kvm);
+
+ tdx_mmu_release_hkid(kvm);
+
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -2817,6 +2825,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_FINALIZE_VM:
r = tdx_td_finalize(kvm, &tdx_cmd);
break;
+ case KVM_TDX_TERMINATE_VM:
+ r = tdx_terminate_vm(kvm);
+ break;
default:
r = -EINVAL;
goto out;