* [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
@ 2025-06-11 9:51 Adrian Hunter
2025-06-11 9:51 ` [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Adrian Hunter
2025-06-25 22:25 ` [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Sean Christopherson
0 siblings, 2 replies; 43+ messages in thread
From: Adrian Hunter @ 2025-06-11 9:51 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
Hi
Changes in V4:
Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
Use KVM_BUG_ON() instead of WARN_ON().
Correct kvm_trylock_all_vcpus() return value.
Changes in V3:
Refer:
https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
trigger on the error path from __tdx_td_init()
Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
tdx_vm_ioctl() deal with kvm->lock
The version 1 RFC:
https://lore.kernel.org/all/20250313181629.17764-1-adrian.hunter@intel.com/
listed 3 options and implemented option 2. Sean replied with code for
option 1, which tested out OK, so here it is plus a commit log.
It depends upon kvm_trylock_all_vcpus(kvm) which is now upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4a454ced74c0ac97c8bd32f086ee3ad74528780
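For reference, userspace issues the new sub-ioctl through the existing TDX
vm ioctl path, i.e. KVM_MEMORY_ENCRYPT_OP with a struct kvm_tdx_cmd. A
minimal sketch (helper name illustrative, error handling trimmed):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: release the HKID before tearing the TD down. */
static int tdx_terminate_td(int vm_fd)
{
	struct kvm_tdx_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = KVM_TDX_TERMINATE_VM;	/* flags, data and hw_error must be 0 */

	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}

Once this returns 0 the TD is no longer runnable, and pages removed
afterwards (e.g. when the guest_memfds and VM fd are closed) are reclaimed
via the TD_TEARDOWN path.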
Sean Christopherson (1):
KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Documentation/virt/kvm/x86/intel-tdx.rst | 16 +++++++++++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 34 ++++++++++++++++++++++++--------
3 files changed, 43 insertions(+), 8 deletions(-)
Regards
Adrian
* [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-11 9:51 [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Adrian Hunter
@ 2025-06-11 9:51 ` Adrian Hunter
2025-06-16 3:40 ` Vishal Annapurve
2025-06-23 20:40 ` Vishal Annapurve
2025-06-25 22:25 ` [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Sean Christopherson
1 sibling, 2 replies; 43+ messages in thread
From: Adrian Hunter @ 2025-06-11 9:51 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
From: Sean Christopherson <seanjc@google.com>
Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
which enables more efficient reclaim of private memory.
Private memory is removed from MMU/TDP when guest_memfds are closed. If
the HKID has not been released, the TDX VM is still in RUNNABLE state,
so pages must be removed using "Dynamic Page Removal" procedure (refer
TDX Module Base spec) which involves a number of steps:
Block further address translation
Exit each VCPU
Clear Secure EPT entry
Flush/write-back/invalidate relevant caches
However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
where all TDX VM pages are effectively unmapped, so pages can be reclaimed
directly.
Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
reclaim time. For example:
 VCPUs  Size (GB)  Before (secs)  After (secs)
     4         18             72            24
    32        107            517           134
    64        400           5539           467
Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
Changes in V4:
Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
Use KVM_BUG_ON() instead of WARN_ON().
Correct kvm_trylock_all_vcpus() return value.
Changes in V3:
Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
trigger on the error path from __tdx_td_init()
Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
tdx_vm_ioctl() deal with kvm->lock
Documentation/virt/kvm/x86/intel-tdx.rst | 16 +++++++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 34 ++++++++++++++++++------
3 files changed, 43 insertions(+), 8 deletions(-)
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index de41d4c01e5c..e5d4d9cf4cf2 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
@@ -214,6 +215,21 @@ struct kvm_cpuid2.
__u32 padding[3];
};
+KVM_TDX_TERMINATE_VM
+-------------------
+:Type: vm ioctl
+:Returns: 0 on success, <0 on error
+
+Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
+After this, the TD is no longer in a runnable state.
+
+Using KVM_TDX_TERMINATE_VM is optional.
+
+- id: KVM_TDX_TERMINATE_VM
+- flags: must be 0
+- data: must be 0
+- hw_error: must be 0
+
KVM TDX creation flow
=====================
In addition to the standard KVM flow, new TDX ioctls need to be called. The
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 6f3499507c5e..697d396b2cc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -940,6 +940,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
+ KVM_TDX_TERMINATE_VM,
KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..457f91b95147 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -515,6 +515,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
goto out;
}
+ write_lock(&kvm->mmu_lock);
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,7 +540,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
} else {
tdx_hkid_free(kvm_tdx);
}
-
+ write_unlock(&kvm->mmu_lock);
out:
mutex_unlock(&tdx_lock);
cpus_read_unlock();
@@ -1789,13 +1790,13 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
+ KVM_BUG_ON(!kvm->vm_dead, kvm);
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+ return ret;
+ }
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -2790,6 +2791,20 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return 0;
}
+static int tdx_terminate_vm(struct kvm *kvm)
+{
+ if (kvm_trylock_all_vcpus(kvm))
+ return -EBUSY;
+
+ kvm_vm_dead(kvm);
+
+ kvm_unlock_all_vcpus(kvm);
+
+ tdx_mmu_release_hkid(kvm);
+
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -2817,6 +2832,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_FINALIZE_VM:
r = tdx_td_finalize(kvm, &tdx_cmd);
break;
+ case KVM_TDX_TERMINATE_VM:
+ r = tdx_terminate_vm(kvm);
+ break;
default:
r = -EINVAL;
goto out;
--
2.48.1
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-11 9:51 ` [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Adrian Hunter
@ 2025-06-16 3:40 ` Vishal Annapurve
2025-06-18 5:50 ` Adrian Hunter
2025-06-23 20:40 ` Vishal Annapurve
1 sibling, 1 reply; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-16 3:40 UTC (permalink / raw)
To: Adrian Hunter
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Wed, Jun 11, 2025 at 2:52 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
> which enables more efficient reclaim of private memory.
>
> Private memory is removed from MMU/TDP when guest_memfds are closed. If
> the HKID has not been released, the TDX VM is still in RUNNABLE state,
> so pages must be removed using "Dynamic Page Removal" procedure (refer
> TDX Module Base spec) which involves a number of steps:
> Block further address translation
> Exit each VCPU
> Clear Secure EPT entry
> Flush/write-back/invalidate relevant caches
>
> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
> directly.
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs Size (GB) Before (secs) After (secs)
> 4 18 72 24
> 32 107 517 134
> 64 400 5539 467
>
> Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
> Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>
>
> Changes in V4:
>
> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> Use KVM_BUG_ON() instead of WARN_ON().
> Correct kvm_trylock_all_vcpus() return value.
>
> Changes in V3:
>
> Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
> trigger on the error path from __tdx_td_init()
>
> Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
>
> Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
> tdx_vm_ioctl() deal with kvm->lock
> ....
>
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> + if (kvm_trylock_all_vcpus(kvm))
> + return -EBUSY;
> +
> + kvm_vm_dead(kvm);
With this no more VM ioctls can be issued on this instance. How would
userspace VMM clean up the memslots? Is the expectation that
guest_memfd and VM fds are closed to actually reclaim the memory?
Ability to clean up memslots from userspace without closing
VM/guest_memfd handles is useful to keep reusing the same guest_memfds
for the next boot iteration of the VM in case of reboot.
> +
> + kvm_unlock_all_vcpus(kvm);
> +
> + tdx_mmu_release_hkid(kvm);
> +
> + return 0;
> +}
> +
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-16 3:40 ` Vishal Annapurve
@ 2025-06-18 5:50 ` Adrian Hunter
2025-06-18 6:00 ` Vishal Annapurve
0 siblings, 1 reply; 43+ messages in thread
From: Adrian Hunter @ 2025-06-18 5:50 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 16/06/2025 06:40, Vishal Annapurve wrote:
> On Wed, Jun 11, 2025 at 2:52 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> From: Sean Christopherson <seanjc@google.com>
>>
>> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
>> which enables more efficient reclaim of private memory.
>>
>> Private memory is removed from MMU/TDP when guest_memfds are closed. If
>> the HKID has not been released, the TDX VM is still in RUNNABLE state,
>> so pages must be removed using "Dynamic Page Removal" procedure (refer
>> TDX Module Base spec) which involves a number of steps:
>> Block further address translation
>> Exit each VCPU
>> Clear Secure EPT entry
>> Flush/write-back/invalidate relevant caches
>>
>> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
>> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
>> directly.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>> VCPUs Size (GB) Before (secs) After (secs)
>> 4 18 72 24
>> 32 107 517 134
>> 64 400 5539 467
>>
>> Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
>> Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>>
>>
>> Changes in V4:
>>
>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>> Use KVM_BUG_ON() instead of WARN_ON().
>> Correct kvm_trylock_all_vcpus() return value.
>>
>> Changes in V3:
>>
>> Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
>> trigger on the error path from __tdx_td_init()
>>
>> Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
>>
>> Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
>> tdx_vm_ioctl() deal with kvm->lock
>> ....
>>
>> +static int tdx_terminate_vm(struct kvm *kvm)
>> +{
>> + if (kvm_trylock_all_vcpus(kvm))
>> + return -EBUSY;
>> +
>> + kvm_vm_dead(kvm);
>
> With this no more VM ioctls can be issued on this instance. How would
> userspace VMM clean up the memslots? Is the expectation that
> guest_memfd and VM fds are closed to actually reclaim the memory?
Yes
>
> Ability to clean up memslots from userspace without closing
> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> for the next boot iteration of the VM in case of reboot.
TD lifecycle does not include reboot. In other words, reboot is
done by shutting down the TD and then starting again with a new TD.
AFAIK it is not currently possible to shut down without closing
guest_memfds since the guest_memfd holds a reference (users_count)
to struct kvm, and destruction begins when users_count hits zero.
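For reference, a simplified paraphrase of that lifetime rule, based on my
reading of virt/kvm/guest_memfd.c and virt/kvm/kvm_main.c (file/inode setup
and error handling trimmed):

/* Creating a guest_memfd pins the VM ... */
static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
	struct kvm_gmem *gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);

	if (!gmem)
		return -ENOMEM;

	kvm_get_kvm(kvm);		/* bumps kvm->users_count */
	gmem->kvm = kvm;
	/* anon inode and file setup trimmed */
	return 0;
}

/*
 * ... and only the final close of the file drops that pin, so
 * kvm_destroy_vm() cannot start while a guest_memfd is still open.
 */
static int kvm_gmem_release(struct inode *inode, struct file *file)
{
	struct kvm_gmem *gmem = file->private_data;

	kvm_put_kvm(gmem->kvm);		/* destruction begins at users_count == 0 */
	return 0;
}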
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-18 5:50 ` Adrian Hunter
@ 2025-06-18 6:00 ` Vishal Annapurve
2025-06-18 8:33 ` Adrian Hunter
2025-06-18 22:07 ` Edgecombe, Rick P
0 siblings, 2 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-18 6:00 UTC (permalink / raw)
To: Adrian Hunter
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> ...
> >>
> >> Changes in V4:
> >>
> >> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> >> Use KVM_BUG_ON() instead of WARN_ON().
> >> Correct kvm_trylock_all_vcpus() return value.
> >>
> >> Changes in V3:
> >>
> >> Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
> >> trigger on the error path from __tdx_td_init()
> >>
> >> Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
> >>
> >> Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
> >> tdx_vm_ioctl() deal with kvm->lock
> >> ....
> >>
> >> +static int tdx_terminate_vm(struct kvm *kvm)
> >> +{
> >> + if (kvm_trylock_all_vcpus(kvm))
> >> + return -EBUSY;
> >> +
> >> + kvm_vm_dead(kvm);
> >
> > With this no more VM ioctls can be issued on this instance. How would
> > userspace VMM clean up the memslots? Is the expectation that
> > guest_memfd and VM fds are closed to actually reclaim the memory?
>
> Yes
>
> >
> > Ability to clean up memslots from userspace without closing
> > VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> > for the next boot iteration of the VM in case of reboot.
>
> TD lifecycle does not include reboot. In other words, reboot is
> done by shutting down the TD and then starting again with a new TD.
>
> AFAIK it is not currently possible to shut down without closing
> guest_memfds since the guest_memfd holds a reference (users_count)
> to struct kvm, and destruction begins when users_count hits zero.
>
gmem link support[1] allows associating existing guest_memfds with new
VM instances.
Breakdown of the userspace VMM flow:
1) Create a new VM instance before closing guest_memfd files.
2) Link existing guest_memfd files with the new VM instance. -> This
creates new set of files backed by the same inode but associated with
the new VM instance.
3) Close the older guest memfd handles -> results in older VM instance cleanup.
[1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-18 6:00 ` Vishal Annapurve
@ 2025-06-18 8:33 ` Adrian Hunter
2025-06-19 0:33 ` Sean Christopherson
2025-06-18 22:07 ` Edgecombe, Rick P
1 sibling, 1 reply; 43+ messages in thread
From: Adrian Hunter @ 2025-06-18 8:33 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 18/06/2025 09:00, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> ...
>>>>
>>>> Changes in V4:
>>>>
>>>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>>>> Use KVM_BUG_ON() instead of WARN_ON().
>>>> Correct kvm_trylock_all_vcpus() return value.
>>>>
>>>> Changes in V3:
>>>>
>>>> Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
>>>> trigger on the error path from __tdx_td_init()
>>>>
>>>> Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
>>>>
>>>> Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
>>>> tdx_vm_ioctl() deal with kvm->lock
>>>> ....
>>>>
>>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>>> +{
>>>> + if (kvm_trylock_all_vcpus(kvm))
>>>> + return -EBUSY;
>>>> +
>>>> + kvm_vm_dead(kvm);
>>>
>>> With this no more VM ioctls can be issued on this instance. How would
>>> userspace VMM clean up the memslots? Is the expectation that
>>> guest_memfd and VM fds are closed to actually reclaim the memory?
>>
>> Yes
>>
>>>
>>> Ability to clean up memslots from userspace without closing
>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
>>> for the next boot iteration of the VM in case of reboot.
>>
>> TD lifecycle does not include reboot. In other words, reboot is
>> done by shutting down the TD and then starting again with a new TD.
>>
>> AFAIK it is not currently possible to shut down without closing
>> guest_memfds since the guest_memfd holds a reference (users_count)
>> to struct kvm, and destruction begins when users_count hits zero.
>>
>
> gmem link support[1] allows associating existing guest_memfds with new
> VM instances.
>
> Breakdown of the userspace VMM flow:
> 1) Create a new VM instance before closing guest_memfd files.
> 2) Link existing guest_memfd files with the new VM instance. -> This
> creates new set of files backed by the same inode but associated with
> the new VM instance.
So what about:
2.5) Call KVM_TDX_TERMINATE_VM IOCTL
Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
so avoid causing it to be reclaimed earlier.
> 3) Close the older guest memfd handles -> results in older VM instance cleanup.
>
> [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-18 6:00 ` Vishal Annapurve
2025-06-18 8:33 ` Adrian Hunter
@ 2025-06-18 22:07 ` Edgecombe, Rick P
1 sibling, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-06-18 22:07 UTC (permalink / raw)
To: Annapurve, Vishal, Hunter, Adrian
Cc: Gao, Chao, seanjc@google.com, Huang, Kai,
binbin.wu@linux.intel.com, Chatre, Reinette, Li, Xiaoyao,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Tue, 2025-06-17 at 23:00 -0700, Vishal Annapurve wrote:
> gmem link support[1] allows associating existing guest_memfds with new
> VM instances.
>
> Breakdown of the userspace VMM flow:
> 1) Create a new VM instance before closing guest_memfd files.
> 2) Link existing guest_memfd files with the new VM instance. -> This
> creates new set of files backed by the same inode but associated with
> the new VM instance.
> 3) Close the older guest memfd handles -> results in older VM instance cleanup.
>
> [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
Is it intended to even work for TDX private memory? It seems to be SEV focused.
I think we really need to come to an agreement on how much to design code around
future ideas vs a more iterative approach. I had thought we had...
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-18 8:33 ` Adrian Hunter
@ 2025-06-19 0:33 ` Sean Christopherson
2025-06-19 11:12 ` Adrian Hunter
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-06-19 0:33 UTC (permalink / raw)
To: Adrian Hunter
Cc: Vishal Annapurve, pbonzini, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
tony.lindgren, binbin.wu, isaku.yamahata, linux-kernel,
yan.y.zhao, chao.gao
On Wed, Jun 18, 2025, Adrian Hunter wrote:
> On 18/06/2025 09:00, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>> Ability to clean up memslots from userspace without closing
> >>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> >>> for the next boot iteration of the VM in case of reboot.
> >>
> >> TD lifecycle does not include reboot. In other words, reboot is
> >> done by shutting down the TD and then starting again with a new TD.
> >>
> >> AFAIK it is not currently possible to shut down without closing
> >> guest_memfds since the guest_memfd holds a reference (users_count)
> >> to struct kvm, and destruction begins when users_count hits zero.
> >>
> >
> > gmem link support[1] allows associating existing guest_memfds with new
> > VM instances.
> >
> > Breakdown of the userspace VMM flow:
> > 1) Create a new VM instance before closing guest_memfd files.
> > 2) Link existing guest_memfd files with the new VM instance. -> This
> > creates new set of files backed by the same inode but associated with
> > the new VM instance.
>
> So what about:
>
> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
>
> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
> so avoid causing it to be reclaimed earlier.
The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
device, but rather devices bound to the VM) ioctls.
I intended that behavior, e.g. to guard against userspace blowing up KVM because
the hkid was released, I just didn't consider the memslots angle.
The other thing I didn't consider at the time, is that vm_dead doesn't fully
protect against ioctls that are already in flight. E.g. see commit
17c80d19df6f ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation
is in-flight"). Though without a failure/splat of some kind, it's impossible to
to know if this is actually a problem. I.e. I don't think we should *entirely*
scrap blocking ioctls just because it *might* not be perfect (we can always make
KVM better).
I can think of a few options:
1. Skip the vm_dead check if userspace is deleting a memslot.
2. Provide a way for userspace to delete all memslots, and have that bypass
vm_dead.
3. Delete all memslots on KVM_TDX_TERMINATE_VM.
4. Remove vm_dead and instead reject ioctls based on vm_bugged, and simply rely
on KVM_REQ_VM_DEAD to prevent running the guest. I.e. tweak kvm_vm_dead()
to be that it only prevents running the VM, and have kvm_vm_bugged() be the
"something is badly broken, try to limit the damage".
I'm heavily leaning toward #4. #1 is doable, but painful. #2 is basically #1,
but with new uAPI. #3 is all kinds of gross, e.g. userspace might want to simply
kill the VM and move on. KVM would still block ioctls, but only if a bug was
detected. And the few use cases where KVM just wants to prevent entering the
guest won't prevent gracefully tearing down the VM.
Hah! And there's actually a TDX bug fix here, because "checking" for KVM_REQ_VM_DEAD
in kvm_tdp_map_page() and tdx_handle_ept_violation() will *clear* the request,
which isn't what we want, e.g. a vCPU could actually re-enter the guest at that
point.
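For context, a simplified sketch of the two request helpers, paraphrased
from include/linux/kvm_host.h (memory barriers omitted):

static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
	/* Only observes the pending bit; the request stays set. */
	return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
}

static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
{
	/*
	 * Test-and-clear: a KVM_REQ_VM_DEAD that is "checked" here is
	 * consumed, so a later attempt to enter the guest no longer sees it.
	 */
	if (!kvm_test_request(req, vcpu))
		return false;

	kvm_clear_request(req, vcpu);
	return true;
}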
This is what I'm thinking. If I don't hate it come Friday (or Monday), I'll turn
this patch into a mini-series and post v5.
Adrian, does that work for you?
---
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/vgic/vgic-init.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 2 --
virt/kvm/kvm_main.c | 10 +++++-----
5 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index de2b4e9c9f9f..18bd80388b59 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1017,7 +1017,7 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
static int check_vcpu_requests(struct kvm_vcpu *vcpu)
{
if (kvm_request_pending(vcpu)) {
- if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+ if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
return -EIO;
if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index eb1205654ac8..c2033bae73b2 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -612,7 +612,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
mutex_unlock(&kvm->arch.config_lock);
out_slots:
if (ret)
- kvm_vm_dead(kvm);
+ kvm_vm_bugged(kvm);
mutex_unlock(&kvm->slots_lock);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b58a74c1722d..37f835d77b65 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10783,7 +10783,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
bool req_immediate_exit = false;
if (kvm_request_pending(vcpu)) {
- if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
+ if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
r = -EIO;
goto out;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..56898e4ab524 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -853,7 +853,6 @@ struct kvm {
u32 dirty_ring_size;
bool dirty_ring_with_bitmap;
bool vm_bugged;
- bool vm_dead;
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
@@ -893,7 +892,6 @@ struct kvm {
static inline void kvm_vm_dead(struct kvm *kvm)
{
- kvm->vm_dead = true;
kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..4220579a9a74 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4403,7 +4403,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
struct kvm_fpu *fpu = NULL;
struct kvm_sregs *kvm_sregs = NULL;
- if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+ if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
return -EIO;
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
@@ -4646,7 +4646,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
void __user *argp = compat_ptr(arg);
int r;
- if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+ if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
return -EIO;
switch (ioctl) {
@@ -4712,7 +4712,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
{
struct kvm_device *dev = filp->private_data;
- if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
+ if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
return -EIO;
switch (ioctl) {
@@ -5131,7 +5131,7 @@ static long kvm_vm_ioctl(struct file *filp,
void __user *argp = (void __user *)arg;
int r;
- if (kvm->mm != current->mm || kvm->vm_dead)
+ if (kvm->mm != current->mm || kvm->vm_bugged)
return -EIO;
switch (ioctl) {
case KVM_CREATE_VCPU:
@@ -5395,7 +5395,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
struct kvm *kvm = filp->private_data;
int r;
- if (kvm->mm != current->mm || kvm->vm_dead)
+ if (kvm->mm != current->mm || kvm->vm_bugged)
return -EIO;
r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
base-commit: 8046d29dde17002523f94d3e6e0ebe486ce52166
--
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-19 0:33 ` Sean Christopherson
@ 2025-06-19 11:12 ` Adrian Hunter
2025-06-20 14:24 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Adrian Hunter @ 2025-06-19 11:12 UTC (permalink / raw)
To: Sean Christopherson
Cc: Vishal Annapurve, pbonzini, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
tony.lindgren, binbin.wu, isaku.yamahata, linux-kernel,
yan.y.zhao, chao.gao
On 19/06/2025 03:33, Sean Christopherson wrote:
> On Wed, Jun 18, 2025, Adrian Hunter wrote:
>> On 18/06/2025 09:00, Vishal Annapurve wrote:
>>> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>> Ability to clean up memslots from userspace without closing
>>>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
>>>>> for the next boot iteration of the VM in case of reboot.
>>>>
>>>> TD lifecycle does not include reboot. In other words, reboot is
>>>> done by shutting down the TD and then starting again with a new TD.
>>>>
>>>> AFAIK it is not currently possible to shut down without closing
>>>> guest_memfds since the guest_memfd holds a reference (users_count)
>>>> to struct kvm, and destruction begins when users_count hits zero.
>>>>
>>>
>>> gmem link support[1] allows associating existing guest_memfds with new
>>> VM instances.
>>>
>>> Breakdown of the userspace VMM flow:
>>> 1) Create a new VM instance before closing guest_memfd files.
>>> 2) Link existing guest_memfd files with the new VM instance. -> This
>>> creates new set of files backed by the same inode but associated with
>>> the new VM instance.
>>
>> So what about:
>>
>> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
>>
>> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
>> so avoid causing it to be reclaimed earlier.
>
> The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
> kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> device, but rather devices bound to the VM) ioctls.
(3) is "Close the older guest memfd handles -> results in older VM instance cleanup."
close() is not an IOCTL, so I do not understand.
>
> I intended that behavior, e.g. to guard against userspace blowing up KVM because
> the hkid was released, I just didn't consider the memslots angle.
The patch was tested with QEMU which AFAICT does not touch memslots when
shutting down. Is there a reason to?
Obviously memslots still need to be freed which is done by kvm_destroy_vm().
>
> The other thing I didn't consider at the time, is that vm_dead doesn't fully
> protect against ioctls that are already in flight. E.g. see commit
> 17c80d19df6f ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation
> is in-flight"). Though without a failure/splat of some kind, it's impossible to
> to know if this is actually a problem. I.e. I don't think we should *entirely*
> scrap blocking ioctls just because it *might* not be perfect (we can always make
> KVM better).
>
> I can think of a few options:
>
> 1. Skip the vm_dead check if userspace is deleting a memslot.
> 2. Provide a way for userspace to delete all memslots, and have that bypass
> vm_dead.
> 3. Delete all memslots on KVM_TDX_TERMINATE_VM.
> 4. Remove vm_dead and instead reject ioctls based on vm_bugged, and simply rely
> on KVM_REQ_VM_DEAD to prevent running the guest. I.e. tweak kvm_vm_dead()
> to be that it only prevents running the VM, and have kvm_vm_bugged() be the
> "something is badly broken, try to limit the damage".
>
> I'm heavily leaning toward #4. #1 is doable, but painful. #2 is basically #1,
> but with new uAPI. #3 is all kinds of gross, e.g. userspace might want to simply
> kill the VM and move on. KVM would still block ioctls, but only if a bug was
> detected. And the few use cases where KVM just wants to prevent entering the
> guest won't prevent gracefully tearing down the VM.
>
> Hah! And there's actually a TDX bug fix here, because "checking" for KVM_REQ_VM_DEAD
> in kvm_tdp_map_page() and tdx_handle_ept_violation() will *clear* the request,
> which isn't what we want, e.g. a vCPU could actually re-enter the guest at that
> point.
>
> This is what I'm thinking. If I don't hate it come Friday (or Monday), I'll turn
> this patch into a mini-series and post v5.
>
> Adrian, does that work for you?
Might need some more checks - I will have to look more closely.
>
> ---
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/vgic/vgic-init.c | 2 +-
> arch/x86/kvm/x86.c | 2 +-
> include/linux/kvm_host.h | 2 --
> virt/kvm/kvm_main.c | 10 +++++-----
> 5 files changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index de2b4e9c9f9f..18bd80388b59 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1017,7 +1017,7 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
> static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> {
> if (kvm_request_pending(vcpu)) {
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> + if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> return -EIO;
>
> if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index eb1205654ac8..c2033bae73b2 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -612,7 +612,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
> mutex_unlock(&kvm->arch.config_lock);
> out_slots:
> if (ret)
> - kvm_vm_dead(kvm);
> + kvm_vm_bugged(kvm);
>
> mutex_unlock(&kvm->slots_lock);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b58a74c1722d..37f835d77b65 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10783,7 +10783,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> bool req_immediate_exit = false;
>
> if (kvm_request_pending(vcpu)) {
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
> + if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
> r = -EIO;
> goto out;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bde4fb5c6aa..56898e4ab524 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -853,7 +853,6 @@ struct kvm {
> u32 dirty_ring_size;
> bool dirty_ring_with_bitmap;
> bool vm_bugged;
> - bool vm_dead;
>
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> @@ -893,7 +892,6 @@ struct kvm {
>
> static inline void kvm_vm_dead(struct kvm *kvm)
> {
> - kvm->vm_dead = true;
> kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
> }
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index eec82775c5bf..4220579a9a74 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4403,7 +4403,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
> struct kvm_fpu *fpu = NULL;
> struct kvm_sregs *kvm_sregs = NULL;
>
> - if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
> + if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
> return -EIO;
>
> if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
> @@ -4646,7 +4646,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
> void __user *argp = compat_ptr(arg);
> int r;
>
> - if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
> + if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
> return -EIO;
>
> switch (ioctl) {
> @@ -4712,7 +4712,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
> {
> struct kvm_device *dev = filp->private_data;
>
> - if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
> + if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
> return -EIO;
>
> switch (ioctl) {
> @@ -5131,7 +5131,7 @@ static long kvm_vm_ioctl(struct file *filp,
> void __user *argp = (void __user *)arg;
> int r;
>
> - if (kvm->mm != current->mm || kvm->vm_dead)
> + if (kvm->mm != current->mm || kvm->vm_bugged)
> return -EIO;
> switch (ioctl) {
> case KVM_CREATE_VCPU:
> @@ -5395,7 +5395,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
> struct kvm *kvm = filp->private_data;
> int r;
>
> - if (kvm->mm != current->mm || kvm->vm_dead)
> + if (kvm->mm != current->mm || kvm->vm_bugged)
> return -EIO;
>
> r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
>
> base-commit: 8046d29dde17002523f94d3e6e0ebe486ce52166
> --
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-19 11:12 ` Adrian Hunter
@ 2025-06-20 14:24 ` Sean Christopherson
2025-06-20 16:14 ` Vishal Annapurve
2025-06-20 18:59 ` Edgecombe, Rick P
0 siblings, 2 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-06-20 14:24 UTC (permalink / raw)
To: Adrian Hunter
Cc: Vishal Annapurve, pbonzini, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
tony.lindgren, binbin.wu, isaku.yamahata, linux-kernel,
yan.y.zhao, chao.gao
On Thu, Jun 19, 2025, Adrian Hunter wrote:
> On 19/06/2025 03:33, Sean Christopherson wrote:
> > On Wed, Jun 18, 2025, Adrian Hunter wrote:
> >> On 18/06/2025 09:00, Vishal Annapurve wrote:
> >>> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>> Ability to clean up memslots from userspace without closing
> >>>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> >>>>> for the next boot iteration of the VM in case of reboot.
> >>>>
> >>>> TD lifecycle does not include reboot. In other words, reboot is
> >>>> done by shutting down the TD and then starting again with a new TD.
> >>>>
> >>>> AFAIK it is not currently possible to shut down without closing
> >>>> guest_memfds since the guest_memfd holds a reference (users_count)
> >>>> to struct kvm, and destruction begins when users_count hits zero.
> >>>>
> >>>
> >>> gmem link support[1] allows associating existing guest_memfds with new
> >>> VM instances.
> >>>
> >>> Breakdown of the userspace VMM flow:
> >>> 1) Create a new VM instance before closing guest_memfd files.
> >>> 2) Link existing guest_memfd files with the new VM instance. -> This
> >>> creates new set of files backed by the same inode but associated with
> >>> the new VM instance.
> >>
> >> So what about:
> >>
> >> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
> >>
> >> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
> >> so avoid causing it to be reclaimed earlier.
> >
> > The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
> > kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> > device, but rather devices bound to the VM) ioctls.
>
> (3) is "Close the older guest memfd handles -> results in older VM instance cleanup."
>
> close() is not an IOCTL, so I do not understand.
Sorry, I misread that as "Close the older guest memfd handles by deleting the
memslots".
> > I intended that behavior, e.g. to guard against userspace blowing up KVM because
> > the hkid was released, I just didn't consider the memslots angle.
>
> The patch was tested with QEMU which AFAICT does not touch memslots when
> shutting down. Is there a reason to?
In this case, the VMM process is not shutting down. To emulate a reboot, the
VMM destroys the VM, but reuses the guest_memfd files for the "new" VM. Because
guest_memfd takes a reference to "struct kvm", through memslot bindings, memslots
need to be manually destroyed so that all references are put and the VM is freed
by the kernel. E.g. otherwise multiple reboots would manifest as memory leaks
and eventually OOM the host.
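For completeness, "manually destroying" a memslot from userspace just means
re-registering the same slot id with a zero size. A sketch for a
guest_memfd-backed slot (helper name illustrative, error handling trimmed):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int delete_memslot(int vm_fd, __u32 slot)
{
	struct kvm_userspace_memory_region2 region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;		/* same slot id used at creation */
	region.memory_size = 0;		/* zero size deletes the slot */

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}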
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 14:24 ` Sean Christopherson
@ 2025-06-20 16:14 ` Vishal Annapurve
2025-06-20 16:26 ` Sean Christopherson
2025-06-23 20:36 ` Vishal Annapurve
2025-06-20 18:59 ` Edgecombe, Rick P
1 sibling, 2 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-20 16:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Adrian Hunter, pbonzini, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Fri, Jun 20, 2025 at 7:24 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jun 19, 2025, Adrian Hunter wrote:
> > On 19/06/2025 03:33, Sean Christopherson wrote:
> > > On Wed, Jun 18, 2025, Adrian Hunter wrote:
> > >> On 18/06/2025 09:00, Vishal Annapurve wrote:
> > >>> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> > >>>>> Ability to clean up memslots from userspace without closing
> > >>>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> > >>>>> for the next boot iteration of the VM in case of reboot.
> > >>>>
> > >>>> TD lifecycle does not include reboot. In other words, reboot is
> > >>>> done by shutting down the TD and then starting again with a new TD.
> > >>>>
> > >>>> AFAIK it is not currently possible to shut down without closing
> > >>>> guest_memfds since the guest_memfd holds a reference (users_count)
> > >>>> to struct kvm, and destruction begins when users_count hits zero.
> > >>>>
> > >>>
> > >>> gmem link support[1] allows associating existing guest_memfds with new
> > >>> VM instances.
> > >>>
> > >>> Breakdown of the userspace VMM flow:
> > >>> 1) Create a new VM instance before closing guest_memfd files.
> > >>> 2) Link existing guest_memfd files with the new VM instance. -> This
> > >>> creates new set of files backed by the same inode but associated with
> > >>> the new VM instance.
> > >>
> > >> So what about:
> > >>
> > >> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
> > >>
> > >> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
> > >> so avoid causing it to be reclaimed earlier.
> > >
> > > The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
> > > kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> > > device, but rather devices bound to the VM) ioctls.
> >
> > (3) is "Close the older guest memfd handles -> results in older VM instance cleanup."
> >
> > close() is not an IOCTL, so I do not understand.
>
> Sorry, I misread that as "Close the older guest memfd handles by deleting the
> memslots".
>
> > > I intended that behavior, e.g. to guard against userspace blowing up KVM because
> > > the hkid was released, I just didn't consider the memslots angle.
> >
> > The patch was tested with QEMU which AFAICT does not touch memslots when
> > shutting down. Is there a reason to?
>
> In this case, the VMM process is not shutting down. To emulate a reboot, the
> VMM destroys the VM, but reuses the guest_memfd files for the "new" VM. Because
> guest_memfd takes a reference to "struct kvm", through memslot bindings, memslots
guest_memfd takes a reference on the "struct kvm" only on
creation/linking, currently memslot binding doesn't add additional
references.
Adrian's suggestion makes sense and it should be functional but I am
running into some issues which likely need to be resolved on the
userspace side. I will keep this thread updated.
Currently testing this reboot flow:
1) Issue KVM_TDX_TERMINATE_VM on the old VM.
2) Close the VM fd.
3) Create a new VM fd.
4) Link the old guest_memfd handles to the new VM fd.
5) Close the old guest_memfd handles.
6) Register memslots on the new VM using the linked guest_memfd handles.
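Roughly, in code, the six steps above look like the following sketch. The
KVM_LINK_GUEST_MEMFD name and its calling convention are hypothetical
placeholders for what the gmem link series proposes; TD re-initialization,
vCPU creation, slot layout and all error handling are elided:

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int td_reboot(int kvm_fd, int old_vm_fd, int old_gmem_fd,
		     __u64 gpa, __u64 size, void *shared_backing)
{
	struct kvm_tdx_cmd cmd = { .id = KVM_TDX_TERMINATE_VM };
	struct kvm_userspace_memory_region2 region = { 0 };
	int new_vm_fd, new_gmem_fd;

	ioctl(old_vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);	/* 1) terminate old TD */
	close(old_vm_fd);				/* 2) drop the old VM fd */

	new_vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);	/* 3) */

	/*
	 * 4) Hypothetical linking ioctl: returns a new guest_memfd fd backed
	 *    by the same inode (and pages) but owned by the new VM instance.
	 */
	new_gmem_fd = ioctl(new_vm_fd, KVM_LINK_GUEST_MEMFD, old_gmem_fd);

	close(old_gmem_fd);				/* 5) old handles */

	region.slot = 0;				/* 6) re-register memslot */
	region.flags = KVM_MEM_GUEST_MEMFD;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (__u64)(unsigned long)shared_backing;
	region.guest_memfd = new_gmem_fd;
	ioctl(new_vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	return new_vm_fd;
}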
That being said, I still see the value in what Sean suggested.
" Remove vm_dead and instead reject ioctls based on vm_bugged, and simply rely
on KVM_REQ_VM_DEAD to prevent running the guest."
This will help with:
1) Keeping the cleanup sequence as close as possible to the normal VM
cleanup sequence.
2) Actual VM destruction happens at step 5 from the above mentioned
flow, if there is any cleanup that happens asynchronously, userspace
can enforce synchronous cleanup by executing graceful VM shutdown
stages before step 2 above.
And IIUC the goal here is to achieve exactly what Sean suggested above
i.e. prevent running the guest after KVM_TDX_TERMINATE_VM is issued.
> need to be manually destroyed so that all references are put and the VM is freed
> by the kernel. E.g. otherwise multiple reboots would manifest as memory leaks
> and eventually OOM the host.
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 16:14 ` Vishal Annapurve
@ 2025-06-20 16:26 ` Sean Christopherson
2025-06-23 20:36 ` Vishal Annapurve
1 sibling, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-06-20 16:26 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Adrian Hunter, pbonzini, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Fri, Jun 20, 2025, Vishal Annapurve wrote:
> On Fri, Jun 20, 2025 at 7:24 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
> > > > kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> > > > device, but rather devices bound to the VM) ioctls.
> > >
> > > (3) is "Close the older guest memfd handles -> results in older VM instance cleanup."
> > >
> > > close() is not an IOCTL, so I do not understand.
> >
> > Sorry, I misread that as "Close the older guest memfd handles by deleting the
> > memslots".
> >
> > > > I intended that behavior, e.g. to guard against userspace blowing up KVM because
> > > > the hkid was released, I just didn't consider the memslots angle.
> > >
> > > The patch was tested with QEMU which AFAICT does not touch memslots when
> > > shutting down. Is there a reason to?
> >
> > In this case, the VMM process is not shutting down. To emulate a reboot, the
> > VMM destroys the VM, but reuses the guest_memfd files for the "new" VM. Because
> > guest_memfd takes a reference to "struct kvm", through memslot bindings, memslots
>
> guest_memfd takes a reference on the "struct kvm" only on
> creation/linking, currently memslot binding doesn't add additional
> references.
Oh yeah, duh.
> Adrian's suggestion makes sense
+1. It should also be faster overall (hopefully notably faster?).
> and it should be functional but I am running into some issues which likely
> need to be resolved on the userspace side. I will keep this thread updated.
>
> Currently testing this reboot flow:
> 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> 2) Close the VM fd.
> 3) Create a new VM fd.
> 4) Link the old guest_memfd handles to the new VM fd.
> 5) Close the old guest_memfd handles.
> 6) Register memslots on the new VM using the linked guest_memfd handles.
>
> That being said, I still see the value in what Sean suggested.
> " Remove vm_dead and instead reject ioctls based on vm_bugged, and simply rely
> on KVM_REQ_VM_DEAD to prevent running the guest."
>
> This will help with:
> 1) Keeping the cleanup sequence as close as possible to the normal VM
> cleanup sequence.
> 2) Actual VM destruction happens at step 5 from the above mentioned
> flow, if there is any cleanup that happens asynchronously, userspace
> can enforce synchronous cleanup by executing graceful VM shutdown
> stages before step 2 above.
>
> And IIUC the goal here is to achieve exactly what Sean suggested above
> i.e. prevent running the guest after KVM_TDX_TERMINATE_VM is issued.
Ya, I still like the idea. But given that it's not needed for KVM_TDX_TERMINATE_VM,
it can and should be posted/landed separately.
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 14:24 ` Sean Christopherson
2025-06-20 16:14 ` Vishal Annapurve
@ 2025-06-20 18:59 ` Edgecombe, Rick P
2025-06-20 21:21 ` Vishal Annapurve
1 sibling, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-06-20 18:59 UTC (permalink / raw)
To: Hunter, Adrian, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
Annapurve, Vishal, Chatre, Reinette, Li, Xiaoyao,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Fri, 2025-06-20 at 07:24 -0700, Sean Christopherson wrote:
> > The patch was tested with QEMU which AFAICT does not touch memslots when
> > shutting down. Is there a reason to?
>
> In this case, the VMM process is not shutting down. To emulate a reboot, the
> VMM destroys the VM, but reuses the guest_memfd files for the "new" VM.
> Because guest_memfd takes a reference to "struct kvm", through memslot
> bindings, memslots need to be manually destroyed so that all references are
> put and the VM is freed by the kernel.
Sorry if I'm being dumb, but why does it do this? It saves freeing/allocating
the guestmemfd pages? Or the in-place data gets reused somehow?
The series Vishal linked has some kind of SEV state transfer thing. How is it
intended to work for TDX?
> E.g. otherwise multiple reboots would manifest as memory leaks and
> eventually OOM the host.
This is in the case of future guestmemfd functionality? Or today?
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 18:59 ` Edgecombe, Rick P
@ 2025-06-20 21:21 ` Vishal Annapurve
2025-06-20 23:34 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-20 21:21 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hunter, Adrian, seanjc@google.com, Gao, Chao, Huang, Kai,
binbin.wu@linux.intel.com, Chatre, Reinette, Li, Xiaoyao,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Fri, Jun 20, 2025 at 11:59 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2025-06-20 at 07:24 -0700, Sean Christopherson wrote:
> > > The patch was tested with QEMU which AFAICT does not touch memslots when
> > > shutting down. Is there a reason to?
> >
> > In this case, the VMM process is not shutting down. To emulate a reboot, the
> > VMM destroys the VM, but reuses the guest_memfd files for the "new" VM.
> > Because guest_memfd takes a reference to "struct kvm", through memslot
> > bindings, memslots need to be manually destroyed so that all references are
> > put and the VM is freed by the kernel.
>
> Sorry if I'm being dumb, but why does it do this? It saves freeing/allocating
> the guestmemfd pages? Or the in-place data gets reused somehow?
The goal is just to be able to reuse the same physical memory for the
next boot of the guest. Freeing and faulting-in the same amount of
memory is redundant and time-consuming for large VM sizes.
>
> The series Vishal linked has some kind of SEV state transfer thing. How is it
> intended to work for TDX?
The series[1] unblocks intrahost-migration [2] and reboot usecases.
[1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
[2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
>
> > E.g. otherwise multiple reboots would manifest as memory leaks and
> > eventually OOM the host.
>
> This is in the case of future guestmemfd functionality? Or today?
Intrahost-migration and guest reboot are important usecases for Google
to support guest VM lifecycles.
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 21:21 ` Vishal Annapurve
@ 2025-06-20 23:34 ` Edgecombe, Rick P
2025-06-21 3:00 ` Vishal Annapurve
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-06-20 23:34 UTC (permalink / raw)
To: Annapurve, Vishal
Cc: Gao, Chao, linux-kernel@vger.kernel.org, seanjc@google.com,
Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
Li, Xiaoyao, Hunter, Adrian, kirill.shutemov@linux.intel.com,
tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com, Zhao, Yan Y
On Fri, 2025-06-20 at 14:21 -0700, Vishal Annapurve wrote:
> > Sorry if I'm being dumb, but why does it do this? It saves
> > freeing/allocating
> > the guestmemfd pages? Or the in-place data gets reused somehow?
>
> The goal is just to be able to reuse the same physical memory for the
> next boot of the guest. Freeing and faulting-in the same amount of
> memory is redundant and time-consuming for large VM sizes.
Can you provide enough information to evaluate how the whole problem is being
solved? (it sounds like you have the full solution implemented?)
The problem seems to be that rebuilding a whole TD for reboot is too slow. Does
the S-EPT survive if the VM is destroyed? If not, how does keeping the pages in
guestmemfd help with re-faulting? If the S-EPT is preserved, then what happens
when the new guest re-accepts it?
>
> >
> > The series Vishal linked has some kind of SEV state transfer thing. How is
> > it
> > intended to work for TDX?
>
> The series[1] unblocks intrahost-migration [2] and reboot usecases.
>
> [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
> [2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
The question was: how was this reboot optimization intended to work for TDX? Are
you saying that it works via intra-host migration? Like some state is migrated
to the new TD to start it up?
>
> >
> > > E.g. otherwise multiple reboots would manifest as memory leakds and
> > > eventually OOM the host.
> >
> > This is in the case of future guestmemfd functionality? Or today?
This question was originally intended for Sean, but I gather from context that
the answer is in the future.
>
> Intrahost-migration and guest reboot are important usecases for Google
> to support guest VM lifecycles.
I am not challenging the priority of the use case *at all*.
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 23:34 ` Edgecombe, Rick P
@ 2025-06-21 3:00 ` Vishal Annapurve
2025-06-23 16:23 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-21 3:00 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, linux-kernel@vger.kernel.org, seanjc@google.com,
Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
Li, Xiaoyao, Hunter, Adrian, kirill.shutemov@linux.intel.com,
tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com, Zhao, Yan Y
On Fri, Jun 20, 2025 at 4:34 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2025-06-20 at 14:21 -0700, Vishal Annapurve wrote:
> > > Sorry if I'm being dumb, but why does it do this? It saves
> > > freeing/allocating
> > > the guestmemfd pages? Or the in-place data gets reused somehow?
> >
> > The goal is just to be able to reuse the same physical memory for the
> > next boot of the guest. Freeing and faulting-in the same amount of
> > memory is redundant and time-consuming for large VM sizes.
>
> Can you provide enough information to evaluate how the whole problem is being
> solved? (it sounds like you have the full solution implemented?)
>
> The problem seems to be that rebuilding a whole TD for reboot is too slow. Does
> the S-EPT survive if the VM is destroyed? If not, how does keeping the pages in
> guestmemfd help with re-faulting? If the S-EPT is preserved, then what happens
> when the new guest re-accepts it?
SEPT entries don't survive reboots.
The faulting-in I was referring to is just allocation of memory pages
for guest_memfd offsets.
>
> >
> > >
> > > The series Vishal linked has some kind of SEV state transfer thing. How is
> > > it
> > > intended to work for TDX?
> >
> > The series[1] unblocks intrahost-migration [2] and reboot usecases.
> >
> > [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
> > [2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
>
> The question was: how was this reboot optimization intended to work for TDX? Are
> you saying that it works via intra-host migration? Like some state is migrated
> to the new TD to start it up?
Reboot optimization is not specific to TDX, it's basically just about
trying to reuse the same physical memory for the next boot. No state
is preserved here except the mapping of guest_memfd offsets to
physical memory pages.
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-21 3:00 ` Vishal Annapurve
@ 2025-06-23 16:23 ` Edgecombe, Rick P
2025-06-23 20:22 ` Vishal Annapurve
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-06-23 16:23 UTC (permalink / raw)
To: Annapurve, Vishal
Cc: Gao, Chao, seanjc@google.com, Huang, Kai,
binbin.wu@linux.intel.com, Chatre, Reinette,
linux-kernel@vger.kernel.org, Hunter, Adrian, Li, Xiaoyao,
tony.lindgren@linux.intel.com, kirill.shutemov@linux.intel.com,
Yamahata, Isaku, kvm@vger.kernel.org, Zhao, Yan Y,
pbonzini@redhat.com
On Fri, 2025-06-20 at 20:00 -0700, Vishal Annapurve wrote:
> Can you provide enough information to evaluate how the whole problem is being
> > solved? (it sounds like you have the full solution implemented?)
> >
> > The problem seems to be that rebuilding a whole TD for reboot is too slow. Does
> > the S-EPT survive if the VM is destroyed? If not, how does keeping the pages in
> > guestmemfd help with re-faulting? If the S-EPT is preserved, then what happens
> > when the new guest re-accepts it?
>
> SEPT entries don't survive reboots.
>
> The faulting-in I was referring to is just allocation of memory pages
> for guest_memfd offsets.
>
> >
> > >
> > > >
> > > > The series Vishal linked has some kind of SEV state transfer thing. How is
> > > > it
> > > > intended to work for TDX?
> > >
> > > The series[1] unblocks intrahost-migration [2] and reboot usecases.
> > >
> > > [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
> > > [2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
> >
> > The question was: how was this reboot optimization intended to work for TDX? Are
> > you saying that it works via intra-host migration? Like some state is migrated
> > to the new TD to start it up?
>
> Reboot optimization is not specific to TDX, it's basically just about
> trying to reuse the same physical memory for the next boot. No state
> is preserved here except the mapping of guest_memfd offsets to
> physical memory pages.
Hmm, it doesn't sound like much work, especially at the 1GB level. I wonder if
it has something to do with the cost of zeroing the pages. If they went to a
global allocator and back, they would need to be zeroed to make sure data is not
leaked to another userspace process. But if it stays with the fd, this could be
skipped?
For TDX though, hmm, we may not actually need to zero the private pages because
of the transition to keyid 0. It would be beneficial to have the different VM
types work the same. But, under this speculation of the real benefit, there may
be other ways to get the same benefits that are worth considering when we hit
frictions like this. To do that kind of consideration though, everyone needs to
understand what the real goal is.
In general I think we really need to fully evaluate these optimizations as part
of the upstreaming process. We have already seen two post-base series TDX
optimizations that didn't stand up under scrutiny. It turned out the existing
TDX page promotion implementation wasn't actually getting used much if at all.
Also, the parallel TD reclaim thing turned out to be misguided once we looked
into the root cause. So if we blindly incorporate optimizations based on vague
or promised justification, it seems likely we will end up maintaining some
amount of complex code with no purpose. Then it will be difficult to prove later
that it is not needed, and it will just remain a burden.
So can we please start explaining more of the "why" for this stuff so we can get
to the best upstream solution?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-23 16:23 ` Edgecombe, Rick P
@ 2025-06-23 20:22 ` Vishal Annapurve
2025-06-23 22:51 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-23 20:22 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, seanjc@google.com, Huang, Kai,
binbin.wu@linux.intel.com, Chatre, Reinette,
linux-kernel@vger.kernel.org, Hunter, Adrian, Li, Xiaoyao,
tony.lindgren@linux.intel.com, kirill.shutemov@linux.intel.com,
Yamahata, Isaku, kvm@vger.kernel.org, Zhao, Yan Y,
pbonzini@redhat.com
On Mon, Jun 23, 2025 at 9:23 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2025-06-20 at 20:00 -0700, Vishal Annapurve wrote:
> > Can you provide enough information to evaluate how the whole problem is being
> > > solved? (it sounds like you have the full solution implemented?)
> > >
> > > The problem seems to be that rebuilding a whole TD for reboot is too slow. Does
> > > the S-EPT survive if the VM is destroyed? If not, how does keeping the pages in
> > > guestmemfd help with re-faulting? If the S-EPT is preserved, then what happens
> > > when the new guest re-accepts it?
> >
> > SEPT entries don't survive reboots.
> >
> > The faulting-in I was referring to is just allocation of memory pages
> > for guest_memfd offsets.
> >
> > >
> > > >
> > > > >
> > > > > The series Vishal linked has some kind of SEV state transfer thing. How is
> > > > > it
> > > > > intended to work for TDX?
> > > >
> > > > The series[1] unblocks intrahost-migration [2] and reboot usecases.
> > > >
> > > > [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
> > > > [2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
> > >
> > > The question was: how was this reboot optimization intended to work for TDX? Are
> > > you saying that it works via intra-host migration? Like some state is migrated
> > > to the new TD to start it up?
> >
> > Reboot optimization is not specific to TDX, it's basically just about
> > trying to reuse the same physical memory for the next boot. No state
> > is preserved here except the mapping of guest_memfd offsets to
> > physical memory pages.
>
> Hmm, it doesn't sound like much work, especially at the 1GB level. I wonder if
> it has something to do with the cost of zeroing the pages. If they went to a
> global allocator and back, they would need to be zeroed to make sure data is not
> leaked to another userspace process. But if it stays with the fd, this could be
> skipped?
A simple question I ask myself is: if a certain memory-specific
optimization/feature is enabled for non-confidential VMs, why can't it
be enabled for Confidential VMs? I think as long as we cleanly
separate memory management from RMP/SEPT management for CVMs, there
should ideally be no major issues with enabling such optimizations for
Confidential VMs.
Just memory allocation without zeroing, even with hugepages, takes time
for large VM shapes, and I don't really see a valid reason for the
userspace VMM to repeat the freeing and allocation cycles.
> For TDX though, hmm, we may not actually need to zero the private pages because
> of the transition to keyid 0. It would be beneficial to have the different VMs
> types work the same. But, under this speculation of the real benefit, there may
> be other ways to get the same benefits that are worth considering when we hit
> frictions like this. To do that kind of consideration though, everyone needs to
> understand what the real goal is.
>
> In general I think we really need to fully evaluate these optimizations as part
> of the upstreaming process. We have already seen two post-base series TDX
> optimizations that didn't stand up under scrutiny. It turned out the existing
> TDX page promotion implementation wasn't actually getting used much if at all.
> Also, the parallel TD reclaim thing turned out to be misguided once we looked
For ~700G of guest memory, guest shutdown times are:
1) Parallel TD reclaim + hugepage EPT mappings : 30 secs
2) TD shutdown with KVM_TDX_TERMINATE_VM + hugepage EPT mappings: 2 mins
3) Without any optimization: ~ 30-40 mins
KVM_TDX_TERMINATE_VM for now is a very good start and is much simpler
to upstream.
> into the root cause. So if we blindly incorporate optimizations based on vague
> or promised justification, it seems likely we will end up maintaining some
> amount of complex code with no purpose. Then it will be difficult to prove later
> that it is not needed, and just remain a burden.
>
> So can we please start explaining more of the "why" for this stuff so we can get
> to the best upstream solution?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-20 16:14 ` Vishal Annapurve
2025-06-20 16:26 ` Sean Christopherson
@ 2025-06-23 20:36 ` Vishal Annapurve
2025-06-23 21:39 ` Sean Christopherson
1 sibling, 1 reply; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-23 20:36 UTC (permalink / raw)
To: Sean Christopherson
Cc: Adrian Hunter, pbonzini, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Fri, Jun 20, 2025 at 9:14 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Jun 20, 2025 at 7:24 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jun 19, 2025, Adrian Hunter wrote:
> > > On 19/06/2025 03:33, Sean Christopherson wrote:
> > > > On Wed, Jun 18, 2025, Adrian Hunter wrote:
> > > >> On 18/06/2025 09:00, Vishal Annapurve wrote:
> > > >>> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> > > >>>>> Ability to clean up memslots from userspace without closing
> > > >>>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> > > >>>>> for the next boot iteration of the VM in case of reboot.
> > > >>>>
> > > >>>> TD lifecycle does not include reboot. In other words, reboot is
> > > >>>> done by shutting down the TD and then starting again with a new TD.
> > > >>>>
> > > >>>> AFAIK it is not currently possible to shut down without closing
> > > >>>> guest_memfds since the guest_memfd holds a reference (users_count)
> > > >>>> to struct kvm, and destruction begins when users_count hits zero.
> > > >>>>
> > > >>>
> > > >>> gmem link support[1] allows associating existing guest_memfds with new
> > > >>> VM instances.
> > > >>>
> > > >>> Breakdown of the userspace VMM flow:
> > > >>> 1) Create a new VM instance before closing guest_memfd files.
> > > >>> 2) Link existing guest_memfd files with the new VM instance. -> This
> > > >>> creates new set of files backed by the same inode but associated with
> > > >>> the new VM instance.
> > > >>
> > > >> So what about:
> > > >>
> > > >> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
> > > >>
> > > >> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
> > > >> so avoid causing it to be reclaimed earlier.
> > > >
> > > > The problem is that setting kvm->vm_dead will prevent (3) from succeeding. If
> > > > kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> > > > device, but rather devices bound to the VM) ioctls.
> > >
> > > (3) is "Close the older guest memfd handles -> results in older VM instance cleanup."
> > >
> > > close() is not an IOCTL, so I do not understand.
> >
> > Sorry, I misread that as "Close the older guest memfd handles by deleting the
> > memslots".
> >
> > > > I intended that behavior, e.g. to guard against userspace blowing up KVM because
> > > > the hkid was released, I just didn't consider the memslots angle.
> > >
> > > The patch was tested with QEMU which AFAICT does not touch memslots when
> > > shutting down. Is there a reason to?
> >
> > In this case, the VMM process is not shutting down. To emulate a reboot, the
> > VMM destroys the VM, but reuses the guest_memfd files for the "new" VM. Because
> > guest_memfd takes a reference to "struct kvm", through memslot bindings, memslots
>
> guest_memfd takes a reference on the "struct kvm" only on
> creation/linking, currently memslot binding doesn't add additional
> references.
>
> Adrian's suggestion makes sense and it should be functional but I am
> running into some issues which likely need to be resolved on the
> userspace side. I will keep this thread updated.
>
> Currently testing this reboot flow:
> 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> 2) Close the VM fd.
> 3) Create a new VM fd.
> 4) Link the old guest_memfd handles to the new VM fd.
> 5) Close the old guest_memfd handles.
> 6) Register memslots on the new VM using the linked guest_memfd handles.
>
Apparently mmap takes a refcount on backing files.
So basically I had to modify the reboot flow as:
1) Issue KVM_TDX_TERMINATE_VM on the old VM.
2) Close the VM fd.
3) Create a new VM fd.
4) Link the old guest_memfd handles to the new VM fd.
5) Unmap the VMAs backed by the guest memfd
6) Close the old guest_memfd handles. -> Results in VM destruction
7) Setup new VMAs backed by linked guest_memfd handles.
8) Register memslots on the new VM using the linked guest_memfd handles.
I think the issue simply is that we have tied guest_memfd lifecycle
with VM lifecycle and that discussion is out of scope for this patch.
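For concreteness, step 1 above is just the usual TDX sub-ioctl invocation. A
minimal userspace sketch (hypothetical vm_fd, error handling mostly omitted),
assuming the KVM_TDX_TERMINATE_VM definition from this patch is in the uapi
headers and the sub-ioctl is reached via KVM_MEMORY_ENCRYPT_OP like the other
KVM_TDX_* commands:
--
#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void terminate_td(int vm_fd)
{
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_TERMINATE_VM,
		/* flags, data and hw_error must all be 0 */
	};

	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
		err(1, "KVM_TDX_TERMINATE_VM");
}
--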
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-11 9:51 ` [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Adrian Hunter
2025-06-16 3:40 ` Vishal Annapurve
@ 2025-06-23 20:40 ` Vishal Annapurve
1 sibling, 0 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-23 20:40 UTC (permalink / raw)
To: Adrian Hunter
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Wed, Jun 11, 2025 at 2:52 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
> which enables more efficient reclaim of private memory.
>
> Private memory is removed from MMU/TDP when guest_memfds are closed. If
> the HKID has not been released, the TDX VM is still in RUNNABLE state,
> so pages must be removed using "Dynamic Page Removal" procedure (refer
> TDX Module Base spec) which involves a number of steps:
> Block further address translation
> Exit each VCPU
> Clear Secure EPT entry
> Flush/write-back/invalidate relevant caches
>
> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
> directly.
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs Size (GB) Before (secs) After (secs)
> 4 18 72 24
> 32 107 517 134
> 64 400 5539 467
>
> Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
> Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>
>
> Changes in V4:
>
> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> Use KVM_BUG_ON() instead of WARN_ON().
> Correct kvm_trylock_all_vcpus() return value.
>
> Changes in V3:
>
> Remove KVM_BUG_ON() from tdx_mmu_release_hkid() because it would
> trigger on the error path from __tdx_td_init()
>
> Put cpus_read_lock() handling back into tdx_mmu_release_hkid()
>
> Handle KVM_TDX_TERMINATE_VM in the switch statement, i.e. let
> tdx_vm_ioctl() deal with kvm->lock
>
>
> Documentation/virt/kvm/x86/intel-tdx.rst | 16 +++++++++++
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/vmx/tdx.c | 34 ++++++++++++++++++------
> 3 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> index de41d4c01e5c..e5d4d9cf4cf2 100644
> --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> @@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
> KVM_TDX_INIT_MEM_REGION,
> KVM_TDX_FINALIZE_VM,
> KVM_TDX_GET_CPUID,
> + KVM_TDX_TERMINATE_VM,
>
> KVM_TDX_CMD_NR_MAX,
> };
> @@ -214,6 +215,21 @@ struct kvm_cpuid2.
> __u32 padding[3];
> };
>
> +KVM_TDX_TERMINATE_VM
> +-------------------
> +:Type: vm ioctl
> +:Returns: 0 on success, <0 on error
> +
> +Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
> +After this, the TD is no longer in a runnable state.
> +
> +Using KVM_TDX_TERMINATE_VM is optional.
> +
> +- id: KVM_TDX_TERMINATE_VM
> +- flags: must be 0
> +- data: must be 0
> +- hw_error: must be 0
> +
> KVM TDX creation flow
> =====================
> In addition to the standard KVM flow, new TDX ioctls need to be called. The
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 6f3499507c5e..697d396b2cc0 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -940,6 +940,7 @@ enum kvm_tdx_cmd_id {
> KVM_TDX_INIT_MEM_REGION,
> KVM_TDX_FINALIZE_VM,
> KVM_TDX_GET_CPUID,
> + KVM_TDX_TERMINATE_VM,
>
> KVM_TDX_CMD_NR_MAX,
> };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..457f91b95147 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -515,6 +515,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> goto out;
> }
>
> + write_lock(&kvm->mmu_lock);
> for_each_online_cpu(i) {
> if (packages_allocated &&
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -539,7 +540,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> } else {
> tdx_hkid_free(kvm_tdx);
> }
> -
> + write_unlock(&kvm->mmu_lock);
> out:
> mutex_unlock(&tdx_lock);
> cpus_read_unlock();
> @@ -1789,13 +1790,13 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> struct page *page = pfn_to_page(pfn);
> int ret;
>
> - /*
> - * HKID is released after all private pages have been removed, and set
> - * before any might be populated. Warn if zapping is attempted when
> - * there can't be anything populated in the private EPT.
> - */
> - if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> - return -EINVAL;
> + if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
> + KVM_BUG_ON(!kvm->vm_dead, kvm);
> + ret = tdx_reclaim_page(page);
> + if (!ret)
> + tdx_unpin(kvm, page);
> + return ret;
> + }
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -2790,6 +2791,20 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> + if (kvm_trylock_all_vcpus(kvm))
> + return -EBUSY;
> +
> + kvm_vm_dead(kvm);
> +
> + kvm_unlock_all_vcpus(kvm);
> +
> + tdx_mmu_release_hkid(kvm);
> +
> + return 0;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -2817,6 +2832,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_TDX_FINALIZE_VM:
> r = tdx_td_finalize(kvm, &tdx_cmd);
> break;
> + case KVM_TDX_TERMINATE_VM:
> + r = tdx_terminate_vm(kvm);
> + break;
> default:
> r = -EINVAL;
> goto out;
> --
> 2.48.1
>
>
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-23 20:36 ` Vishal Annapurve
@ 2025-06-23 21:39 ` Sean Christopherson
2025-06-23 23:35 ` Vishal Annapurve
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-06-23 21:39 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Adrian Hunter, pbonzini, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Mon, Jun 23, 2025, Vishal Annapurve wrote:
> On Fri, Jun 20, 2025 at 9:14 AM Vishal Annapurve <vannapurve@google.com> wrote:
> > Adrian's suggestion makes sense and it should be functional but I am
> > running into some issues which likely need to be resolved on the
> > userspace side. I will keep this thread updated.
> >
> > Currently testing this reboot flow:
> > 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> > 2) Close the VM fd.
> > 3) Create a new VM fd.
> > 4) Link the old guest_memfd handles to the new VM fd.
> > 5) Close the old guest_memfd handles.
> > 6) Register memslots on the new VM using the linked guest_memfd handles.
> >
>
> Apparently mmap takes a refcount on backing files.
Heh, yep.
> So basically I had to modify the reboot flow as:
> 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> 2) Close the VM fd.
> 3) Create a new VM fd.
> 4) Link the old guest_memfd handles to the new VM fd.
> 5) Unmap the VMAs backed by the guest memfd
> 6) Close the old guest_memfd handles. -> Results in VM destruction
> 7) Setup new VMAs backed by linked guest_memfd handles.
> 8) Register memslots on the new VM using the linked guest_memfd handles.
>
> I think the issue simply is that we have tied guest_memfd lifecycle
> with VM lifecycle and that discussion is out of scope for this patch.
I wouldn't say it's entirely out of scope. E.g. if there's a blocking problem
_in the kernel_ that prevents utilizing KVM_TDX_TERMINATE_VM, then we definitely
want to sort that out before adding support for KVM_TDX_TERMINATE_VM.
But IIUC, the hiccups you've encountered essentially fall into the category of
"working as intended", albeit with a lot of not-so-obvious behaviors and dependencies.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-23 20:22 ` Vishal Annapurve
@ 2025-06-23 22:51 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-06-23 22:51 UTC (permalink / raw)
To: Annapurve, Vishal
Cc: Gao, Chao, seanjc@google.com, Huang, Kai,
binbin.wu@linux.intel.com, Chatre, Reinette, Li, Xiaoyao,
Hunter, Adrian, tony.lindgren@linux.intel.com,
linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
Yamahata, Isaku, kvm@vger.kernel.org, Zhao, Yan Y,
pbonzini@redhat.com
On Mon, 2025-06-23 at 13:22 -0700, Vishal Annapurve wrote:
> A simple question I ask to myself is that if a certain memory specific
> optimization/feature is enabled for non-confidential VMs, why it can't
> be enabled for Confidential VMs. I think as long as we cleanly
> separate memory management from RMP/SEPT management for CVMs, there
> should ideally be no major issues with enabling such optimizations for
> Confidential VMs.
Yes, having them work the same should probably help with maintainability, as
long as making them work the same doesn't cause too much complexity somewhere
else, i.e. the kind of thing we were discussing here.
>
> Just memory allocation without zeroing, even with hugepages takes time
> for large VM shapes and I don't really see a valid reason for the
> userspace VMM to repeat the freeing and allocation cycles.
Hmm, this is surprising to me. Do you have any idea what kind of cycles we are
talking about?
>
> > For TDX though, hmm, we may not actually need to zero the private pages because
> > of the transition to keyid 0. It would be beneficial to have the different VMs
> > types work the same. But, under this speculation of the real benefit, there may
> > be other ways to get the same benefits that are worth considering when we hit
> > frictions like this. To do that kind of consideration though, everyone needs to
> > understand what the real goal is.
> >
> > In general I think we really need to fully evaluate these optimizations as part
> > of the upstreaming process. We have already seen two post-base series TDX
> > optimizations that didn't stand up under scrutiny. It turned out the existing
> > TDX page promotion implementation wasn't actually getting used much if at all.
> > Also, the parallel TD reclaim thing turned out to be misguided once we looked
>
> For a ~700G guest memory, guest shutdown times:
> 1) Parallel TD reclaim + hugepage EPT mappings : 30 secs
> 2) TD shutdown with KVM_TDX_TERMINATE_VM + hugepage EPT mappings: 2 mins
> 3) Without any optimization: ~ 30-40 mins
>
> KVM_TDX_TERMINATE_VM for now is a very good start and is much simpler
> to upstream.
Parallel reclaim is misguided because it's attacking the wrong root cause. It's
not an example of a bad goal, but a pitfall of requiring a specific solution
instead of reviewing the reasoning as part of the upstreaming process. We
shouldn't do a parallel reclaim ioctl because there are simpler, faster ways to
reclaim. It looks like we never circled back on this though. My bad for bringing
it up as an example for explaining the details.
We have not posted the alternate approach because we have too many TDX series in
progress on the list and I think we should do them iteratively. Also, as you say,
huge pages + KVM_TDX_TERMINATE_VM gets us an order of magnitude of the way there,
which puts further improvements down the priority list.
>
> > into the root cause. So if we blindly incorporate optimizations based on vague
> > or promised justification, it seems likely we will end up maintaining some
> > amount of complex code with no purpose. Then it will be difficult to prove later
> > that it is not needed, and just remain a burden.
> >
> > So can we please start explaining more of the "why" for this stuff so we can get
> > to the best upstream solution?
So the answer is no? I think we should close it because it seems to be
generating a lot of mails with the same pattern.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
2025-06-23 21:39 ` Sean Christopherson
@ 2025-06-23 23:35 ` Vishal Annapurve
0 siblings, 0 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-06-23 23:35 UTC (permalink / raw)
To: Sean Christopherson
Cc: Adrian Hunter, pbonzini, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Mon, Jun 23, 2025 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jun 23, 2025, Vishal Annapurve wrote:
> > On Fri, Jun 20, 2025 at 9:14 AM Vishal Annapurve <vannapurve@google.com> wrote:
> > > Adrian's suggestion makes sense and it should be functional but I am
> > > running into some issues which likely need to be resolved on the
> > > userspace side. I will keep this thread updated.
> > >
> > > Currently testing this reboot flow:
> > > 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> > > 2) Close the VM fd.
> > > 3) Create a new VM fd.
> > > 4) Link the old guest_memfd handles to the new VM fd.
> > > 5) Close the old guest_memfd handles.
> > > 6) Register memslots on the new VM using the linked guest_memfd handles.
> > >
> >
> > Apparently mmap takes a refcount on backing files.
>
> Heh, yep.
>
> > So basically I had to modify the reboot flow as:
> > 1) Issue KVM_TDX_TERMINATE_VM on the old VM.
> > 2) Close the VM fd.
> > 3) Create a new VM fd.
> > 4) Link the old guest_memfd handles to the new VM fd.
> > 5) Unmap the VMAs backed by the guest memfd
> > 6) Close the old guest_memfd handles. -> Results in VM destruction
> > 7) Setup new VMAs backed by linked guest_memfd handles.
> > 8) Register memslots on the new VM using the linked guest_memfd handles.
> >
> > I think the issue simply is that we have tied guest_memfd lifecycle
> > with VM lifecycle and that discussion is out of scope for this patch.
>
> I wouldn't say it's entirely out of scope. E.g. if there's a blocking problem
> _in the kernel_ that prevents utilizing KVM_TDX_TERMINATE_VM, then we definitely
> want to sort that out before adding support for KVM_TDX_TERMINATE_VM.
>
> But IIUC, the hiccups you've encountered essentially fall into the category of
> "working as intended", albeit with a lot of not-so-obvious behaviors and dependencies.
Yes, that's correct. The "issue" I referred to above is just extra
steps that need to be taken by the userspace VMM and is still in the
"working as intended" category.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-06-11 9:51 [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Adrian Hunter
2025-06-11 9:51 ` [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Adrian Hunter
@ 2025-06-25 22:25 ` Sean Christopherson
2025-06-26 15:58 ` Sean Christopherson
1 sibling, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-06-25 22:25 UTC (permalink / raw)
To: Sean Christopherson, pbonzini, Adrian Hunter
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
> Changes in V4:
>
> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> Use KVM_BUG_ON() instead of WARN_ON().
> Correct kvm_trylock_all_vcpus() return value.
>
> Changes in V3:
> Refer:
> https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>
> [...]
Applied to kvm-x86 vmx, thanks!
[1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
https://github.com/kvm-x86/linux/commit/111a7311a016
--
https://github.com/kvm-x86/kvm-unit-tests/tree/next
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-06-25 22:25 ` [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Sean Christopherson
@ 2025-06-26 15:58 ` Sean Christopherson
2025-06-26 19:52 ` Adrian Hunter
2025-07-11 8:55 ` Xiaoyao Li
0 siblings, 2 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-06-26 15:58 UTC (permalink / raw)
To: pbonzini, Adrian Hunter
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Wed, Jun 25, 2025, Sean Christopherson wrote:
> On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
> > Changes in V4:
> >
> > Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> > Use KVM_BUG_ON() instead of WARN_ON().
> > Correct kvm_trylock_all_vcpus() return value.
> >
> > Changes in V3:
> > Refer:
> > https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
> >
> > [...]
>
> Applied to kvm-x86 vmx, thanks!
>
> [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
> https://github.com/kvm-x86/linux/commit/111a7311a016
Fixed up to address a docs goof[*], new hash:
https://github.com/kvm-x86/linux/commit/e4775f57ad51
[*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-06-26 15:58 ` Sean Christopherson
@ 2025-06-26 19:52 ` Adrian Hunter
2025-07-11 8:55 ` Xiaoyao Li
1 sibling, 0 replies; 43+ messages in thread
From: Adrian Hunter @ 2025-06-26 19:52 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, pbonzini, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 26/06/2025 18:58, Sean Christopherson wrote:
> On Wed, Jun 25, 2025, Sean Christopherson wrote:
>> On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
>>> Changes in V4:
>>>
>>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>>> Use KVM_BUG_ON() instead of WARN_ON().
>>> Correct kvm_trylock_all_vcpus() return value.
>>>
>>> Changes in V3:
>>> Refer:
>>> https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>>>
>>> [...]
>>
>> Applied to kvm-x86 vmx, thanks!
>>
>> [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
>> https://github.com/kvm-x86/linux/commit/111a7311a016
>
> Fixed up to address a docs goof[*], new hash:
>
> https://github.com/kvm-x86/linux/commit/e4775f57ad51
>
> [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
Thank you!
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-06-26 15:58 ` Sean Christopherson
2025-06-26 19:52 ` Adrian Hunter
@ 2025-07-11 8:55 ` Xiaoyao Li
2025-07-11 13:05 ` Sean Christopherson
1 sibling, 1 reply; 43+ messages in thread
From: Xiaoyao Li @ 2025-07-11 8:55 UTC (permalink / raw)
To: Sean Christopherson, pbonzini, Adrian Hunter
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, tony.lindgren, binbin.wu, isaku.yamahata,
linux-kernel, yan.y.zhao, chao.gao
On 6/26/2025 11:58 PM, Sean Christopherson wrote:
> On Wed, Jun 25, 2025, Sean Christopherson wrote:
>> On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
>>> Changes in V4:
>>>
>>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>>> Use KVM_BUG_ON() instead of WARN_ON().
>>> Correct kvm_trylock_all_vcpus() return value.
>>>
>>> Changes in V3:
>>> Refer:
>>> https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>>>
>>> [...]
>>
>> Applied to kvm-x86 vmx, thanks!
>>
>> [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
>> https://github.com/kvm-x86/linux/commit/111a7311a016
>
> Fixed up to address a docs goof[*], new hash:
>
> https://github.com/kvm-x86/linux/commit/e4775f57ad51
>
> [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
Hi Sean,
I think it's targeted for v6.17, right?
If so, do we need enumeration for the new TDX ioctl? Yes,
userspace could always try it and ignore the failure. But since the ship
has not sailed, I would like to report it and hear your opinion.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 8:55 ` Xiaoyao Li
@ 2025-07-11 13:05 ` Sean Christopherson
2025-07-11 13:40 ` Xiaoyao Li
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-07-11 13:05 UTC (permalink / raw)
To: Xiaoyao Li
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Fri, Jul 11, 2025, Xiaoyao Li wrote:
> On 6/26/2025 11:58 PM, Sean Christopherson wrote:
> > On Wed, Jun 25, 2025, Sean Christopherson wrote:
> > > On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
> > > > Changes in V4:
> > > >
> > > > Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> > > > Use KVM_BUG_ON() instead of WARN_ON().
> > > > Correct kvm_trylock_all_vcpus() return value.
> > > >
> > > > Changes in V3:
> > > > Refer:
> > > > https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
> > > >
> > > > [...]
> > >
> > > Applied to kvm-x86 vmx, thanks!
> > >
> > > [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
> > > https://github.com/kvm-x86/linux/commit/111a7311a016
> >
> > Fixed up to address a docs goof[*], new hash:
> >
> > https://github.com/kvm-x86/linux/commit/e4775f57ad51
> >
> > [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
>
> Hi Sean,
>
> I think it's targeted for v6.17, right?
>
> If so, do we need the enumeration for the new TDX ioctl? Yes, the userspace
> could always try and ignore the failure. But since the ship has not sailed,
> I would like to report it and hear your opinion.
Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it should be
easy enough to tack on a capability.
This?
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0d961436d0f..dcb879897cab 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -9147,6 +9147,13 @@ KVM exits with the register state of either the L1 or L2 guest
depending on which executed at the time of an exit. Userspace must
take care to differentiate between these cases.
+8.46 KVM_CAP_TDX_TERMINATE_VM
+-----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports the KVM_TDX_TERMINATE_VM sub-ioctl.
+
9. Known KVM API problems
=========================
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b58a74c1722d..e437a50429d3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4823,6 +4823,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_READONLY_MEM:
r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
break;
+ case KVM_CAP_TDX_TERMINATE_VM:
+ r = !!(kvm_caps.supported_vm_types & BIT(KVM_X86_TDX_VM));
+ break;
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7a4c35ff03fe..54293df4a342 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -960,6 +960,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_EL2 240
#define KVM_CAP_ARM_EL2_E2H0 241
#define KVM_CAP_RISCV_MP_STATE_RESET 242
+#define KVM_CAP_TDX_TERMINATE_VM 243
struct kvm_irq_routing_irqchip {
__u32 irqchip;
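If we go this route, the userspace check is just the usual KVM_CHECK_EXTENSION
probe (sketch, assuming the hunk above lands as-is; vm_fd is hypothetical, and
the same headers as any KVM ioctl user apply: <sys/ioctl.h>, <linux/kvm.h>):
--
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TDX_TERMINATE_VM) > 0)
		/* KVM_TDX_TERMINATE_VM is available */;
--
The /dev/kvm fd should report it too, since the check doesn't look at the VM.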
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 13:05 ` Sean Christopherson
@ 2025-07-11 13:40 ` Xiaoyao Li
2025-07-11 14:19 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Xiaoyao Li @ 2025-07-11 13:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 7/11/2025 9:05 PM, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Xiaoyao Li wrote:
>> On 6/26/2025 11:58 PM, Sean Christopherson wrote:
>>> On Wed, Jun 25, 2025, Sean Christopherson wrote:
>>>> On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
>>>>> Changes in V4:
>>>>>
>>>>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>>>>> Use KVM_BUG_ON() instead of WARN_ON().
>>>>> Correct kvm_trylock_all_vcpus() return value.
>>>>>
>>>>> Changes in V3:
>>>>> Refer:
>>>>> https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>>>>>
>>>>> [...]
>>>>
>>>> Applied to kvm-x86 vmx, thanks!
>>>>
>>>> [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
>>>> https://github.com/kvm-x86/linux/commit/111a7311a016
>>>
>>> Fixed up to address a docs goof[*], new hash:
>>>
>>> https://github.com/kvm-x86/linux/commit/e4775f57ad51
>>>
>>> [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
>>
>> Hi Sean,
>>
>> I think it's targeted for v6.17, right?
>>
>> If so, do we need the enumeration for the new TDX ioctl? Yes, the userspace
>> could always try and ignore the failure. But since the ship has not sailed,
>> I would like to report it and hear your opinion.
>
> Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it should be
> easy enough to tack on a capability.
>
> This?
I'm wondering if we need a centralized TDX enumeration interface, e.g., a
new field in struct kvm_tdx_capabilities. I believe there will be more
and more new TDX features, and assigning each a KVM_CAP seems wasteful.
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0d961436d0f..dcb879897cab 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -9147,6 +9147,13 @@ KVM exits with the register state of either the L1 or L2 guest
> depending on which executed at the time of an exit. Userspace must
> take care to differentiate between these cases.
>
> +8.46 KVM_CAP_TDX_TERMINATE_VM
> +-----------------------------
> +
> +:Architectures: x86
> +
> +This capability indicates that KVM supports the KVM_TDX_TERMINATE_VM sub-ioctl.
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b58a74c1722d..e437a50429d3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4823,6 +4823,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_READONLY_MEM:
> r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
> break;
> + case KVM_CAP_TDX_TERMINATE_VM:
> + r = !!(kvm_caps.supported_vm_types & BIT(KVM_X86_TDX_VM));
> + break;
> default:
> break;
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 7a4c35ff03fe..54293df4a342 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -960,6 +960,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_ARM_EL2 240
> #define KVM_CAP_ARM_EL2_E2H0 241
> #define KVM_CAP_RISCV_MP_STATE_RESET 242
> +#define KVM_CAP_TDX_TERMINATE_VM 243
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 13:40 ` Xiaoyao Li
@ 2025-07-11 14:19 ` Sean Christopherson
2025-07-11 22:31 ` Edgecombe, Rick P
` (3 more replies)
0 siblings, 4 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-07-11 14:19 UTC (permalink / raw)
To: Xiaoyao Li
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Fri, Jul 11, 2025, Xiaoyao Li wrote:
> On 7/11/2025 9:05 PM, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Xiaoyao Li wrote:
> > > On 6/26/2025 11:58 PM, Sean Christopherson wrote:
> > > > On Wed, Jun 25, 2025, Sean Christopherson wrote:
> > > > > On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
> > > > > > Changes in V4:
> > > > > >
> > > > > > Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
> > > > > > Use KVM_BUG_ON() instead of WARN_ON().
> > > > > > Correct kvm_trylock_all_vcpus() return value.
> > > > > >
> > > > > > Changes in V3:
> > > > > > Refer:
> > > > > > https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
> > > > > >
> > > > > > [...]
> > > > >
> > > > > Applied to kvm-x86 vmx, thanks!
> > > > >
> > > > > [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
> > > > > https://github.com/kvm-x86/linux/commit/111a7311a016
> > > >
> > > > Fixed up to address a docs goof[*], new hash:
> > > >
> > > > https://github.com/kvm-x86/linux/commit/e4775f57ad51
> > > >
> > > > [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
> > >
> > > Hi Sean,
> > >
> > > I think it's targeted for v6.17, right?
> > >
> > > If so, do we need the enumeration for the new TDX ioctl? Yes, the userspace
> > > could always try and ignore the failure. But since the ship has not sailed,
> > > I would like to report it and hear your opinion.
> >
> > Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it should be
> > easy enough to tack on a capability.
> >
> > This?
>
> I'm wondering if we need a TDX centralized enumeration interface, e.g., new
> field in struct kvm_tdx_capabilities. I believe there will be more and more
> TDX new features, and assigning each a KVM_CAP seems wasteful.
Oh, yeah, that's a much better idea, in addition to not polluting KVM_CAP.
LOL, we certainly have the capacity in the structure:
__u64 reserved[250];
Sans documentation, something like so?
--
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 13da87c05098..70ffe6e8d216 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -963,6 +963,8 @@ struct kvm_tdx_cmd {
__u64 hw_error;
};
+#define KVM_TDX_CAP_TERMINATE_VM _BITULL(0)
+
struct kvm_tdx_capabilities {
__u64 supported_attrs;
__u64 supported_xfam;
@@ -972,7 +974,9 @@ struct kvm_tdx_capabilities {
__u64 kernel_tdvmcallinfo_1_r12;
__u64 user_tdvmcallinfo_1_r12;
- __u64 reserved[250];
+ __u64 supported_caps;
+
+ __u64 reserved[249];
/* Configurable CPUID bits for userspace */
struct kvm_cpuid2 cpuid;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4d4fd5cc6e8..783b1046f6c1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -189,6 +189,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
if (!caps->supported_xfam)
return -EIO;
+ caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
+
caps->cpuid.nent = td_conf->num_cpuid_config;
caps->user_tdvmcallinfo_1_r11 =
--
Aha! And if we squeeze in a patch for 6.16 to zero out the reserved array, we
can even avoid adding a capability to enumerate the TDX capability functionality.
--
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4d4fd5cc6e8..9c2997665762 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -181,6 +181,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
{
int i;
+ memset(caps->reserved, 0, sizeof(caps->reserved));
+
caps->supported_attrs = tdx_get_supported_attrs(td_conf);
if (!caps->supported_attrs)
return -EIO;
--
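And the consumer side would be roughly the below (sketch only: it assumes the
supported_caps field and KVM_TDX_CAP_TERMINATE_VM land as proposed, vm_fd is
hypothetical, and the resize-and-retry dance for cpuid.nent that robust code
would do is skipped):
--
#include <stdbool.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static bool tdx_has_terminate_vm(int vm_fd)
{
	struct kvm_tdx_cmd cmd = { .id = KVM_TDX_CAPABILITIES };
	struct kvm_tdx_capabilities *caps;
	int nent = 128;	/* arbitrary guess; real code sizes this properly */
	bool ret;

	/* calloc() zeroes the buffer, including the reserved area... */
	caps = calloc(1, sizeof(*caps) + nent * sizeof(struct kvm_cpuid_entry2));
	if (!caps)
		return false;

	caps->cpuid.nent = nent;
	cmd.data = (__u64)(unsigned long)caps;

	/*
	 * ...so on an older kernel that doesn't know about supported_caps,
	 * the field reads back as the 0 we wrote and this reports false.
	 */
	ret = !ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) &&
	      (caps->supported_caps & KVM_TDX_CAP_TERMINATE_VM);

	free(caps);
	return ret;
}
--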
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 14:19 ` Sean Christopherson
@ 2025-07-11 22:31 ` Edgecombe, Rick P
2025-07-11 22:54 ` Sean Christopherson
2025-07-11 23:00 ` Edgecombe, Rick P
` (2 subsequent siblings)
3 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11 22:31 UTC (permalink / raw)
To: Li, Xiaoyao, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
Chatre, Reinette, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Hunter, Adrian,
kvm@vger.kernel.org, pbonzini@redhat.com,
tony.lindgren@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y
On Fri, 2025-07-11 at 07:19 -0700, Sean Christopherson wrote:
> > > Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it
> > > should be
> > > easy enough to tack on a capability.
> > >
> > > This?
> >
> > I'm wondering if we need a TDX centralized enumeration interface, e.g., new
> > field in struct kvm_tdx_capabilities. I believe there will be more and more
> > TDX new features, and assigning each a KVM_CAP seems wasteful.
>
> Oh, yeah, that's a much better idea. In addition to not polluting KVM_CAP,
How do you guys see it as wasteful? The highest cap is currently 242. For 32-bit
KVM that leaves 2147483405 caps. If we create an interface we grow some code and
docs, and get 64 additional ones for TDX only.
The fewer interfaces the better, I say, so KVM_CAP_TDX_TERMINATE_VM seems better.
Xiaoyao, is this something QEMU needs? Or more of a completeness kind of thing?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 22:31 ` Edgecombe, Rick P
@ 2025-07-11 22:54 ` Sean Christopherson
2025-07-11 23:04 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-07-11 22:54 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Xiaoyao Li, Chao Gao, Kai Huang, binbin.wu@linux.intel.com,
Reinette Chatre, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Adrian Hunter,
kvm@vger.kernel.org, pbonzini@redhat.com,
tony.lindgren@linux.intel.com, Isaku Yamahata, Yan Y Zhao
On Fri, Jul 11, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-07-11 at 07:19 -0700, Sean Christopherson wrote:
> > > > Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it
> > > > should be
> > > > easy enough to tack on a capability.
> > > >
> > > > This?
> > >
> > > I'm wondering if we need a TDX centralized enumeration interface, e.g., new
> > > field in struct kvm_tdx_capabilities. I believe there will be more and more
> > > TDX new features, and assigning each a KVM_CAP seems wasteful.
> >
> > Oh, yeah, that's a much better idea. In addition to not polluting KVM_CAP,
>
> How do you guys see it as wasteful? The highest cap is currently 242. For 32 bit
> KVM that leaves 2147483405 caps. If we create an interface we grow some code and
> docs, and get 64 additional ones for TDX only.
It bleeds TDX details into arch neutral code.
> The less interfaces the better I say, so KVM_CAP_TDX_TERMINATE_VM seems better.
But we already have KVM_TDX_CAPABILITIES. This isn't really a new interface, it's
a new field in a pre-existing interface.
> Xiaoyao, is this something QEMU needs? Or more of a completeness kind of thing?
Required by VMMs. KVM always needs to be able to enumerate its new features. We
absolutely do not want userspace making guesses based on e.g. kernel version.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 14:19 ` Sean Christopherson
2025-07-11 22:31 ` Edgecombe, Rick P
@ 2025-07-11 23:00 ` Edgecombe, Rick P
2025-07-11 23:05 ` Sean Christopherson
2025-07-16 9:22 ` Xiaoyao Li
2025-07-17 9:14 ` Nikolay Borisov
3 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11 23:00 UTC (permalink / raw)
To: Li, Xiaoyao, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
Chatre, Reinette, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Hunter, Adrian,
kvm@vger.kernel.org, pbonzini@redhat.com,
tony.lindgren@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y
On Fri, 2025-07-11 at 07:19 -0700, Sean Christopherson wrote:
> --
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..9c2997665762 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -181,6 +181,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> {
> int i;
>
> + memset(caps->reserved, 0, sizeof(caps->reserved));
> +
> caps->supported_attrs = tdx_get_supported_attrs(td_conf);
> if (!caps->supported_attrs)
> return -EIO;
> --
I started to try to help by chipping in a log for this, but I couldn't justify
it very well. struct kvm_tdx_capabilities gets copied from userspace before
being populated. So a userspace that knows to look for something in the reserved
area could know to zero it. If they left their own data in the reserved area,
and then relied on that data to remain the same, and then we started setting a
new field in it, I guess it could disturb it. But that is strange, and I'm not
sure it really reduces much risk. Anyway, here is the attempt to justify it.
KVM: TDX: Zero the reserved area in struct kvm_tdx_capabilities
Zero the reserved area in struct kvm_tdx_capabilities so that fields added in
the reserved area won't disturb any userspace that previously had garbage there.
struct kvm_tdx_capabilities holds information about the combined support of KVM
and the TDX module. For future growth, there is an area of the struct marked as
reserved. This way fields can be added into that space without increasing the
size of the struct.
However, currently the reserved area is not zeroed, meaning any data that
userspace left in the reserved area would be clobbered by a future field written
in the reserved area. So zero the reserved area to reduce the risk that
userspace might try to rely on some data there.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 22:54 ` Sean Christopherson
@ 2025-07-11 23:04 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11 23:04 UTC (permalink / raw)
To: seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
linux-kernel@vger.kernel.org, Hunter, Adrian,
kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku,
Zhao, Yan Y, tony.lindgren@linux.intel.com
On Fri, 2025-07-11 at 15:54 -0700, Sean Christopherson wrote:
> > How do you guys see it as wasteful? The highest cap is currently 242. For 32
> > bit
> > KVM that leaves 2147483405 caps. If we create an interface we grow some code
> > and
> > docs, and get 64 additional ones for TDX only.
>
> It bleeds TDX details into arch neutral code.
There are tons of arch specific caps. Can you help me understand this point a
little more? Is TDX special compared to the other arch specific ones?
>
> > The less interfaces the better I say, so KVM_CAP_TDX_TERMINATE_VM seems
> > better.
>
> But we already have KVM_TDX_CAPABILITIES. This isn't really a new interface,
> it's
> a new field in a pre-existing interface.
I guess. It's a new place to check for the same type of information that caps
currently provides. Not a big deal either way to me though.
>
> > Xiaoyao, is this something QEMU needs? Or more of a completeness kind of
> > thing?
>
> Required by VMMs. KVM always needs to be able enumerate its new features. We
> absolutely do not want userspace making guesses based on e.g. kernel version.
Ok.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 23:00 ` Edgecombe, Rick P
@ 2025-07-11 23:05 ` Sean Christopherson
2025-07-11 23:17 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-07-11 23:05 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Xiaoyao Li, Chao Gao, Kai Huang, binbin.wu@linux.intel.com,
Reinette Chatre, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Adrian Hunter,
kvm@vger.kernel.org, pbonzini@redhat.com,
tony.lindgren@linux.intel.com, Isaku Yamahata, Yan Y Zhao
On Fri, Jul 11, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-07-11 at 07:19 -0700, Sean Christopherson wrote:
> > --
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index f4d4fd5cc6e8..9c2997665762 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -181,6 +181,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> > {
> > int i;
> >
> > + memset(caps->reserved, 0, sizeof(caps->reserved));
> > +
> > caps->supported_attrs = tdx_get_supported_attrs(td_conf);
> > if (!caps->supported_attrs)
> > return -EIO;
> > --
>
> I started to try to help by chipping in a log for this, but I couldn't justify
> it very well. struct kvm_tdx_capabilities gets copied from userspace before
> being populated So a userspace that knows to look for something in the reserved
> area could know to zero it. If they left their own data in the reserved area,
> and then relied on that data to remain the same, and then we started setting a
> new field in it I guess it could disturb it. But that is strange, and I'm not
> sure it really reduces much risk. Anyway here is the attempt to justify it.
>
>
> KVM: TDX: Zero reserved reserved area in struct kvm_tdx_capabilities
>
> Zero the reserved area in struct kvm_tdx_capabilities so that fields added in
> the reserved area won't disturb any userspace that previously had garbage there.
It's not only about disturbing userspace, it's also about actually being able to
repurpose the reserved fields in the future without needing *another* flag to tell
userspace that it's ok to read the previously-reserved fields. I care about this
much more than I care about userspace using reserved fields as scratch space.
> struct kvm_tdx_capabilities holds information about the combined support of KVM
> and the TDX module. For future growth, there is an area of the struct marked as
> reserved. This way fields can be added into that space without increasing the
> size of the struct.
>
> However, currently the reserved area is not zeroed, meaning any data that
> userspace left in the reserved area would be clobbered by a future field written
> in the reserved area. So zero the reserved area to reduce the risk that
> userspace might try to rely on some data there.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 23:05 ` Sean Christopherson
@ 2025-07-11 23:17 ` Edgecombe, Rick P
2025-07-14 3:20 ` Xiaoyao Li
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11 23:17 UTC (permalink / raw)
To: seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
linux-kernel@vger.kernel.org, Hunter, Adrian,
kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku,
Zhao, Yan Y, tony.lindgren@linux.intel.com
On Fri, 2025-07-11 at 16:05 -0700, Sean Christopherson wrote:
> > Zero the reserved area in struct kvm_tdx_capabilities so that fields added
> > in
> > the reserved area won't disturb any userspace that previously had garbage
> > there.
>
> It's not only about disturbing userspace, it's also about actually being able
> to repurpose the reserved fields in the future without needing *another* flag
> to tell userspace that it's ok to read the previously-reserved fields. I care
> about this much more than I care about userspace using reserved fields as
> scratch space.
If, before calling KVM_TDX_CAPABILITIES, userspace zeros the new field that it
knows about, but isn't sure if the kernel does, it's the same, no?
Did you see that the way KVM_TDX_CAPABILITIES is implemented today is a little
weird? It actually copies the whole struct kvm_tdx_capabilities from userspace
and then sets some fields (not reserved) and then copies it back. So userspace
can zero any fields it wants to know about before calling KVM_TDX_CAPABILITIES.
Then it could know the same things as if the kernel zeroed it.
I was actually wondering, if we want to change the kernel to zero the reserved
area, whether it might make more sense to just copy the caps->cpuid.nent field
from userspace and then populate the whole thing starting from a zeroed buffer
in the kernel.
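As a rough illustration of the userspace side of that, a minimal sketch might
look like the following. This is only a sketch; vm_fd, the entry count, and the
error handling are assumptions made up for the example, not taken from KVM or
the patch, and it assumes <linux/kvm.h>, <asm/kvm.h>, <sys/ioctl.h>, <stdlib.h>
and <err.h>:
--
        /* Sketch only: hand KVM a fully zeroed buffer and let it fill it in. */
        size_t nent = 64;       /* assumed >= the TDX module's CPUID config count */
        size_t size = sizeof(struct kvm_tdx_capabilities) +
                      nent * sizeof(struct kvm_cpuid_entry2);
        struct kvm_tdx_capabilities *caps = calloc(1, size); /* zeroes reserved[] too */
        struct kvm_tdx_cmd cmd = {
                .id = KVM_TDX_CAPABILITIES,
                .data = (__u64)(unsigned long)caps,
        };

        if (!caps)
                err(1, "calloc");

        caps->cpuid.nent = nent;

        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
                err(1, "KVM_TDX_CAPABILITIES");

        /* Any field userspace zeroed and KVM never wrote reads back as 0. */
--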
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 23:17 ` Edgecombe, Rick P
@ 2025-07-14 3:20 ` Xiaoyao Li
2025-07-14 13:56 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Xiaoyao Li @ 2025-07-14 3:20 UTC (permalink / raw)
To: Edgecombe, Rick P, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, Hunter, Adrian,
kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku,
Zhao, Yan Y, tony.lindgren@linux.intel.com
On 7/12/2025 7:17 AM, Edgecombe, Rick P wrote:
> On Fri, 2025-07-11 at 16:05 -0700, Sean Christopherson wrote:
>>> Zero the reserved area in struct kvm_tdx_capabilities so that fields added
>>> in
>>> the reserved area won't disturb any userspace that previously had garbage
>>> there.
>>
>> It's not only about disturbing userspace, it's also about actually being able
>> to repurpose the reserved fields in the future without needing *another* flag
>> to tell userspace that it's ok to read the previously-reserved fields. I care
>> about this much more than I care about userspace using reserved fields as
>> scratch space.
>
> If, before calling KVM_TDX_CAPABILITIES, userspace zeros the new field that it
> knows about, but isn't sure if the kernel does, it's the same no?
>
> Did you see that the way KVM_TDX_CAPABILITIES is implemented today is a little
> weird? It actually copies the whole struct kvm_tdx_capabilities from userspace
> and then sets some fields (not reserved) and then copies it back. So userspace
> can zero any fields it wants to know about before calling KVM_TDX_CAPABILITIES.
> Then it could know the same things as if the kernel zeroed it.
>
> I was actually wondering, if we want to change the kernel to zero the reserved
> area, whether it might make more sense to just copy the caps->cpuid.nent field
> from userspace and then populate the whole thing starting from a zeroed buffer
> in the kernel.
+1 to zero the whole buffer of *caps in the kernel.
The current code seems to have an issue with
caps->kernel_tdvmcallinfo_1_r11/kernel_tdvmcallinfo_1_r12/user_tdvmcallinfo_1_r12,
as KVM cannot guarantee zeroed values are returned to userspace.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-14 3:20 ` Xiaoyao Li
@ 2025-07-14 13:56 ` Sean Christopherson
2025-07-14 15:06 ` Xiaoyao Li
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-07-14 13:56 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Rick P Edgecombe, Chao Gao, Kai Huang, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, Adrian Hunter,
kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Reinette Chatre, Isaku Yamahata, Yan Y Zhao,
tony.lindgren@linux.intel.com
On Mon, Jul 14, 2025, Xiaoyao Li wrote:
> On 7/12/2025 7:17 AM, Edgecombe, Rick P wrote:
> > On Fri, 2025-07-11 at 16:05 -0700, Sean Christopherson wrote:
> > > > Zero the reserved area in struct kvm_tdx_capabilities so that fields added
> > > > in
> > > > the reserved area won't disturb any userspace that previously had garbage
> > > > there.
> > >
> > > It's not only about disturbing userspace, it's also about actually being able
> > > to repurpose the reserved fields in the future without needing *another* flag
> > > to tell userspace that it's ok to read the previously-reserved fields. I care
> > > about this much more than I care about userspace using reserved fields as
> > > scratch space.
> >
> > If, before calling KVM_TDX_CAPABILITIES, userspace zeros the new field that it
> > knows about, but isn't sure if the kernel does, it's the same no?
Heh, yeah, this crossed my mind about 5 minutes after I logged off :-)
> > Did you see that the way KVM_TDX_CAPABILITIES is implemented today is a little
> > weird? It actually copies the whole struct kvm_tdx_capabilities from userspace
> > and then sets some fields (not reserved) and then copies it back. So userspace
> > can zero any fields it wants to know about before calling KVM_TDX_CAPABILITIES.
> > Then it could know the same things as if the kernel zeroed it.
> >
> > I was actually wondering, if we want to change the kernel to zero the reserved
> > area, whether it might make more sense to just copy the caps->cpuid.nent field
> > from userspace and then populate the whole thing starting from a zeroed buffer
> > in the kernel.
>
> +1 to zero the whole buffer of *caps in the kernel.
Ya, I almost suggested that, but assumed there was a reason for copying the entire
structure.
> The current code seems to have an issue with caps->kernel_tdvmcallinfo_1_r11/kernel_tdvmcallinfo_1_r12/user_tdvmcallinfo_1_r12,
> as KVM cannot guarantee zeroed values are returned to userspace.
This? (untested)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4d4fd5cc6e8..42cb328d8a7d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2270,25 +2270,26 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
struct kvm_tdx_capabilities __user *user_caps;
struct kvm_tdx_capabilities *caps = NULL;
+ u32 nr_user_entries;
int ret = 0;
/* flags is reserved for future use */
if (cmd->flags)
return -EINVAL;
- caps = kmalloc(sizeof(*caps) +
+ caps = kzalloc(sizeof(*caps) +
sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
GFP_KERNEL);
if (!caps)
return -ENOMEM;
user_caps = u64_to_user_ptr(cmd->data);
- if (copy_from_user(caps, user_caps, sizeof(*caps))) {
+ if (get_user(nr_user_entries, &user_caps->cpuid.nent)) {
ret = -EFAULT;
goto out;
}
- if (caps->cpuid.nent < td_conf->num_cpuid_config) {
+ if (nr_user_entries < td_conf->num_cpuid_config) {
ret = -E2BIG;
goto out;
}
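For readers without the full file handy, the resulting shape of
tdx_get_capabilities() would be roughly as follows. This is only a sketch pieced
together from the hunk above; in particular the init_kvm_tdx_caps() call and the
copy-back tail are assumed from context and are not part of the diff:
--
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
        const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
        struct kvm_tdx_capabilities __user *user_caps;
        struct kvm_tdx_capabilities *caps = NULL;
        u32 nr_user_entries;
        int ret = 0;

        /* flags is reserved for future use */
        if (cmd->flags)
                return -EINVAL;

        /* Zeroed allocation: reserved[] and anything KVM skips reads back as 0. */
        caps = kzalloc(sizeof(*caps) +
                       sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
                       GFP_KERNEL);
        if (!caps)
                return -ENOMEM;

        user_caps = u64_to_user_ptr(cmd->data);

        /* Only the entry count is read from userspace now. */
        if (get_user(nr_user_entries, &user_caps->cpuid.nent)) {
                ret = -EFAULT;
                goto out;
        }

        if (nr_user_entries < td_conf->num_cpuid_config) {
                ret = -E2BIG;
                goto out;
        }

        /* Populate the zeroed buffer, then copy the whole thing back (assumed). */
        ret = init_kvm_tdx_caps(td_conf, caps);
        if (ret)
                goto out;

        if (copy_to_user(user_caps, caps, sizeof(*caps) +
                         sizeof(struct kvm_cpuid_entry2) * caps->cpuid.nent))
                ret = -EFAULT;
out:
        kfree(caps);
        return ret;
}
--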
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-14 13:56 ` Sean Christopherson
@ 2025-07-14 15:06 ` Xiaoyao Li
0 siblings, 0 replies; 43+ messages in thread
From: Xiaoyao Li @ 2025-07-14 15:06 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, Chao Gao, Kai Huang, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, Adrian Hunter,
kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Reinette Chatre, Isaku Yamahata, Yan Y Zhao,
tony.lindgren@linux.intel.com
On 7/14/2025 9:56 PM, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Xiaoyao Li wrote:
>> On 7/12/2025 7:17 AM, Edgecombe, Rick P wrote:
>>> On Fri, 2025-07-11 at 16:05 -0700, Sean Christopherson wrote:
>>>>> Zero the reserved area in struct kvm_tdx_capabilities so that fields added
>>>>> in
>>>>> the reserved area won't disturb any userspace that previously had garbage
>>>>> there.
>>>>
>>>> It's not only about disturbing userspace, it's also about actually being able
>>>> to repurpose the reserved fields in the future without needing *another* flag
>>>> to tell userspace that it's ok to read the previously-reserved fields. I care
>>>> about this much more than I care about userspace using reserved fields as
>>>> scratch space.
>>>
>>> If, before calling KVM_TDX_CAPABILITIES, userspace zeros the new field that it
>>> knows about, but isn't sure if the kernel does, it's the same no?
>
> Heh, yeah, this crossed my mind about 5 minutes after I logged off :-)
>
>>> Did you see that the way KVM_TDX_CAPABILITIES is implemented today is a little
>>> weird? It actually copies the whole struct kvm_tdx_capabilities from userspace
>>> and then sets some fields (not reserved) and then copies it back. So userspace
>>> can zero any fields it wants to know about before calling KVM_TDX_CAPABILITIES.
>>> Then it could know the same things as if the kernel zeroed it.
>>>
>>> I was actually wondering, if we want to change the kernel to zero the reserved
>>> area, whether it might make more sense to just copy the caps->cpuid.nent field
>>> from userspace and then populate the whole thing starting from a zeroed buffer
>>> in the kernel.
>>
>> +1 to zero the whole buffer of *caps in the kernel.
>
> Ya, I almost suggested that, but assumed there was a reason for copying the entire
> structure.
>
>> The current code seems to have an issue with caps->kernel_tdvmcallinfo_1_r11/kernel_tdvmcallinfo_1_r12/user_tdvmcallinfo_1_r12,
>> as KVM cannot guarantee zeroed values are returned to userspace.
>
> This? (untested)
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..42cb328d8a7d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2270,25 +2270,26 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
> const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
> struct kvm_tdx_capabilities __user *user_caps;
> struct kvm_tdx_capabilities *caps = NULL;
> + u32 nr_user_entries;
> int ret = 0;
>
> /* flags is reserved for future use */
> if (cmd->flags)
> return -EINVAL;
>
> - caps = kmalloc(sizeof(*caps) +
> + caps = kzalloc(sizeof(*caps) +
> sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
> GFP_KERNEL);
> if (!caps)
> return -ENOMEM;
>
> user_caps = u64_to_user_ptr(cmd->data);
> - if (copy_from_user(caps, user_caps, sizeof(*caps))) {
> + if (get_user(nr_user_entries, &user_caps->cpuid.nent)) {
> ret = -EFAULT;
> goto out;
> }
>
> - if (caps->cpuid.nent < td_conf->num_cpuid_config) {
> + if (nr_user_entries < td_conf->num_cpuid_config) {
> ret = -E2BIG;
> goto out;
> }
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 14:19 ` Sean Christopherson
2025-07-11 22:31 ` Edgecombe, Rick P
2025-07-11 23:00 ` Edgecombe, Rick P
@ 2025-07-16 9:22 ` Xiaoyao Li
2025-07-18 15:35 ` Sean Christopherson
2025-07-17 9:14 ` Nikolay Borisov
3 siblings, 1 reply; 43+ messages in thread
From: Xiaoyao Li @ 2025-07-16 9:22 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 7/11/2025 10:19 PM, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Xiaoyao Li wrote:
...
>>
>> I'm wondering if we need a TDX centralized enumeration interface, e.g., new
>> field in struct kvm_tdx_capabilities. I believe there will be more and more
>> TDX new features, and assigning each a KVM_CAP seems wasteful.
>
> Oh, yeah, that's a much better idea. In addition to not polluting KVM_CAP,
>
> LOL, and we certainly have the capacity in the structure:
>
> __u64 reserved[250];
>
> Sans documentation, something like so?
I suppose it will be squashed into the original patch, so just gave
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
> --
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 13da87c05098..70ffe6e8d216 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -963,6 +963,8 @@ struct kvm_tdx_cmd {
> __u64 hw_error;
> };
>
> +#define KVM_TDX_CAP_TERMINATE_VM _BITULL(0)
> +
> struct kvm_tdx_capabilities {
> __u64 supported_attrs;
> __u64 supported_xfam;
> @@ -972,7 +974,9 @@ struct kvm_tdx_capabilities {
> __u64 kernel_tdvmcallinfo_1_r12;
> __u64 user_tdvmcallinfo_1_r12;
>
> - __u64 reserved[250];
> + __u64 supported_caps;
> +
> + __u64 reserved[249];
>
> /* Configurable CPUID bits for userspace */
> struct kvm_cpuid2 cpuid;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..783b1046f6c1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -189,6 +189,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> if (!caps->supported_xfam)
> return -EIO;
>
> + caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
> +
> caps->cpuid.nent = td_conf->num_cpuid_config;
>
> caps->user_tdvmcallinfo_1_r11 =
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-11 14:19 ` Sean Christopherson
` (2 preceding siblings ...)
2025-07-16 9:22 ` Xiaoyao Li
@ 2025-07-17 9:14 ` Nikolay Borisov
2025-07-18 14:36 ` Sean Christopherson
3 siblings, 1 reply; 43+ messages in thread
From: Nikolay Borisov @ 2025-07-17 9:14 UTC (permalink / raw)
To: Sean Christopherson, Xiaoyao Li
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On 11.07.25 г. 17:19 ч., Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Xiaoyao Li wrote:
>> On 7/11/2025 9:05 PM, Sean Christopherson wrote:
>>> On Fri, Jul 11, 2025, Xiaoyao Li wrote:
>>>> On 6/26/2025 11:58 PM, Sean Christopherson wrote:
>>>>> On Wed, Jun 25, 2025, Sean Christopherson wrote:
>>>>>> On Wed, 11 Jun 2025 12:51:57 +0300, Adrian Hunter wrote:
>>>>>>> Changes in V4:
>>>>>>>
>>>>>>> Drop TDX_FLUSHVP_NOT_DONE change. It will be done separately.
>>>>>>> Use KVM_BUG_ON() instead of WARN_ON().
>>>>>>> Correct kvm_trylock_all_vcpus() return value.
>>>>>>>
>>>>>>> Changes in V3:
>>>>>>> Refer:
>>>>>>> https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> Applied to kvm-x86 vmx, thanks!
>>>>>>
>>>>>> [1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
>>>>>> https://github.com/kvm-x86/linux/commit/111a7311a016
>>>>>
>>>>> Fixed up to address a docs goof[*], new hash:
>>>>>
>>>>> https://github.com/kvm-x86/linux/commit/e4775f57ad51
>>>>>
>>>>> [*] https://lore.kernel.org/all/20250626171004.7a1a024b@canb.auug.org.au
>>>>
>>>> Hi Sean,
>>>>
>>>> I think it's targeted for v6.17, right?
>>>>
>>>> If so, do we need the enumeration for the new TDX ioctl? Yes, the userspace
>>>> could always try and ignore the failure. But since the ship has not sailed,
>>>> I would like to report it and hear your opinion.
>>>
>>> Bugger, you're right. It's sitting at the top of 'kvm-x86 vmx', so it should be
>>> easy enough to tack on a capability.
>>>
>>> This?
>>
>> I'm wondering if we need a TDX centralized enumeration interface, e.g., new
>> field in struct kvm_tdx_capabilities. I believe there will be more and more
>> TDX new features, and assigning each a KVM_CAP seems wasteful.
>
> Oh, yeah, that's a much better idea. In addition to not polluting KVM_CAP,
>
> LOL, and we certainly have the capacity in the structure:
>
> __u64 reserved[250];
>
> Sans documentation, something like so?
>
> --
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 13da87c05098..70ffe6e8d216 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -963,6 +963,8 @@ struct kvm_tdx_cmd {
> __u64 hw_error;
> };
>
> +#define KVM_TDX_CAP_TERMINATE_VM _BITULL(0)
> +
> struct kvm_tdx_capabilities {
> __u64 supported_attrs;
> __u64 supported_xfam;
> @@ -972,7 +974,9 @@ struct kvm_tdx_capabilities {
> __u64 kernel_tdvmcallinfo_1_r12;
> __u64 user_tdvmcallinfo_1_r12;
>
> - __u64 reserved[250];
> + __u64 supported_caps;
> +
> + __u64 reserved[249];
>
> /* Configurable CPUID bits for userspace */
> struct kvm_cpuid2 cpuid;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..783b1046f6c1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -189,6 +189,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> if (!caps->supported_xfam)
> return -EIO;
>
> + caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
nit: For the sake of consistency make that a |= so that all subsequent
additions to it will be uniform with the first.
> +
> caps->cpuid.nent = td_conf->num_cpuid_config;
>
> caps->user_tdvmcallinfo_1_r11 =
> --
>
>
> Aha! And if we squeeze in a patch for 6.16. to zero out the reserved array, we
> can even avoid adding a capability to enumerate the TDX capability functionality.
>
> --
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..9c2997665762 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -181,6 +181,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> {
> int i;
>
> + memset(caps->reserved, 0, sizeof(caps->reserved));
> +
> caps->supported_attrs = tdx_get_supported_attrs(td_conf);
> if (!caps->supported_attrs)
> return -EIO;
> --
>
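For context, a sketch of how userspace might consume the new bit. vm_fd, the
caps buffer, and the ioctl plumbing are assumptions for the example; the point
is only that, with the zeroing fix above, an older kernel leaves supported_caps
reading as 0 and the check below simply falls through:
--
        /* caps previously filled in by KVM_TDX_CAPABILITIES into a zeroed buffer. */
        if (caps->supported_caps & KVM_TDX_CAP_TERMINATE_VM) {
                struct kvm_tdx_cmd term = { .id = KVM_TDX_TERMINATE_VM };

                /* Release the HKID early so private memory can be reclaimed directly. */
                if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &term))
                        err(1, "KVM_TDX_TERMINATE_VM");
        } else {
                /* Older kernel: fall back to the normal (slower) teardown path. */
        }
--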
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-17 9:14 ` Nikolay Borisov
@ 2025-07-18 14:36 ` Sean Christopherson
0 siblings, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-07-18 14:36 UTC (permalink / raw)
To: Nikolay Borisov
Cc: Xiaoyao Li, pbonzini, Adrian Hunter, kvm, rick.p.edgecombe,
kirill.shutemov, kai.huang, reinette.chatre, tony.lindgren,
binbin.wu, isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Thu, Jul 17, 2025, Nikolay Borisov wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index f4d4fd5cc6e8..783b1046f6c1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -189,6 +189,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
> > if (!caps->supported_xfam)
> > return -EIO;
> > + caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
>
> nit: For the sake of consistency make that a |= so that all subsequent
> additions to it will be uniform with the first.
Objection, speculation, your honor. :-D
That assumes that the predominant pattern will be "|=". But if we end up with a
collection of capabilities that are unconditionally enumerated by KVM, then I
definitely want to express that as:
caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM |
KVM_TDX_CAP_FANCY_THING_1 |
KVM_TDX_CAP_FANCY_THING_2 |
KVM_TDX_CAP_FANCY_THING_3;
not as
caps->supported_caps |= KVM_TDX_CAP_TERMINATE_VM;
caps->supported_caps |= KVM_TDX_CAP_FANCY_THING1;
caps->supported_caps |= KVM_TDX_CAP_FANCY_THING2;
caps->supported_caps |= KVM_TDX_CAP_FANCY_THING3;
I find the former to be much easier to read, and it provides some amount of
defense-in-depth against uninitialized data. The downside is that if there are
conditional capabilities, then we need to ensure that they are added after the
unconditional set is initialized. But we absolutely should be able to detect
such bugs via selftests. And again, I find that this:
caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM |
KVM_TDX_CAP_FANCY_THING_1 |
KVM_TDX_CAP_FANCY_THING_2 |
KVM_TDX_CAP_FANCY_THING_3;
if (i_can_has_cheezburger())
caps->supported_caps |= KVM_TDX_CAP_CHEEZBURGER;
makes it easy to identify which capabilities are unconditional versus those that
are dependent on something else.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time
2025-07-16 9:22 ` Xiaoyao Li
@ 2025-07-18 15:35 ` Sean Christopherson
0 siblings, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2025-07-18 15:35 UTC (permalink / raw)
To: Xiaoyao Li
Cc: pbonzini, Adrian Hunter, kvm, rick.p.edgecombe, kirill.shutemov,
kai.huang, reinette.chatre, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao
On Wed, Jul 16, 2025, Xiaoyao Li wrote:
> On 7/11/2025 10:19 PM, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Xiaoyao Li wrote:
> ...
> > >
> > > I'm wondering if we need a TDX centralized enumeration interface, e.g., new
> > > field in struct kvm_tdx_capabilities. I believe there will be more and more
> > > TDX new features, and assigning each a KVM_CAP seems wasteful.
> >
> > Oh, yeah, that's a much better idea. In addition to not polluting KVM_CAP,
> >
> > LOL, and we certainly have the capacity in the structure:
> >
> > __u64 reserved[250];
> >
> > Sans documentation, something like so?
>
> I suppose it will be squashed into the original patch,
I dropped the commit from kvm-x86/vmx, and will send a full v5. There are enough
moving parts that I don't want to risk going 'round in circles trying to squash
fixes :-)
> so just gave
>
> Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Thanks!
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2025-07-18 15:35 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-11 9:51 [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Adrian Hunter
2025-06-11 9:51 ` [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Adrian Hunter
2025-06-16 3:40 ` Vishal Annapurve
2025-06-18 5:50 ` Adrian Hunter
2025-06-18 6:00 ` Vishal Annapurve
2025-06-18 8:33 ` Adrian Hunter
2025-06-19 0:33 ` Sean Christopherson
2025-06-19 11:12 ` Adrian Hunter
2025-06-20 14:24 ` Sean Christopherson
2025-06-20 16:14 ` Vishal Annapurve
2025-06-20 16:26 ` Sean Christopherson
2025-06-23 20:36 ` Vishal Annapurve
2025-06-23 21:39 ` Sean Christopherson
2025-06-23 23:35 ` Vishal Annapurve
2025-06-20 18:59 ` Edgecombe, Rick P
2025-06-20 21:21 ` Vishal Annapurve
2025-06-20 23:34 ` Edgecombe, Rick P
2025-06-21 3:00 ` Vishal Annapurve
2025-06-23 16:23 ` Edgecombe, Rick P
2025-06-23 20:22 ` Vishal Annapurve
2025-06-23 22:51 ` Edgecombe, Rick P
2025-06-18 22:07 ` Edgecombe, Rick P
2025-06-23 20:40 ` Vishal Annapurve
2025-06-25 22:25 ` [PATCH V4 0/1] KVM: TDX: Decrease TDX VM shutdown time Sean Christopherson
2025-06-26 15:58 ` Sean Christopherson
2025-06-26 19:52 ` Adrian Hunter
2025-07-11 8:55 ` Xiaoyao Li
2025-07-11 13:05 ` Sean Christopherson
2025-07-11 13:40 ` Xiaoyao Li
2025-07-11 14:19 ` Sean Christopherson
2025-07-11 22:31 ` Edgecombe, Rick P
2025-07-11 22:54 ` Sean Christopherson
2025-07-11 23:04 ` Edgecombe, Rick P
2025-07-11 23:00 ` Edgecombe, Rick P
2025-07-11 23:05 ` Sean Christopherson
2025-07-11 23:17 ` Edgecombe, Rick P
2025-07-14 3:20 ` Xiaoyao Li
2025-07-14 13:56 ` Sean Christopherson
2025-07-14 15:06 ` Xiaoyao Li
2025-07-16 9:22 ` Xiaoyao Li
2025-07-18 15:35 ` Sean Christopherson
2025-07-17 9:14 ` Nikolay Borisov
2025-07-18 14:36 ` Sean Christopherson