Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Binbin Wu @ 2026-06-08  8:03 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
> 
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
> 
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/lib/kvm_util.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9a29540fff40..1bab7d76a59c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -484,8 +484,10 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
>  						 nr_extra_pages);
>  	struct userspace_mem_region *slot0;
> +	u64 gmem_flags = 0;
>  	struct kvm_vm *vm;
> -	int i, flags;
> +	int flags = 0;
> +	int i;
>  
>  	kvm_set_files_rlimit(nr_runnable_vcpus);
>  
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	vm = ____vm_create(shape);
>  
>  	/*
> -	 * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> -	 * for CoCo VMs that require GUEST_MEMFD backed private memory.
> +	 * Force GUEST_MEMFD for the primary memory region if necessary, and
> +	 * initialize it as shared so the selftest framework can populate it
> +	 * exactly like other memory providers.
>  	 */
> -	flags = 0;
> -	if (is_guest_memfd_required(shape))
> +	if (is_guest_memfd_required(shape)) {
>  		flags |= KVM_MEM_GUEST_MEMFD;
> +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +	}
>  
> -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);

The build failed due to this:

lib/kvm_util.c: In function ‘__vm_create’:
lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
  507 |         vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
      |         ^~~~~~~~~~
In file included from lib/kvm_util.c:9:
include/kvm_util.h:714:6: note: declared here
  714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
      |      ^~~~~~~~~~
lib/kvm_util.c: At top level:
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics

It seems the patch set doesn't wire gmem_flags parameter to vm_mem_add().

>  	for (i = 0; i < NR_MEM_REGIONS; i++)
>  		vm->memslots[i] = 0;
>  
> 


^ permalink raw reply

* Re: [PATCH v13 12/22] KVM: selftests: Back the first memory region with guest_memfd for TDX
From: Binbin Wu @ 2026-06-08  7:31 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-12-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
> Force GUEST_MEMFD for the primary memory region of TDX VMs.
> 
> TDX must use guest_memfd for private pages as there is no alternative
> mechanism supported by the TDX architecture.

This sounds a bit confusing to me.
It's the kernel/KVM only support guest_memfd for private pages.
The restriction is not from the "TDX architecture".

Also, the short log is also a bit confusing.
Why describe the "first memory region" here?

> 
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index d1befa3f4b30..9a29540fff40 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -472,7 +472,7 @@ void kvm_set_files_rlimit(u32 nr_vcpus)
>  static bool is_guest_memfd_required(struct vm_shape shape)
>  {
>  #ifdef __x86_64__
> -	return shape.type == KVM_X86_SNP_VM;
> +	return (shape.type == KVM_X86_SNP_VM || shape.type == KVM_X86_TDX_VM);
>  #else
>  	return false;
>  #endif
> 


^ permalink raw reply

* Re: [PATCH v13 11/22] KVM: selftests: Set up TDX boot parameters region
From: Binbin Wu @ 2026-06-08  7:23 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-11-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]> +void tdx_vm_load_common_boot_parameters(struct kvm_vm *vm)
> +{
> +	struct td_boot_parameters *params =
> +		addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA);
> +	u32 cr4;
> +
> +	TEST_ASSERT_EQ(vm->mode, VM_MODE_PXXVYY_4K);
> +
> +	cr4 = kvm_get_default_cr4();
> +	if (vm->mmu.pgtable_levels == 5)
> +		cr4 |= X86_CR4_LA57;

This should be removed after the 5-level paging code is added back in
kvm_get_default_cr4().



^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-08  6:54 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <5c06bdbc-c847-4dfa-9c89-afbb52946b8f@intel.com>

On Sat, Jun 06, 2026 at 09:36:41PM -0700, Kishen Maloor wrote:
> On 5/21/26 8:41 PM, Xu Yilun wrote:
> > ...
> > This series has 2 distinct parts:
> > 
> >    Patches  1-4:  TDX Module Extensions enabling
> >    Patches  5-15: DICE-based TDX Quoting, primarily Peter's work.
> > 
> Perhaps the extensions enabling patches could be organized more simply as
> these three?
> 
> 1. Add TDX extensions metadata structure and accessor
> 2. Add TDH.EXT.MEM.ADD
> 3. Add TDH.EXT.INIT and wire extensions init into init_tdx_module()
> 
> This introduces the SEAMCALLs and lets the wiring land with the patch
> that completes the init flow, avoiding a separate "enable" patch.

Yes, several comments point to a same concern for patch organization - no
need a separate "enable" patch. Also a more sound justfication to me is,
the Extension will not actually been enabled until an add-on feature is
explicitly configured (See patch #15). So we could add steps in nature
order without worrying the incomplete flow breaks the kernel.

My reordering is:

 1. Add a placeholder for Extension initialization to hook into
    init_tdx_module(). Give a chance to explain the considerations of
    the enable-at-boot-up policy.

 2. Detect if Extension is required based on the metadata, if no, skip.
    So no side effect for following steps.

 3. Add TDH.EXT.MEM.ADD

 4. Add TDH.EXT.INIT
> 

^ permalink raw reply

* Re: [PATCH v13 09/22] KVM: selftests: Expose functions to get default sregs values
From: Binbin Wu @ 2026-06-08  6:39 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-9-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:

[...]

> +
> +static inline u64 kvm_get_default_cr4(void)
> +{
> +	u64 cr4 = X86_CR4_PAE | X86_CR4_OSFXSR;
> +
> +	if (kvm_cpu_has(X86_FEATURE_XSAVE))
> +		cr4 |= X86_CR4_OSXSAVE;
> +	return cr4;
> +}
> +

[...]

> @@ -647,16 +643,12 @@ static void vcpu_init_sregs(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
>  	vcpu_sregs_get(vcpu, &sregs);
>  
>  	sregs.idt.base = vm->arch.idt;
> -	sregs.idt.limit = NUM_INTERRUPTS * sizeof(struct idt_entry) - 1;
> +	sregs.idt.limit = kvm_get_default_idt_limit();
>  	sregs.gdt.base = vm->arch.gdt;
> -	sregs.gdt.limit = getpagesize() - 1;
> -
> -	sregs.cr0 = X86_CR0_PE | X86_CR0_NE | X86_CR0_PG;
> -	sregs.cr4 |= X86_CR4_PAE | X86_CR4_OSFXSR;
> -	if (kvm_cpu_has(X86_FEATURE_XSAVE))
> -		sregs.cr4 |= X86_CR4_OSXSAVE;
> -	if (vm->mmu.pgtable_levels == 5)
> -		sregs.cr4 |= X86_CR4_LA57;

I guess the 5-level paging thing is dropped unexpectedly during rebase?


> +	sregs.gdt.limit = kvm_get_default_gdt_limit();
> 
> +	sregs.cr0 = kvm_get_default_cr0();
> +	sregs.cr4 |= kvm_get_default_cr4();
>  	sregs.efer |= (EFER_LME | EFER_LMA | EFER_NX);
>  
>  	kvm_seg_set_unusable(&sregs.ldt);
> 


^ permalink raw reply

* Re: [PATCH v13 06/22] tools: include: Add kbuild.h for assembly structure offsets
From: Binbin Wu @ 2026-06-08  6:12 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-6-6983ae4c3a4d@google.com>



On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sagi Shahar <sagis@google.com>
> 
> Add the Kbuild macros needed to enable the filechk_offsets mechanism to
> generate C header files containing structure member offset information.
> 
> Tools depending on assembly code that operate on structures have to
> hardcode the offsets of structure members. The Kbuild infrastructure
> can instead generate C header files with these offsets automatically,
> allowing them to be included in assembly code as symbolic constants.
> 
> For example, the TDX guest boot code requires access to parameters
> passed in the C structure(struct td_boot_parameters). This header
                           ^
Nit: missing a space. 

> provides the macros needed to extract these offsets from C code and
> expose them to assembly, ensuring the two remain synchronized.
> 
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  tools/include/linux/kbuild.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/tools/include/linux/kbuild.h b/tools/include/linux/kbuild.h
> new file mode 100644
> index 000000000000..957fd55cd159
> --- /dev/null
> +++ b/tools/include/linux/kbuild.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __TOOLS_LINUX_KBUILD_H
> +#define __TOOLS_LINUX_KBUILD_H
> +
> +#define DEFINE(sym, val) \
> +	asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))
> +
> +#define OFFSET(sym, str, mem) \
> +	DEFINE(sym, __builtin_offsetof(struct str, mem))
> +
> +#endif /* __TOOLS_LINUX_KBUILD_H */
> 


^ permalink raw reply

* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Binbin Wu @ 2026-06-08  5:57 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-3-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]

> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> index f647e6ca6b34..48d4bd36c35b 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -11,4 +11,34 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
>  	return vm->type == KVM_X86_TDX_VM;
>  }
>  
> +/*
> + * TDX ioctls
> + * Use underscores to avoid collisions with struct member names.
> + */
> +#define __tdx_vm_ioctl(vm, cmd, _flags, arg)				\
> +({									\
> +	int r;								\
> +									\
> +	union {								\
> +		struct kvm_tdx_cmd c;					\
> +		unsigned long raw;					\
> +	} tdx_cmd = { .c = {						\
> +		.id = (cmd),						\
> +		.flags = (u32)(_flags),				\
> +		.data = (u64)(arg),				\

Nit:
The two lines' backslashes are misaligned.

> +	} };								\
> +									\
> +	r = __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw);	\
> +	r ?: tdx_cmd.c.hw_error;					\
> +})
> +
> +#define tdx_vm_ioctl(vm, cmd, flags, arg)				\
> +({									\
> +	int ret = __tdx_vm_ioctl(vm, cmd, flags, arg);			\

tdx_cmd.c.hw_error is u64 and it could be assigned to ret, which is a int,
the upper bits could be truncated if the upper 32-bit is set.

> +									\
> +	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd,	ret, vm);		\
> +})
> +
> +void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> +
>  #endif /* SELFTESTS_TDX_TDX_UTIL_H */[...]

^ permalink raw reply

* Re: [PATCH v6 00/11] Dynamic PAMT
From: Tony Lindgren @ 2026-06-08  5:45 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang
In-Reply-To: <20260526023515.288829-1-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:04PM -0700, Rick Edgecombe wrote:
> For a simple small server with mostly physical contiguous RAM and no CXL
> complications, the basic implementation should be close to optimal anyway.
> And for big servers, an 8GB allocation is going to have less impact. In
> the end Dynamic PAMT *is* an optimization that we will force on as a
> good default option. Even with all the optimizations we could throw at it,
> if the system is 100% TDs, Dynamic PAMT could come out slightly behind. So
> judgment on good defaults is needed regardless.

From usage point of view it's not just a memory optimization though.
These patches make it easier to see what gets allocated for TDX IMO.

This based on rebasing other patches on the dynamic PAMT series a few
times over the past year. So for the series:

Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>

^ permalink raw reply

* Re: [PATCH v6 03/11] x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
From: Yan Zhao @ 2026-06-08  2:18 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Rick Edgecombe, bp, dave.hansen, hpa, kas, kvm, linux-coco,
	linux-doc, linux-kernel, mingo, nik.borisov, pbonzini, seanjc,
	tglx, vannapurve, x86, chao.gao, kai.huang, Kirill A. Shutemov
In-Reply-To: <50566572-6379-4100-8845-404f695e59cd@linux.intel.com>

On Mon, Jun 08, 2026 at 10:11:58AM +0800, Binbin Wu wrote:
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 82dc27aecf297..74e75db5728c7 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -37,6 +37,7 @@
> >  
> >  #include <uapi/asm/mce.h>
> >  #include <asm/tdx_global_metadata.h>
> > +#include <linux/mm.h>
> 
> I think the header is not needed here.
Right. This version does not invoke page_address() in tdx.h for
tdx_alloc_control_page() any more.

Also no need to include mm.h for tdx.c (which has invoked page_address() before
this patch), since tdx.c includes memblock.h which further includes mm.h.

^ permalink raw reply

* Re: [PATCH v6 03/11] x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
From: Binbin Wu @ 2026-06-08  2:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang,
	Kirill A. Shutemov
In-Reply-To: <20260526023515.288829-4-rick.p.edgecombe@intel.com>

On 5/26/2026 10:35 AM, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Add helpers to use when allocating or preparing pages that are handed to
> the TDX-Module for use as control/S-EPT pages, and thus need Dynamic PAMT
> adjustments.
> 
> The TDX module tracks some state for each page of physical memory that it
> might use. It calls this state the PAMT. It includes separate state for
> each page size a physical page could be utilized at within the TDX module
> (1GB, 2MB, 4KB). In Dynamic PAMT, only the 4KB page size state is
> allocated dynamically. So for pages that TDX will use as 2MB physically
> contiguous pages, Dynamic PAMT backing is not needed.
> 
> KVM will need to hand pages to the TDX module that it will use at 4KB
> granularity. So these pages will need Dynamic PAMT backing added before
> they are used by the TDX module, and removed afterwards.
> 
> Add tdx_alloc_control_page() and tdx_free_control_page() to handle both
> page allocation and Dynamic PAMT installation. Make them behave like
> normal alloc/free functions where allocation can fail in the case of no
> memory, but free (with any necessary Dynamic PAMT release) always
> succeeds. Do this so they can support the existing TDX flows that require
> teardowns to succeed.
> 
> Also create tdx_pamt_get/put() to handle installing Dynamic PAMT 4KB
> backing for pages that are already allocated (such as KVM's use of S-EPT
> page tables or guest private memory). Have them take a pfn instead of a
> struct page, as future changes will want to use these helpers for guest
> pages which are tracked by PFN.
> 
> Don't CLFLUSH the Dynamic PAMT pages handed to the TDX module, as is done
> for some other SEAMCALLs, as the TDX docs specify that this is only
> needed on "TD private memory or TD control structure page".
> 
> Since these allocations will be easily user triggerable, account the
> memory.
> 
> Leave logic to handle concurrency issues for future changes.
> 
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7 Sashiko:claude-opus-4-6
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

One comment below.


> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 82dc27aecf297..74e75db5728c7 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -37,6 +37,7 @@
>  
>  #include <uapi/asm/mce.h>
>  #include <asm/tdx_global_metadata.h>
> +#include <linux/mm.h>

I think the header is not needed here.

>  #include <linux/pgtable.h>
>  
>  /*
> @@ -160,6 +161,12 @@ void tdx_guest_keyid_free(unsigned int keyid);
>  
>  void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
>  
> +/* Number PAMT pages to be provided to TDX module per 2MB region of PA */
> +#define TDX_DPAMT_ENTRY_PAGE_CNT 2
> +
> +struct page *tdx_alloc_control_page(void);
> +void tdx_free_control_page(struct page *page);
> +
>  struct tdx_td {
>  	/* TD root structure: */
>  	struct page *tdr_page;
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9ebd192cb5c17..9e0812d87ab06 100644[...]

^ permalink raw reply

* Re: [RFC PATCH 15/15] x86/virt/tdx: Enable TDX Quoting extension
From: Kishen Maloor @ 2026-06-07  4:41 UTC (permalink / raw)
  To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
  Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-16-yilun.xu@linux.intel.com>

On 5/21/26 8:41 PM, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Enable the TDX Quoting feature via TDH.SYS.CONFIG when supported by the
> TDX module.
> 
> The TDX Quoting extension generates TDX attestation Quotes via a
> SEAMCALL, without using a discrete Quoting engine.
> 
> TDX Module supports add-on TDX features (e.g. TDX Quoting & TDX Module
> Extensions) that should be manually enabled by host. It extends
> TDH.SYS.CONFIG for host to choose to enable them on bootup.
> 
> Call TDH.SYS.CONFIG with a new bitmap input parameter to specify which
> features to enable. The bitmap uses the same definitions as
> TDX_FEATURES0. But note not all bits in TDX_FEATURES0 are valid for
> configuration, e.g. TDX Module Extensions is a service that supports TDX
> Quoting, it is implicitly enabled when TDX Quoting is enabled. Setting
> TDX_FEATURES0_EXT in the bitmap has no effect.
> 
> TDX Module advances the version of TDH.SYS.CONFIG for the change, so
> use the latest version (v1) for add-on feature enabling. But supporting
> existing Modules which only support v0 is still necessary until they are
> deprecated. In fact, it is unlikely that TDH.SYS.CONFIG ever needs to
> change again and the code would stay in v1. So there is little value
> in worrying about deprecating v0 to save a couple lines of code in 5-7
> years when these original TDX platforms sunset.
> 
> TDX Module updates global metadata when add-on features are enabled.
> Host should update the cached tdx_sysinfo to reflect these changes.
> 
> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.h |  4 +++-
>   arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++++++++++--
>   2 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 10aff23cd01f..524a14c01aa6 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -58,7 +58,8 @@
>   #define TDH_PHYMEM_CACHE_WB		40
>   #define TDH_PHYMEM_PAGE_WBINVD		41
>   #define TDH_VP_WR			43
> -#define TDH_SYS_CONFIG			45
> +#define TDH_SYS_CONFIG_V0		45

Is it necessary to add _Vx macros when multiple versions can co-exist?

Just wondering if it would be cleaner in the following way?

- Leave the macros set at the current (non-deprecated) baseline version.
- Select vX using SEAMCALL_LEAF_VER() in config_tdx_module() when a vX feature
   is enabled.

u64 seamcall_fn = TDH_SYS_CONFIG;
...
if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE) {
...
     seamcall_fn = SEAMCALL_LEAF_VER(TDH_SYS_CONFIG, 1);

> +#define TDH_SYS_CONFIG			SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
>   #define TDH_EXT_INIT			60
>   #define TDH_EXT_MEM_ADD			61
>   #define TDH_SYS_DISABLE			69
> @@ -97,6 +98,7 @@ struct tdmr_info {
>   /* Bit definitions of TDX_FEATURES0 metadata field */
>   #define TDX_FEATURES0_NO_RBP_MOD	BIT(18)
>   #define TDX_FEATURES0_EXT		BIT_ULL(39)
> +#define TDX_FEATURES0_QUOTE		BIT_ULL(50)
>   
>   /*
>    * Do not put any hardware-defined TDX structure representations below
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f7600f930c6e..86e5b7ad19b3 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1049,6 +1049,7 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
>   static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
>   				    u64 global_keyid)
>   {
> +	u64 seamcall_fn = TDH_SYS_CONFIG_V0;
>   	struct tdx_module_args args = {};
>   	u64 *tdmr_pa_array;
>   	size_t array_sz;
> @@ -1074,8 +1075,22 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
>   	args.rcx = __pa(tdmr_pa_array);
>   	args.rdx = tdmr_list->nr_consumed_tdmrs;
>   	args.r8 = global_keyid;
> -	ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
>   
> +	if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE) {
> +		args.r9 |= TDX_FEATURES0_QUOTE;
> +		/* These parameters require version >= 1 */
> +		seamcall_fn = TDH_SYS_CONFIG;
> +	}
> +
> +	ret = seamcall_prerr(seamcall_fn, &args);
> +	if (ret)
> +		goto free_tdmr;
> +
> +	/* enabling TDX Quoting may change tdx_sysinfo, update it */
> +	if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE)
> +		ret = get_tdx_sys_info(&tdx_sysinfo);
> +
> +free_tdmr:
>   	/* Free the array as it is not required anymore. */
>   	kfree(tdmr_pa_array);
>   
> @@ -1384,12 +1399,17 @@ static void tdx_quote_init(void)
>   	unsigned int nr_quote_pages;
>   	u64 r;
>   
> +	if (!(tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE))
> +		return;
> +
>   	do {
>   		r = seamcall(TDH_QUOTE_INIT, &args);
>   	} while (r == TDX_INTERRUPTED_RESUMABLE);
>   
> -	if (r)
> +	if (r) {
> +		pr_err("Failed to enable quoting extension: 0x%llx\n", r);
>   		return;
> +	}
>   
>   	/* Quoting metadata is valid only after initialization */
>   	if (get_tdx_sys_info_quote(&tdx_sysinfo.quote))


^ permalink raw reply

* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Kishen Maloor @ 2026-06-07  4:38 UTC (permalink / raw)
  To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
  Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-5-yilun.xu@linux.intel.com>

On 5/21/26 8:41 PM, Xu Yilun wrote:
> The detailed initialization flow for TDX Module Extensions has been
> fully implemented. Enable the flow after basic TDX Module
> initialization.
> 
> Theoretically, the Extensions doesn't need to be enabled right after
> basic TDX initialization. It could be enabled right before the first
> Extension SEAMCALL is issued. That would save or postpone memory usage.
> But it isn't worth the complexity, the needs for the Extensions are vast
> but the savings are little for a typical TDX capable system (about
> 0.001% of memory). So the Linux decision is to just enable it along with
> the basic TDX.
> 
> Note that the Extensions initialization flow will still not start if no
> add-on features require Extensions. The enabling of add-on features will
> be in later patches. Until then, the system hasn't consumed extra memory.
> 
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.c | 16 ++++++++++------
>   1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ff2b96c20d2b..dad5ec642723 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1180,7 +1180,7 @@ static __init int init_tdmrs(struct tdmr_info_list *tdmr_list)
>   	return 0;
>   }
>   
> -static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> +static __init void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
>   {
>   	u64 *entries = page_to_virt(root);
>   	int i;
> @@ -1193,7 +1193,7 @@ static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
>   #define HPA_LIST_INFO_PFN		GENMASK_U64(51, 12)
>   #define HPA_LIST_INFO_LAST_ENTRY	GENMASK_U64(63, 55)
>   
> -static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> +static __init u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
>   {
>   	return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
>   	       FIELD_PREP(HPA_LIST_INFO_PFN, page_to_pfn(root)) |
> @@ -1201,7 +1201,7 @@ static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
>   }
>   
>   /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */
> -static int tdx_ext_init(void)
> +static __init int tdx_ext_init(void)
>   {
>   	struct tdx_module_args args = {};
>   	u64 r;
> @@ -1216,7 +1216,7 @@ static int tdx_ext_init(void)
>   	return 0;
>   }
>   
> -static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> +static __init int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
>   {
>   	struct tdx_module_args args = {
>   		.rcx = to_hpa_list_info(root, nr_pages),
> @@ -1240,7 +1240,7 @@ static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
>   	return 0;
>   }
>   
> -static int tdx_ext_mem_setup(void)
> +static __init int tdx_ext_mem_setup(void)
>   {
>   	unsigned int nr_pages;
>   	struct page *page;
> @@ -1301,7 +1301,7 @@ static int tdx_ext_mem_setup(void)
>   	return ret;
>   }
>   
> -static int __maybe_unused init_tdx_ext(void)
> +static __init int init_tdx_ext(void)
>   {
>   	int ret;
>   
> @@ -1373,6 +1373,10 @@ static __init int init_tdx_module(void)
>   	if (ret)
>   		goto err_reset_pamts;
>   
> +	ret = init_tdx_ext();
> +	if (ret)
> +		goto err_reset_pamts;

Is it a reasonable policy to fail TDX bringup entirely upon failing
initialization of extensions (which are "add-on features")?

The handling of tdx_quote_init() in Patch 6 suggests a more
best-effort approach.

> +
>   	pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
>   
>   out_put_tdxmem:


^ permalink raw reply

* Re: [PATCH 02/15] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Kishen Maloor @ 2026-06-07  4:38 UTC (permalink / raw)
  To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
  Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-3-yilun.xu@linux.intel.com>

On 5/21/26 8:41 PM, Xu Yilun wrote:
> TDX Module introduces a new concept called "TDX Module Extensions" to
> support long running / hard-irq preemptible flows inside. This makes TDX
> Module capable of handling complex tasks through "Extension SEAMCALLs".
> Adding more memory to TDX Module is the first step to enable Extensions.
> 
> Currently, TDX Module memory use is relatively static. But, the
> Extensions need to use memory more dynamically. While 'static' here
> means the kernel provides necessary amount of memory to TDX Module for
> its basic functionalities, 'dynamic' means extra memory is needed only
> if new add-on features are to be enabled. So add a new memory feeding
> process backed by a new SEAMCALL TDH.EXT.MEM.ADD.
> 
> The process is mostly the same as adding PAMT. The kernel queries TDX
> Module how much memory needed, allocates it, hands it over, and never
> gets it back.
> 
> TDH.EXT.MEM.ADD uses a new parameter type HPA_LIST_INFO to provide
> control (private) pages to TDX Module. This type represents a list of
> pages for TDX Module to access. It needs a 'root page' which contains
> the list of HPAs of the pages. It collapses the HPA of the root page
> and the number of valid HPAs into a 64 bit raw value for SEAMCALL
> parameters. The root page is always a medium, TDX Module never keeps
> the root page.
> 
> Introduce a tdx_clflush_hpa_list() helper to flush shared cache before
> SEAMCALL, to avoid shared cache writeback damaging these private pages.
> 
> For now, TDX Module Extensions consumes relatively large amount of
> memory (~50MB). Use contiguous page allocation to avoid permanently
> fragment too much memory. Print the allocation amount on TDX Module
> Extensions initialization for visibility.
> 
> Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.h |   1 +
>   arch/x86/virt/vmx/tdx/tdx.c | 118 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 119 insertions(+)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index a5eec8e3cc71..2335f88bbb10 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,7 @@
>   #define TDH_PHYMEM_PAGE_WBINVD		41
>   #define TDH_VP_WR			43
>   #define TDH_SYS_CONFIG			45
> +#define TDH_EXT_MEM_ADD			61
>   #define TDH_SYS_DISABLE			69
>   
>   /*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c0c6281b08a5..622399d8da68 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -31,6 +31,7 @@
>   #include <linux/syscore_ops.h>
>   #include <linux/idr.h>
>   #include <linux/kvm_types.h>
> +#include <linux/bitfield.h>
>   #include <asm/page.h>
>   #include <asm/special_insns.h>
>   #include <asm/msr-index.h>
> @@ -1179,6 +1180,123 @@ static __init int init_tdmrs(struct tdmr_info_list *tdmr_list)
>   	return 0;
>   }
>   
> +static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> +{
> +	u64 *entries = page_to_virt(root);
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		clflush_cache_range(__va(entries[i]), PAGE_SIZE);
> +}
> +
> +#define HPA_LIST_INFO_FIRST_ENTRY	GENMASK_U64(11, 3)
> +#define HPA_LIST_INFO_PFN		GENMASK_U64(51, 12)
> +#define HPA_LIST_INFO_LAST_ENTRY	GENMASK_U64(63, 55)
> +
> +static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> +{
> +	return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
> +	       FIELD_PREP(HPA_LIST_INFO_PFN, page_to_pfn(root)) |
> +	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, nr_pages - 1);
> +}
> +
> +static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = to_hpa_list_info(root, nr_pages),
> +	};
> +	u64 r;
> +
> +	tdx_clflush_hpa_list(root, nr_pages);
> +
> +	do {
> +		/*
> +		 * TDH_EXT_MEM_ADD is designed to use output parameter RCX to
> +		 * override/update input parameter RCX, so the caller doesn't
> +		 * have to do manual parameter update on retry call.
> +		 */
> +		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> +	} while (r == TDX_INTERRUPTED_RESUMABLE);

The retry loop compares the full return value against TDX_INTERRUPTED_RESUMABLE. Should
it mask with TDX_SEAMCALL_STATUS_MASK first, in case the module sets any
lower detail bits?

Ditto for TDH.EXT.INIT in patch 3.

> +
> +	if (r != TDX_SUCCESS)
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int tdx_ext_mem_setup(void)
> +{
> +	unsigned int nr_pages;
> +	struct page *page;
> +	u64 *root;
> +	unsigned int i;
> +	int ret;
> +
> +	nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
> +	/*
> +	 * memory_pool_required_pages == 0 means no need to add pages,
> +	 * skip the memory setup.
> +	 */
> +	if (!nr_pages)
> +		return 0;
> +
> +	root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!root)
> +		return -ENOMEM;
> +
> +	page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
> +				  &node_online_map);

The SEAMCALL takes a scatter list (HPA_LIST_INFO), so the module
doesn't require contiguity. If the goal is just to avoid scattering
pages across many 2MB regions, maybe dense, 2MB-aligned allocations should
achieve that without a single pool-wide contiguous block.

> +	if (!page) {
> +		ret = -ENOMEM;
> +		goto out_free_root;
> +	}
> +
> +	for (i = 0; i < nr_pages;) {
> +		unsigned int nents = min(nr_pages - i,
> +					 PAGE_SIZE / sizeof(*root));
> +		int j;
> +
> +		for (j = 0; j < nents; j++)
> +			root[j] = page_to_phys(page + i + j);

Would it be better to allocate per-batch (i.e. one root page's worth
at a time) rather than the whole pool up front?

That way an intermediate TDH.EXT.MEM.ADD failure wouldn't leak
all nr_pages. Also, a batch is up to 512 pages (= 2MB) and its allocation
could be 2MB-aligned, addressing your fragmentation concern.

> +
> +		ret = tdx_ext_mem_add(virt_to_page(root), nents);
> +		/*
> +		 * No SEAMCALLs to reclaim the added pages. For simple error
> +		 * handling, leak all pages.
> +		 */
> +		WARN_ON_ONCE(ret);
> +		if (ret)
> +			break;
> +
> +		i += nents;
> +	}
> +
> +	/*
> +	 * Extensions memory can't be reclaimed once added, print out the
> +	 * amount, stop tracking it and free the root page, no matter success
> +	 * or failure.
> +	 */
> +	pr_info("%lu KB allocated for TDX Module Extensions\n",
> +		nr_pages * PAGE_SIZE / 1024);
> +
> +out_free_root:
> +	kfree(root);
> +
> +	return ret;
> +}
> +
> +static int __maybe_unused init_tdx_ext(void)

Could this be named init_tdx_extensions() instead to disambiguate
from tdx_ext_init() in patch 3?

> +{
> +	if (!(tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_EXT))
> +		return 0;
> +
> +	/* No feature requires TDX Module Extensions. */
> +	if (!tdx_sysinfo.ext.ext_required)
> +		return 0;
> +
> +	return tdx_ext_mem_setup();
> +}
> +
>   static __init int init_tdx_module(void)
>   {
>   	int ret;


^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Kishen Maloor @ 2026-06-07  4:36 UTC (permalink / raw)
  To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
  Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-1-yilun.xu@linux.intel.com>

On 5/21/26 8:41 PM, Xu Yilun wrote:
> ...
> This series has 2 distinct parts:
> 
>    Patches  1-4:  TDX Module Extensions enabling
>    Patches  5-15: DICE-based TDX Quoting, primarily Peter's work.
> 
Perhaps the extensions enabling patches could be organized more simply as
these three?

1. Add TDX extensions metadata structure and accessor
2. Add TDH.EXT.MEM.ADD
3. Add TDH.EXT.INIT and wire extensions init into init_tdx_module()

This introduces the SEAMCALLs and lets the wiring land with the patch
that completes the init flow, avoiding a separate "enable" patch.


^ permalink raw reply

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: David Woodhouse @ 2026-06-06 10:52 UTC (permalink / raw)
  To: Thomas Gleixner, Sean Christopherson, Paolo Bonzini, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
	Michael Kelley
In-Reply-To: <877boc554l.ffs@fw13>

[-- Attachment #1: Type: text/plain, Size: 3487 bytes --]

On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> 
> > Now that all paravirt code that explicitly specifies the TSC frequency
> > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > 
> > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > line parameter"), one of the goals of the param is to allow the refined
> > calibration work "to do meaningful error checking".
> > 
> > Note, preferring the user-provided TSC frequency over the frequency from
> > the hypervisor or trusted firmware, while simultaneously not treating the
> > user-provided frequency as gospel, is obviously incongruous.  Sweep the
> > problem under the rug for now to avoid opening a big can of worms that
> > likely doesn't have a great answer.
> 
> There is a good answer I think.
> 
> early_tsc_khz exists to cater for the overclocking crowd. On their
> modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> matching reality anymore. So they work around that by supplying a close
> enough tsc_early_khz and then they let the refined calibration work
> figure it out.
> 
> Arguably that's only relevant for bare metal systems and what's worse is
> that in virtual environments the refined calibration work can fail,
> which renders the TSC unstable.
> 
> So I'd rather say we change this logic to:
> 
>    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>       tsc_khz = x86_init.....();
>       force(X86_FEATURE_TSC_KNOWN_FREQ);
>    } else if (tsc_khz_early) {
>       ....
>    } else {
>       ...
>    }
> 
> Along with:
> 
>    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>       if (tsc_khz_early)
>          pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> 
> or something daft like that.
> 
> The kernel has for various reasons always tried to cater for the needs
> of users who are plagued by bonkers firmware, but we have to stop to
> prioritize or treating equal ancient and modded out of spec hardware.
> 
> TBH, I consider that whole KVM clock nonsense to fall into the modded
> out of spec hardware realm. Do a reality check:
> 
>    How many production systems are out there still which run VMs on CPUs
>    with a broken TSC and the lack of VM TSC scaling?
> 
> I'm not saying that we should not support the few remaining systems
> anymore, but our tendency to pretend that we can keep all of this
> nonsense working and at the same time making progress is just a fallacy.

I don't know that we can take the KVM (and Xen) clock away from guests,
but all of the *horrid* part about it is the way it attempts to cope
with the possibility that the *host* timekeeping might flip away from
TSC-based mode at any point in time. By the end of my outstanding
cleanup series, that is the *only* thing the gtod_notifier remains for.

If we can trust the hardware *and* the host kernel, then KVM could
theoretically hardwire the kvmclock into 'master clock mode' where it
basically just advertises the TSC→kvmclock relationship *once* to all
CPUs and it never changes.

All the nonsense about updating it every time we enter a CPU could just
go away completely.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-06 10:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-11-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:

> Now that all paravirt code that explicitly specifies the TSC frequency
> also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
>
> Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> line parameter"), one of the goals of the param is to allow the refined
> calibration work "to do meaningful error checking".
>
> Note, preferring the user-provided TSC frequency over the frequency from
> the hypervisor or trusted firmware, while simultaneously not treating the
> user-provided frequency as gospel, is obviously incongruous.  Sweep the
> problem under the rug for now to avoid opening a big can of worms that
> likely doesn't have a great answer.

There is a good answer I think.

early_tsc_khz exists to cater for the overclocking crowd. On their
modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
matching reality anymore. So they work around that by supplying a close
enough tsc_early_khz and then they let the refined calibration work
figure it out.

Arguably that's only relevant for bare metal systems and what's worse is
that in virtual environments the refined calibration work can fail,
which renders the TSC unstable.

So I'd rather say we change this logic to:

   if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
      tsc_khz = x86_init.....();
      force(X86_FEATURE_TSC_KNOWN_FREQ);
   } else if (tsc_khz_early) {
      ....
   } else {
      ...
   }

Along with:

   if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
      if (tsc_khz_early)
         pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");

or something daft like that.

The kernel has for various reasons always tried to cater for the needs
of users who are plagued by bonkers firmware, but we have to stop to
prioritize or treating equal ancient and modded out of spec hardware.

TBH, I consider that whole KVM clock nonsense to fall into the modded
out of spec hardware realm. Do a reality check:

   How many production systems are out there still which run VMs on CPUs
   with a broken TSC and the lack of VM TSC scaling?

I'm not saying that we should not support the few remaining systems
anymore, but our tendency to pretend that we can keep all of this
nonsense working and at the same time making progress is just a fallacy.

I rather want to have a more fine grained differentiation and
prioritization of:

  1) The actual real world relevant use cases which run on contemporary
     hardware.

  2) Still relevant use cases on slightly older hardware with less
     capabilities

  3) Broken firmware

  4) Modded out of spec nonsense

  5) Support for ancient museums pieces

Thanks,

        tglx


^ permalink raw reply

* Re: [PATCH v6 01/20] s390: Expose protected virtualization through cc_platform_has()
From: JAEHOON KIM @ 2026-06-06  0:34 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
	linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Halil Pasic,
	Matthew Rosato
In-Reply-To: <20260604083959.1265923-2-aneesh.kumar@kernel.org>

On 6/4/2026 3:39 AM, Aneesh Kumar K.V (Arm) wrote:
> Protected virtualization guests use memory encryption, so advertise that to
> the rest of the kernel through cc_platform_has(CC_ATTR_MEM_ENCRYPT).
>
> s390 already forces DMA mappings to be unencrypted for protected
> virtualization guests through force_dma_unencrypted(). Add
> ARCH_HAS_CC_PLATFORM and provide the matching cc_platform_has()
> implementation
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

Tested-by: Jaehoon Kim <jhkim@linux.ibm.com>

Tested on s390 PV guest with swiotlb_dynamic configuration. SWIOTLB
bounce buffer allocation and dynamic pool management work correctly.
Also concurrent I/O stress completed successfully.

Thanks,
Jaehoon.

> ---
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Jaehoon  Kim <jhkim@linux.ibm.com>
> ---
>   arch/s390/Kconfig   |  1 +
>   arch/s390/mm/init.c | 14 ++++++++++++++
>   2 files changed, 15 insertions(+)
>
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index ecbcbb781e40..9b5e6029e043 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -87,6 +87,7 @@ config S390
>   	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>   	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>   	select ARCH_HAS_CC_CAN_LINK
> +	select ARCH_HAS_CC_PLATFORM
>   	select ARCH_HAS_CPU_FINALIZE_INIT
>   	select ARCH_HAS_CURRENT_STACK_POINTER
>   	select ARCH_HAS_DEBUG_VIRTUAL
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 1f72efc2a579..ad3c6d92b801 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -50,6 +50,7 @@
>   #include <linux/virtio_anchor.h>
>   #include <linux/virtio_config.h>
>   #include <linux/execmem.h>
> +#include <linux/cc_platform.h>
>   
>   pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(".bss..swapper_pg_dir");
>   pgd_t invalid_pg_dir[PTRS_PER_PGD] __section(".bss..invalid_pg_dir");
> @@ -140,6 +141,19 @@ bool force_dma_unencrypted(struct device *dev)
>   	return is_prot_virt_guest();
>   }
>   
> +
> +bool cc_platform_has(enum cc_attr attr)
> +{
> +	switch (attr) {
> +	case CC_ATTR_MEM_ENCRYPT:
> +		return is_prot_virt_guest();
> +
> +	default:
> +		return false;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(cc_platform_has);
> +
>   /* protected virtualization */
>   static void __init pv_init(void)
>   {



^ permalink raw reply

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-05 20:48 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgF09SfDm=OgbrS8-wpfxbNecQkqAQwf1ELq1jWu7NjbUA@mail.gmail.com>

On Fri, Jun 05, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> >
> > [...snip...]
> >
> >> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
> >> instead?
> >>
> >> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
> >> __vm_create_with_vcpus().
> >>
> >> While building this POC to test conversions [1] I only wanted to create
> >> the vm and vcpus and didn't want to finalize yet, since I still needed
> >> to do more mappings in the guest (and I needed the vm pointer to do
> >> mappings in the guest).
> >
> > Hmm, I would argue this is a flaw in the selftests infrastructure.  IMO, as a
> > developer, it's quite surprising that the current value of a global variable
> > doesn't show up in the VM automagically.  I totally understand why selftests
> > work that way, but it's certainly odd and annoying.  If _that_ were solved, then
> > the kludginess of what you're doing goes away.
> >
> > The other way this could be solved is by adding support for annotating globals
> > with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
> > into the VM can automatically mark the associated globals' pages as shared.
> >
> 
> More generally, is your opinion that tests should not have to add extra
> memslots?

I don't care?  What I care about is making it as easy and intuitive as possible
for people to write tests, and to minimize maintenance costs.

> If I wanted a shared page, would I have to do
> 
>   static __shared test_page[4096] = {0};
> 
> and then rely on ELF loading to put that in the guest for me? Are there
> some compiler flags/how will I require that test_page be page aligned?

Compilere and linker shenanigans.

> If I mark 10 globals as __shared, would the compiler automatically
> consolidate the shared memory together?

Yes, follow the __bss_decrypted breadcrumbs.

  #define __bss_decrypted __section(".bss..decrypted")

> I think it's a bit constraining to require that all guest memory be set
> up statically. It's nice to have but I'd like another option...

You do have options, they just require more work.

> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
> finalizing shouldn't be disallowed that option.

What does that have to do with finalizing the VM?

> >> It's also possible to have some kvm_vm_finalize() call that can be
> >> explicitly and manually invoked from selftests just for CoCo selftests.
> >
> > Why bother?  It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.
> 
> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
> is the right function where the TD is finalized?
> 
> For tests that need to do more setup after creating a vm, is the only
> way out to call __vm_create() then vm_vcpu_add() to avoid premature
> finalization in __vm_create_with_vcpus() when
> kvm_arch_vm_finalize_vcpus() is called?

Depends on what you're doing.  Sometimes, the answer will be yes.  That's why
there are "low level" APIs, so that some tests can do fancy things, while most
tests can leave the details to the infrastructure.

If there's a recurring problem, or we anticipate one, then we can and should
figure out how to minimize the pain so that tests don't have to deal with the
same boilerplate issues over and over.  Hence the __shared idea.

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aiMPxl5vkvJDldi9@google.com>

On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
>> > frequency.
>> 
>> That's misleading because fixed frequency means that the frequency does
>> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
>> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
>> from CPUID/MSRs.
>
> Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
> phrase this.
>
>> > In practice, this is likely one big nop, as re-calibration is
>> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
>> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>> 
>> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
>> frequency drivers, which means that's absolutely not interesting and
>> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
>> set on those systems.
>
> It _shouldn't_ be set on those systems, but in the world of virtualization it's
> not completely impossible.
>
>> IOW, this patch is pointless voodoo ware.
>
> Would y'all be opposed to adding a WARN?  I don't actually care about P4 or K7
> CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
> recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
> readers that don't already know that in practice, it's limited to ancient CPUs.
>
> In other words, the point is to document expectations and mutual exclusion, not
> to "fix" anything.

Fair enough.

So yes, having a check there for actually X86_FEATURE_CONSTANT_TSC
(X86_FEATURE_CONSTANT_TSC is not interesting) and emitting a warning and
returning early is the right thing to do there.

But we also should have a check in the TSC init code somewhere which
validates that X86_FEATURE_CONSTANT_TSC is set when
X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
w/o X86_FEATURE_CONSTANT_TSC.

Thanks,

        tglx



^ permalink raw reply

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-05 18:27 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgHz5GDjq0GqRmpQdHc-X45gCNr39VYWZH-T7XhPEtN5CQ@mail.gmail.com>

On Thu, Jun 04, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> + KVM: selftests: Test conversion with elevated page refcount
> >>     + Askar pointed out that soon vmsplice may not pin pages. Should I
> >>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
> >>       take a dependency on CONFIG_GUP_TEST.
> >
> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> > it probably is the least awful choice.  E.g. KVM also pins pages is certain flows,
> > but we're _also_ actively working to remove the need to pin.
> >
> > Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
> > memory" syscall.
> >
> 
> Hmm that takes a dependency on io_uring, which isn't always compiled
> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
> CONFIG_GUP_TEST.

Or try both?  If it's not a ridiculous amount of work.

^ permalink raw reply

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-05 18:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <aiMOWkiVBqoQDAPd@google.com>

Sean Christopherson <seanjc@google.com> writes:

>
> [...snip...]
>
>> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
>> instead?
>>
>> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
>> __vm_create_with_vcpus().
>>
>> While building this POC to test conversions [1] I only wanted to create
>> the vm and vcpus and didn't want to finalize yet, since I still needed
>> to do more mappings in the guest (and I needed the vm pointer to do
>> mappings in the guest).
>
> Hmm, I would argue this is a flaw in the selftests infrastructure.  IMO, as a
> developer, it's quite surprising that the current value of a global variable
> doesn't show up in the VM automagically.  I totally understand why selftests
> work that way, but it's certainly odd and annoying.  If _that_ were solved, then
> the kludginess of what you're doing goes away.
>
> The other way this could be solved is by adding support for annotating globals
> with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
> into the VM can automatically mark the associated globals' pages as shared.
>

More generally, is your opinion that tests should not have to add extra
memslots?

If I wanted a shared page, would I have to do

  static __shared test_page[4096] = {0};

and then rely on ELF loading to put that in the guest for me? Are there
some compiler flags/how will I require that test_page be page aligned?

If I mark 10 globals as __shared, would the compiler automatically
consolidate the shared memory together?

I think it's a bit constraining to require that all guest memory be set
up statically. It's nice to have but I'd like another option...

Many tests use vm_userspace_mem_region_add(), CoCo tests that require
finalizing shouldn't be disallowed that option.

>> Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
>> too magical?
>
> Yes.
>
>> It's also possible to have some kvm_vm_finalize() call that can be
>> explicitly and manually invoked from selftests just for CoCo selftests.
>
> Why bother?  It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.

Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
is the right function where the TD is finalized?

For tests that need to do more setup after creating a vm, is the only
way out to call __vm_create() then vm_vcpu_add() to avoid premature
finalization in __vm_create_with_vcpus() when
kvm_arch_vm_finalize_vcpus() is called?

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-06-05 18:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <87fr315fq9.ffs@fw13>

On Fri, Jun 05, 2026, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> > frequency.
> 
> That's misleading because fixed frequency means that the frequency does
> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
> from CPUID/MSRs.

Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
phrase this.

> > In practice, this is likely one big nop, as re-calibration is
> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
> 
> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
> frequency drivers, which means that's absolutely not interesting and
> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
> set on those systems.

It _shouldn't_ be set on those systems, but in the world of virtualization it's
not completely impossible.

> IOW, this patch is pointless voodoo ware.

Would y'all be opposed to adding a WARN?  I don't actually care about P4 or K7
CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
readers that don't already know that in practice, it's limited to ancient CPUs.

In other words, the point is to document expectations and mutual exclusion, not
to "fix" anything.

^ permalink raw reply

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-05 17:58 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgHEq-hHcTivUw8TYBeMu-RpS=Ho4DXaNXKQKLPL_biTgg@mail.gmail.com>

On Fri, Jun 05, 2026, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
> 
> > From: Sagi Shahar <sagis@google.com>
> >
> > Finalize TDX VM after creation to make it runnable.
> >
> > Signed-off-by: Sagi Shahar <sagis@google.com>
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Lisa Wang <wyihan@google.com>
> > ---
> >  tools/testing/selftests/kvm/lib/x86/processor.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> > index d84c629a1945..842cac168e99 100644
> > --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> > +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> > @@ -1479,6 +1479,12 @@ bool kvm_arch_has_default_irqchip(void)
> >  	return true;
> >  }
> >
> > +void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> > +{
> > +	if (is_tdx_vm(vm))
> > +		tdx_vm_finalize(vm);
> > +}
> > +
> 
> This doesn't necessarily block this series, we could (re)move this
> later: I'm not sure if kvm_arch_vm_finalize_vcpus() is the correct place
> to be finalizing the VM.
>
> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
> instead?
> 
> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
> __vm_create_with_vcpus().
> 
> While building this POC to test conversions [1] I only wanted to create
> the vm and vcpus and didn't want to finalize yet, since I still needed
> to do more mappings in the guest (and I needed the vm pointer to do
> mappings in the guest).

Hmm, I would argue this is a flaw in the selftests infrastructure.  IMO, as a
developer, it's quite surprising that the current value of a global variable
doesn't show up in the VM automagically.  I totally understand why selftests
work that way, but it's certainly odd and annoying.  If _that_ were solved, then
the kludginess of what you're doing goes away.

The other way this could be solved is by adding support for annotating globals
with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
into the VM can automatically mark the associated globals' pages as shared.

> Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
> too magical?

Yes.

> It's also possible to have some kvm_vm_finalize() call that can be
> explicitly and manually invoked from selftests just for CoCo selftests.

Why bother?  It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.

^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Dave Hansen @ 2026-06-05 16:23 UTC (permalink / raw)
  To: Kiryl Shutsemau, Chao Gao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <aiK1_q8beMcIEiwO@thinkstation>

On 6/5/26 04:42, Kiryl Shutsemau wrote:
>>> I don't see a reason why we can't keep the scoped_guard() on get side.
>> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
>> with goto, which is discouraged. See [*]
>>
>>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
>>  :that the “goto” statement can jump between scopes, the expectation is that
>>  :usage of “goto” and cleanup helpers is never mixed in the same function.
> Fair enough.
> 
> But it can also be address if we free the PAMT page array with the guard
> too :P

How important is this patch? I see "Optimize" but I read "Optional".

If we're arguing about it, maybe we should just kick it out and focus on
the more important bits.

^ permalink raw reply

* Re: [PATCH v14 23/44] arm64: RMI: Handle RMI_EXIT_RIPAS_CHANGE
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <yq5ay0hfspok.fsf@kernel.org>

On 19/05/2026 10:40, Aneesh Kumar K.V wrote:
> Steven Price <steven.price@arm.com> writes:
> 
> ...
> 
>> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
>> +			   unsigned long size, bool unmap_private,
>> +			   bool may_block)
>> +{
>> +	unsigned long end = start + size;
>> +	struct realm *realm = &kvm->arch.realm;
>> +
>> +	if (!kvm_realm_is_created(kvm))
>> +		return;
>> +
>> +	end = min(BIT(realm->ia_bits - 1), end);
>> +
>> +	realm_unmap_shared_range(kvm, start, end, may_block);
>> +	if (unmap_private)
>> +		realm_unmap_private_range(kvm, start, end, may_block);
>> +}
>> +
>  
> kvm_gmem_invalidate_begin() indicates a private-only invalidation. How
> is that supported?

Because we treat the private and shared spaces are aliasing we don't
really support a "private-only" invalidation. So the shared space will
be invalidated as well. Something has gone wrong if we've ended up with
the 'same' IPA being used in both the private and shared spaces.

Private has to be treated slightly specially because removing a private
mapping is observable by the guest (the page can't be reinserted without
the guest agreeing and the contents being wiped). For shared mappings
the page can simply be refaulted.

That said, I'll look into Wei-Lin's suggestion to use
kvm_gfn_range_filter which would allow all three combinations of
private-only, shared-only and private+shared.

Thanks,
Steve


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox