* Re: [PATCH v7 08/22] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-11 0:33 UTC (permalink / raw)
To: Hansen, Dave, Gao, Chao
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
mingo@redhat.com, Verma, Vishal L, kas@kernel.org,
sagis@google.com, Annapurve, Vishal, hpa@zytor.com,
tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
yilun.xu@linux.intel.com, dan.j.williams@intel.com,
x86@kernel.org
In-Reply-To: <aczW7MZBqsCXw4gk@intel.com>
On Wed, 2026-04-01 at 16:27 +0800, Chao Gao wrote:
> And I assume we don't need WARN_ON_ONCE(PAGE_SIZE != SZ_4K) since this is
> unlikely to break soon and shouldn't be very hard to debug and fix if it
> does.
+1
^ permalink raw reply
* Re: [PATCH v7 08/22] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-11 1:14 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-9-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> P-SEAMLDR uses the SEAMLDR_PARAMS structure to describe TDX module
> update requests. This structure contains physical addresses pointing to
> the module binary and its signature file (or sigstruct), along with an
> update scenario field.
>
> TDX modules are distributed in the tdx_blob format defined in
> blob_structure.txt from the "Intel TDX module Binaries Repository". A
> tdx_blob contains a header, sigstruct, and module binary. This is also the
> format supplied by the userspace to the kernel.
>
> Parse the tdx_blob format and populate a SEAMLDR_PARAMS structure
> accordingly. This structure will be passed to P-SEAMLDR to initiate the
> update.
The thing that confused me about this earlier was the exact reason why we are
checking all the fields. We discussed that we need to check the fields that
kernel processes, but we don't need to double check data that the TDX module
processes.
Should we explain it? And how it explains the checks below?
>
> Note that the sigstruct_pa field in SEAMLDR_PARAMS has been extended to
> a 4-element array. The updated "SEAM Loader (SEAMLDR) Interface
> Specification" will be published separately. P-SEAMLDR compatibility
> validation (such as 4KB vs 16KB sigstruct support) is left to userspace,
> which must verify the P-SEAMLDR version meets the TDX module's minimum
> requirements.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
> v7:
> - add blob size/alignment validation. Ensure 4KB chunking is clearly
> correct: serves as both defense and documentation
> - split a long one-line comment [Kiryl]
>
> v6:
> - clarify tdx_blob's @offset_of_module and @len fields [Kiryl]
> - clarify comment to explicitly call out the PAGE_SIZE != SZ_4K case
> [Kiryl]
>
> v5:
> - use a macro for tdx_blob version (0x100) [Yan]
> - don't do alignment checking for the binary/sigstruct [Rick]
> - drop blob's sigstruct and validation checking
> - set seamldr_params.version to 1 when necessary
> - drop the link to blob_structure.txt which might be unstable [Kai]
>
> v4:
> - Remove checksum verification as it is optional
> - Convert comments to is_vmalloc_addr() checks [Kai]
> - Explain size/alignment checks in alloc_seamldr_params() [Kai]
>
> v3:
> - Print tdx_blob version in hex [Binbin]
> - Drop redundant sigstruct alignment check [Yilun]
> - Note buffers passed from firmware upload infrastructure are
> vmalloc()'d above alloc_seamldr_params()
> ---
> arch/x86/virt/vmx/tdx/seamldr.c | 163 ++++++++++++++++++++++++++++++++
> 1 file changed, 163 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index e93a5d90a3ee..219a8e0c7127 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -7,6 +7,7 @@
> #define pr_fmt(fmt) "seamldr: " fmt
>
> #include <linux/mm.h>
> +#include <linux/slab.h>
> #include <linux/spinlock.h>
>
> #include <asm/seamldr.h>
> @@ -16,6 +17,33 @@
> /* P-SEAMLDR SEAMCALL leaf function */
> #define P_SEAMLDR_INFO 0x8000000000000000
>
> +#define SEAMLDR_MAX_NR_MODULE_4KB_PAGES 496
> +#define SEAMLDR_MAX_NR_SIG_4KB_PAGES 4
> +
> +/*
> + * The seamldr_params "scenario" field specifies the operation mode:
> + * 0: Install TDX module from scratch (not used by kernel)
> + * 1: Update existing TDX module to a compatible version
> + */
> +#define SEAMLDR_SCENARIO_UPDATE 1
> +
> +/*
> + * This is called the "SEAMLDR_PARAMS" data structure and is defined
> + * in "SEAM Loader (SEAMLDR) Interface Specification".
> + *
> + * It describes the TDX module that will be installed.
> + */
> +struct seamldr_params {
> + u32 version;
> + u32 scenario;
> + u64 sigstruct_pa[SEAMLDR_MAX_NR_SIG_4KB_PAGES];
> + u8 reserved[80];
> + u64 num_module_pages;
> + u64 mod_pages_pa_list[SEAMLDR_MAX_NR_MODULE_4KB_PAGES];
> +} __packed;
> +
> +static_assert(sizeof(struct seamldr_params) == 4096);
> +
> /*
> * Serialize P-SEAMLDR calls since the hardware only allows a single CPU to
> * interact with P-SEAMLDR simultaneously. Use raw version as the calls can
> @@ -42,6 +70,136 @@ int seamldr_get_info(struct seamldr_info *seamldr_info)
> }
> EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
>
> +static void free_seamldr_params(struct seamldr_params *params)
> +{
> + free_page((unsigned long)params);
> +}
Do we really need this helper? This doesn't work?
DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
if (!IS_ERR_OR_NULL(_T)) free_page((unsigned long)_T))
> +
> +static struct seamldr_params *alloc_seamldr_params(const void *module, unsigned int module_size,
> + const void *sig, unsigned int sig_size)
> +{
> + struct seamldr_params *params;
> + const u8 *ptr;
> + int i;
> +
> + if (module_size > SEAMLDR_MAX_NR_MODULE_4KB_PAGES * SZ_4K)
> + return ERR_PTR(-EINVAL);
> +
> + if (sig_size > SEAMLDR_MAX_NR_SIG_4KB_PAGES * SZ_4K)
> + return ERR_PTR(-EINVAL);
I don't know if it's worth that much, but we could do a MIN thing here to
protect the loop, and lose the conditionals. If userspace passes a blob that is
out of spec they can deal with the module error, no?
> +
> + /*
> + * Check that input buffers satisfy P-SEAMLDR's size and alignment
> + * constraints so they can be passed directly to P-SEAMLDR without
> + * relocation or copy.
> + */
> + if (!IS_ALIGNED(module_size, SZ_4K) || !IS_ALIGNED(sig_size, SZ_4K) ||
> + !IS_ALIGNED((unsigned long)module, SZ_4K) ||
> + !IS_ALIGNED((unsigned long)sig, SZ_4K))
> + return ERR_PTR(-EINVAL);
I thought you are going to reduce this checking to just to reject invalid input
that the kernel processes.
What happens if we don't check this? The vmallocs are all going to be page
aligned anyway. But even still, does it mess up the below loops somehow in a way
that hurts anything?
I might be confused, but it seems different then we discussed.
> +
> + params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
> + if (!params)
> + return ERR_PTR(-ENOMEM);
> +
> + /*
> + * Only use version 1 when required (sigstruct > 4KB) for backward
> + * compatibility with P-SEAMLDR that lacks version 1 support.
> + */
> + if (sig_size > SZ_4K)
> + params->version = 1;
> + else
> + params->version = 0;
I'm a bit confused by this part. What does it mean to support old P-SEAMLDRs?
But also could it be:
params->version = sig_size > SZ_4K;
> +
> + params->scenario = SEAMLDR_SCENARIO_UPDATE;
> +
> + ptr = sig;
> + for (i = 0; i < sig_size / SZ_4K; i++) {
> + /*
> + * @sig is 4KB-aligned, but that does not imply PAGE_SIZE
> + * alignment when PAGE_SIZE != SZ_4K. Always include the
> + * in-page offset.
> + */
> + params->sigstruct_pa[i] = (vmalloc_to_pfn(ptr) << PAGE_SHIFT) +
> + ((unsigned long)ptr & ~PAGE_MASK);
> + ptr += SZ_4K;
> + }
> +
> + params->num_module_pages = module_size / SZ_4K;
> +
> + ptr = module;
> + for (i = 0; i < params->num_module_pages; i++) {
> + params->mod_pages_pa_list[i] = (vmalloc_to_pfn(ptr) << PAGE_SHIFT) +
> + ((unsigned long)ptr & ~PAGE_MASK);
> + ptr += SZ_4K;
> + }
> +
> + return params;
> +}
> +
> +/*
> + * Intel TDX module blob. Its format is defined at:
> + * https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
> + *
> + * Note this structure differs from the reference above: the two variable-length
> + * fields "@sigstruct" and "@module" are represented as a single "@data" field
> + * here and split programmatically using the offset_of_module value.
> + *
> + * Note @offset_of_module is relative to the start of struct tdx_blob, not
> + * @data, and @length is the total length of the blob, not the length of
> + * @data.
> + */
> +struct tdx_blob {
> + u16 version;
> + u16 checksum;
> + u32 offset_of_module;
> + u8 signature[8];
> + u32 length;
> + u32 reserved0;
> + u64 reserved1[509];
> + u8 data[];
> +} __packed;
> +
> +/* Supported versions of the tdx_blob */
> +#define TDX_BLOB_VERSION_1 0x100
> +
> +static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
> +{
> + const struct tdx_blob *blob = (const void *)data;
> + int module_size, sig_size;
> + const void *sig, *module;
> +
> + /*
> + * Ensure the size is valid otherwise reading any field from the
> + * blob may overflow.
> + */
> + if (size <= sizeof(struct tdx_blob) || size <= blob->offset_of_module)
> + return ERR_PTR(-EINVAL);
> +
> + if (blob->version != TDX_BLOB_VERSION_1)
> + return ERR_PTR(-EINVAL);
> +
> + if (blob->reserved0 || memchr_inv(blob->reserved1, 0, sizeof(blob->reserved1)))
> + return ERR_PTR(-EINVAL);
> +
> + /* Split the blob into a sigstruct and a module. */
> + sig = blob->data;
> + sig_size = blob->offset_of_module - sizeof(struct tdx_blob);
> + module = data + blob->offset_of_module;
> + module_size = size - blob->offset_of_module;
Did you consider just passing the tdx_blob into alloc_seamldr_params()?
Basically, this function checks the blob fields, then alloc_seamldr_params()
turns blob into struct seamldr_params without checks. The way it is, the work
seems kind of spread around two functions with various checks.
> +
> + if (sig_size <= 0 || module_size <= 0 || blob->length != size)
> + return ERR_PTR(-EINVAL);
> +
> + if (memcmp(blob->signature, "TDX-BLOB", 8))
> + return ERR_PTR(-EINVAL);
> +
> + return alloc_seamldr_params(module, module_size, sig, sig_size);
> +}
> +
> +DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
> + if (!IS_ERR_OR_NULL(_T)) free_seamldr_params(_T))
> +
> /**
> * seamldr_install_module - Install a new TDX module.
> * @data: Pointer to the TDX module update blob.
> @@ -51,6 +209,11 @@ EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
> */
> int seamldr_install_module(const u8 *data, u32 size)
> {
> + struct seamldr_params *params __free(free_seamldr_params) =
> + init_seamldr_params(data, size);
> + if (IS_ERR(params))
> + return PTR_ERR(params);
> +
> /* TODO: Update TDX module here */
> return 0;
> }
^ permalink raw reply
* Re: [PATCH v7 09/22] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Edgecombe, Rick P @ 2026-04-11 1:23 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-10-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> Potential alternative to stop_machine()
> =======================================
> An alternative approach is to lock all KVM entry points and kick all
> vCPUs. Here, KVM entry points refer to KVM VM/vCPU ioctl entry points,
> implemented in KVM common code (virt/kvm). Adding a locking mechanism
> there would affect all architectures KVM supports. And to lock only TDX
> vCPUs, new logic would be needed to identify TDX vCPUs, which the KVM
> common code currently lacks. This would add significant complexity and
> maintenance overhead to KVM for this TDX-specific use case.
I'd add a little imperative ending like ", so don't take this approach."
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* Re: [PATCH v7 10/22] x86/virt/seamldr: Abort updates if errors occurred midway
From: Edgecombe, Rick P @ 2026-04-11 1:26 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-11-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> The TDX module update process has multiple steps, each of which may
> encounter failures.
>
> The current state machine of updates proceeds to the next step regardless
> of errors. But continuing updates when errors occur midway is pointless.
This kind of begs the question of how much it matters if some pointless work
happens in error condition during a rare operation. I'm thinking at this point,
aha!, do we need this?
>
> Abort the update by setting a flag to indicate that a CPU has encountered
> an error, forcing all CPUs to exit the execution loop. Note that failing
> CPUs do not acknowledge the current step. This keeps all other CPUs waiting
> in the current step (since advancing to the next step requires all CPUs to
> acknowledge the current step) until they detect the fault flag and exit the
> loop.
So is the point of the patch to prevent the operation from getting stuck? Or
saving the user experiencing a failed update a little time?
^ permalink raw reply
* Re: [PATCH v7 11/22] x86/virt/seamldr: Shut down the current TDX module
From: Edgecombe, Rick P @ 2026-04-11 1:35 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-12-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> The first step of TDX module updates is shutting down the current TDX
> Module. This step also packs state information that needs to be
> preserved across updates as handoff data, which will be consumed by the
> updated module. The handoff data is stored internally in the SEAM range
> and is hidden from the kernel.
>
> To ensure a successful update, the new module must be able to consume
> the handoff data generated by the old module. Since handoff data layout
> may change between modules, the handoff data is versioned. Each module
> has a native handoff version and provides backward support for several
> older versions.
>
> The complete handoff versioning protocol is complex as it supports both
> module upgrades and downgrades. See details in Intel® Trust Domain
> Extensions (Intel® TDX) Module Base Architecture Specification, Chapter
> "Handoff Versioning".
>
> Ideally, the kernel needs to retrieve the handoff versions supported by
> the current module and the new module and select a version supported by
> both. But, since this implementation chooses to only support module
> upgrades, simply request the current module to generate handoff data
> using its highest supported version, expecting that the new module will
> likely support it.
I feel like somewhere it's missing what this patch does. It explains the
reasoning for the handoff version selection, but nothing about implement
"MODULE_UPDATE_SHUTDOWN" or anything like that.
>
> Note that only one CPU needs to call the TDX module's shutdown API.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
> v5:
> - Massage changelog [Kai]
> - Avoid "refers to the global copy while populating the tdx_sys_info
> passed as a pointer" [Rick/Yilun]
>
> v4:
> - skip the whole handoff metadata if runtime updates are not supported
> [Yilun]
> v3:
> - remove autogeneration stuff in the changelog
> v2:
> - add a comment about how handoff version is chosen.
> - remove the first !ret in get_tdx_sys_info_handoff() as we edited the
> auto-generated code anyway
> - remove !! when determining whether a CPU is the primary one
> - remove unnecessary if-break nesting in TDP_SHUTDOWN
> ---
> arch/x86/include/asm/tdx_global_metadata.h | 5 +++++
> arch/x86/virt/vmx/tdx/seamldr.c | 11 ++++++++++-
> arch/x86/virt/vmx/tdx/tdx.c | 15 +++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 3 +++
> arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 20 ++++++++++++++++++++
> 5 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
> index 40689c8dc67e..8a9ebd895e70 100644
> --- a/arch/x86/include/asm/tdx_global_metadata.h
> +++ b/arch/x86/include/asm/tdx_global_metadata.h
> @@ -40,12 +40,17 @@ struct tdx_sys_info_td_conf {
> u64 cpuid_config_values[128][2];
> };
>
> +struct tdx_sys_info_handoff {
> + u16 module_hv;
> +};
> +
> struct tdx_sys_info {
> struct tdx_sys_info_version version;
> struct tdx_sys_info_features features;
> struct tdx_sys_info_tdmr tdmr;
> struct tdx_sys_info_td_ctrl td_ctrl;
> struct tdx_sys_info_td_conf td_conf;
> + struct tdx_sys_info_handoff handoff;
> };
>
> #endif
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 5964bbc67944..a8bfa30ee55f 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -15,6 +15,7 @@
> #include <asm/seamldr.h>
>
> #include "seamcall_internal.h"
> +#include "tdx.h"
>
> /* P-SEAMLDR SEAMCALL leaf function */
> #define P_SEAMLDR_INFO 0x8000000000000000
> @@ -207,6 +208,7 @@ static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
> */
> enum module_update_state {
> MODULE_UPDATE_START,
> + MODULE_UPDATE_SHUTDOWN,
> MODULE_UPDATE_DONE,
> };
>
> @@ -246,8 +248,12 @@ static void ack_state(void)
> static int do_seamldr_install_module(void *seamldr_params)
> {
> enum module_update_state newstate, curstate = MODULE_UPDATE_START;
> + int cpu = smp_processor_id();
> + bool primary;
> int ret = 0;
>
> + primary = cpumask_first(cpu_online_mask) == cpu;
> +
> do {
> /* Chill out and re-read update_data. */
> cpu_relax();
> @@ -256,7 +262,10 @@ static int do_seamldr_install_module(void *seamldr_params)
> if (newstate != curstate) {
> curstate = newstate;
> switch (curstate) {
> - /* TODO: add the update steps. */
> + case MODULE_UPDATE_SHUTDOWN:
> + if (primary)
> + ret = tdx_module_shutdown();
> + break;
> default:
> break;
> }
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 172f6d4133b5..f87fad429f4e 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1176,6 +1176,21 @@ int tdx_enable(void)
> }
> EXPORT_SYMBOL_FOR_KVM(tdx_enable);
>
> +int tdx_module_shutdown(void)
> +{
> + struct tdx_module_args args = {};
> +
> + /*
> + * Shut down the TDX module and prepare handoff data for the next
> + * TDX module. This SEAMCALL requires a handoff version. Use the
> + * module's handoff version, as it is the highest version the
> + * module can produce and is more likely to be supported by new
> + * modules as new modules likely have higher handoff version.
> + */
This one could probably be made less verbose too.
> + args.rcx = tdx_sysinfo.handoff.module_hv;
> + return seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
> +}
> +
> static bool is_pamt_page(unsigned long phys)
> {
> struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 82bb82be8567..1c4da9540ae0 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,7 @@
> #define TDH_PHYMEM_PAGE_WBINVD 41
> #define TDH_VP_WR 43
> #define TDH_SYS_CONFIG 45
> +#define TDH_SYS_SHUTDOWN 52
>
> /*
> * SEAMCALL leaf:
> @@ -118,4 +119,6 @@ struct tdmr_info_list {
> int max_tdmrs; /* How many 'tdmr_info's are allocated */
> };
>
> +int tdx_module_shutdown(void);
> +
> #endif
> diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> index 34c050d74b56..225e157805e8 100644
> --- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> +++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> @@ -100,6 +100,19 @@ static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf
> return ret;
> }
>
> +static int get_tdx_sys_info_handoff(struct tdx_sys_info_handoff *sysinfo_handoff)
> +{
> + int ret;
> + u64 val;
> +
> + ret = read_sys_metadata_field(0x8900000100000000, &val);
> + if (ret)
> + return ret;
> +
> + sysinfo_handoff->module_hv = val;
> + return 0;
> +}
> +
> static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> {
> int ret = 0;
> @@ -116,5 +129,12 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
> ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
>
> + /*
> + * Don't treat a module that doesn't support update as a failure.
> + * Only read the metadata optionally.
> + */
> + if (tdx_supports_runtime_update(sysinfo))
> + ret = ret ?: get_tdx_sys_info_handoff(&sysinfo->handoff);
> +
> return ret;
> }
^ permalink raw reply
* Re: [PATCH v7 11/22] x86/virt/seamldr: Shut down the current TDX module
From: Edgecombe, Rick P @ 2026-04-11 1:36 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <a50bb8a4d8c2bdcb0795eca4551567fb5a577215.camel@intel.com>
On Fri, 2026-04-10 at 18:35 -0700, Rick Edgecombe wrote:
> This one could probably be made less verbose too.
Doh! You already said this.
^ permalink raw reply
* Re: [PATCH v7 12/22] x86/virt/tdx: Reset software states during TDX module shutdown
From: Edgecombe, Rick P @ 2026-04-11 1:56 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-13-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> The TDX module requires a one-time global initialization (TDH.SYS.INIT) and
> per-CPU initialization (TDH.SYS.LP.INIT) before use. These initializations
> are guarded by software flags to prevent repetition.
>
> After TDX module updates, the new TDX module requires the same global and
> per-CPU initializations, but the existing software flags prevent
> re-initialization.
>
> Reset all software flags guarding the initialization flows to allow the
> global and per-CPU initializations to be triggered again after updates.
>
> Set tdx_module_status to ERROR to indicate the module is unavailable. This
> is to prevent re-initialization/tdx_sysinfo reporting on a failed update.
> Using ERROR instead of UNINITIALIZED as the latter implicitly depends on
> get_tdx_sys_info() failing early to prevent re-init after successful
> shutdown followed by failed update.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
The existing code around sysinit_done/ret is very strange to me. It makes this
patch seem a bit strained, but that is not it's fault.
^ permalink raw reply
* Re: [PATCH v7 13/22] x86/virt/seamldr: Install a new TDX module
From: Edgecombe, Rick P @ 2026-04-11 2:01 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-14-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> Following the shutdown of the existing TDX module, the update process
> continues with installing the new module. P-SEAMLDR provides the
> SEAMLDR.INSTALL SEAMCALL to perform this installation, which must be
> executed on all CPUs.
>
> Implement SEAMLDR.INSTALL and execute it on every CPU.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
It seems a pretty straight forward one. My only question would be if the log
needs a bit more info. Not sure what to put though...
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> v6:
> - wrap seamldr_call(P_SEAMLDR_INSTALL..) in a helper [Kiryl]
> v5:
> - drop "serially" from the changelog as it doesn't matter to
> this patch
> ---
> arch/x86/virt/vmx/tdx/seamldr.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index a8bfa30ee55f..3e46f3bfaa8b 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -19,6 +19,7 @@
>
> /* P-SEAMLDR SEAMCALL leaf function */
> #define P_SEAMLDR_INFO 0x8000000000000000
> +#define P_SEAMLDR_INSTALL 0x8000000000000001
>
> #define SEAMLDR_MAX_NR_MODULE_4KB_PAGES 496
> #define SEAMLDR_MAX_NR_SIG_4KB_PAGES 4
> @@ -73,6 +74,13 @@ int seamldr_get_info(struct seamldr_info *seamldr_info)
> }
> EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
>
> +static int seamldr_install(const struct seamldr_params *params)
> +{
> + struct tdx_module_args args = { .rcx = __pa(params) };
In an earlier patch you have a wrapper as:
struct tdx_module_args args = = {};
args.rxx = foo;
Why the style difference? It would be good to standardize, but the existing code
isn't standardized. What do you think about going with this style through the
series for the one arg ones?
> +
> + return seamldr_call(P_SEAMLDR_INSTALL, &args);
> +}
> +
> static void free_seamldr_params(struct seamldr_params *params)
> {
> free_page((unsigned long)params);
> @@ -209,6 +217,7 @@ static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
> enum module_update_state {
> MODULE_UPDATE_START,
> MODULE_UPDATE_SHUTDOWN,
> + MODULE_UPDATE_CPU_INSTALL,
> MODULE_UPDATE_DONE,
> };
>
> @@ -266,6 +275,9 @@ static int do_seamldr_install_module(void *seamldr_params)
> if (primary)
> ret = tdx_module_shutdown();
> break;
> + case MODULE_UPDATE_CPU_INSTALL:
> + ret = seamldr_install(seamldr_params);
> + break;
> default:
> break;
> }
^ permalink raw reply
* Re: [PATCH v7 14/22] x86/virt/seamldr: Do TDX per-CPU initialization after updates
From: Edgecombe, Rick P @ 2026-04-11 2:03 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-15-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> After installing the new TDX module, each CPU needs to be initialized
> again to make the CPU ready to run any other SEAMCALLs. So, call
> tdx_cpu_enable() on all CPUs.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* Re: [PATCH v7 15/22] x86/virt/tdx: Restore TDX module state
From: Edgecombe, Rick P @ 2026-04-11 2:06 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <adTzamrJOOy+fGyF@intel.com>
On Tue, 2026-04-07 at 20:07 +0800, Chao Gao wrote:
> > +int tdx_module_run_update(void)
> > +{
> > + struct tdx_module_args args = {};
> > + int ret;
> > +
> > + ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
> > + if (ret) {
> > + pr_err("update failed (%d)\n", ret);
> > + tdx_module_status = TDX_MODULE_ERROR;
> > + return ret;
> > + }
>
> The pr_err() isn't needed as seamcall_prerr() will emit a
> message. and no need to set tdx_module_status to ERROR on
> failure as it is already done during shutdown.
>
> so, this can be simplified to:
>
> ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
> if (ret)
> return ret;
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* [PATCH] ACPI: block AML access to confidential VM private memory
From: bfoing @ 2026-04-11 22:44 UTC (permalink / raw)
To: linux-acpi
Cc: rafael, lenb, robert.moore, kirill.shutemov, thomas.lendacky,
linux-coco, linux-kernel, Bertrand Foing
From: Bertrand Foing <40759640+bfoing@users.noreply.github.com>
Add a guard in the ACPICA SystemMemory space handler that prevents AML
bytecode from reading or writing pages belonging to the confidential VM
private address range.
On TDX and SEV-SNP guests the ACPI tables are under host/VMM control.
Malicious AML ("BadAML") can issue SystemMemory region reads and writes
to arbitrary guest physical addresses, extracting secrets or corrupting
guest state without triggering any existing kernel protection.
The guard walks the kernel page tables for the target virtual address
and checks whether the page-table entry carries the platform-specific
is private the access is denied with AE_AML_ILLEGAL_ADDRESS.
Signed-off-by: Bertrand Foing <40759640+bfoing@users.noreply.github.com>
---
drivers/acpi/Makefile | 1 +
drivers/acpi/acpica/exregion.c | 12 ++++
drivers/acpi/cvm_guard.c | 121 +++++++++++++++++++++++++++++++++
3 files changed, 134 insertions(+)
create mode 100644 drivers/acpi/cvm_guard.c
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index d1b0affb8..6743ece85 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -45,6 +45,7 @@ acpi-y += resource.o
acpi-y += acpi_processor.o
acpi-y += processor_core.o
acpi-$(CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC) += processor_pdc.o
+acpi-$(CONFIG_ARCH_HAS_CC_PLATFORM) += cvm_guard.o
acpi-$(CONFIG_ACPI_EC) += ec.o
acpi-$(CONFIG_ACPI_DOCK) += dock.o
acpi-$(CONFIG_PCI) += pci_root.o pci_link.o pci_irq.o
diff --git a/drivers/acpi/acpica/exregion.c b/drivers/acpi/acpica/exregion.c
index a390a1c2b..f12cacff3 100644
--- a/drivers/acpi/acpica/exregion.c
+++ b/drivers/acpi/acpica/exregion.c
@@ -14,6 +14,12 @@
#define _COMPONENT ACPI_EXECUTER
ACPI_MODULE_NAME("exregion")
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+bool acpi_cvm_guard_deny_access(unsigned long virt_addr);
+#else
+static inline bool acpi_cvm_guard_deny_access(unsigned long v) { return false; }
+#endif
+
/*******************************************************************************
*
* FUNCTION: acpi_ex_system_memory_space_handler
@@ -176,6 +182,12 @@ acpi_ex_system_memory_space_handler(u32 function,
logical_addr_ptr = mm->logical_address +
((u64) address - (u64) mm->physical_address);
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+ if (acpi_cvm_guard_deny_access((unsigned long)logical_addr_ptr)) {
+ return_ACPI_STATUS(AE_AML_ILLEGAL_ADDRESS);
+ }
+#endif
+
ACPI_DEBUG_PRINT((ACPI_DB_INFO,
"System-Memory (width %u) R/W %u Address=%8.8X%8.8X\n",
bit_width, function, ACPI_FORMAT_UINT64(address)));
diff --git a/drivers/acpi/cvm_guard.c b/drivers/acpi/cvm_guard.c
new file mode 100644
index 000000000..0524bf902
--- /dev/null
+++ b/drivers/acpi/cvm_guard.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CVM Guard - Block AML access to confidential VM private memory
+ *
+ * Copyright (C) 2026 Privasys
+ *
+ * On TDX and SEV-SNP guests the host VMM controls ACPI tables, so
+ * AML bytecode executing SystemMemory reads and writes can target
+ * arbitrary guest physical addresses. This file provides a guard
+ * function called from the ACPICA SystemMemory space handler that
+ * checks whether the target virtual address maps to a page marked
+ * as encrypted (private) in the page tables, and denies the access
+ * if so.
+ *
+ * Reference: "BadAML: Exploiting AML in Confidential Virtual Machines"
+ * Takekoshi et al., ACM CCS 2025
+ */
+
+#include <linux/cc_platform.h>
+#include <linux/mm.h>
+#include <linux/printk.h>
+#include <asm/coco.h>
+
+/* Prototype to satisfy -Wmissing-prototypes; declared here rather than in
+ * internal.h because this file does not need the full ACPI driver headers.
+ */
+bool acpi_cvm_guard_deny_access(unsigned long virt_addr);
+
+/*
+ * Walk the four-level kernel page tables for @addr and return the raw
+ * PTE/PMD/PUD value. Returns 0 if the walk fails at any level.
+ * Handles 1 GB (PUD) and 2 MB (PMD) large pages.
+ */
+static unsigned long cvm_guard_pte_val(unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return 0;
+ if (pud_leaf(*pud))
+ return pud_val(*pud);
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return 0;
+ if (pmd_leaf(*pmd))
+ return pmd_val(*pmd);
+
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte))
+ return 0;
+
+ return pte_val(*pte);
+}
+
+/*
+ * Check whether @addr maps to a private (encrypted) page.
+ *
+ * cc_mkenc() applies the platform-specific encryption mask:
+ * AMD SEV/SEV-SNP: sets the C-bit
+ * Intel TDX: clears the shared bit
+ *
+ * If the PTE already matches its encrypted form, the page is private
+ * and must not be accessible to AML. If the walk fails (returns 0)
+ * we deny access - fail-closed is the safe default.
+ */
+static bool cvm_guard_page_is_private(unsigned long addr)
+{
+ unsigned long val;
+
+ val = cvm_guard_pte_val(addr);
+ if (!val) {
+ pr_warn_ratelimited("CVM guard: page table walk failed for %lx\n",
+ addr);
+ return true;
+ }
+
+ return val == cc_mkenc(val);
+}
+
+/**
+ * acpi_cvm_guard_deny_access - block AML access to CVM private pages
+ * @virt_addr: kernel virtual address resolved by the SystemMemory handler
+ *
+ * Called from acpi_ex_system_memory_space_handler() after the virtual
+ * address has been computed but before any read or write.
+ *
+ * On non-CVM systems (CC_ATTR_MEM_ENCRYPT not set) this returns false.
+ *
+ * Return: true if the access must be denied, false if allowed.
+ */
+bool acpi_cvm_guard_deny_access(unsigned long virt_addr)
+{
+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return false;
+
+ pr_info_once("CVM guard: active, AML access to private pages will be denied\n");
+
+ virt_addr &= PAGE_MASK;
+
+ if (cvm_guard_page_is_private(virt_addr)) {
+ pr_warn_ratelimited("CVM guard: denied AML access to private page at %lx\n",
+ virt_addr);
+ return true;
+ }
+
+ return false;
+}
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v2 03/31] x86/virt/tdx: Add tdx_page_array helpers for new TDX Module objects
From: Dan Williams @ 2026-04-12 2:53 UTC (permalink / raw)
To: Xu Yilun, linux-coco, linux-pci, x86
Cc: chao.gao, dave.jiang, baolu.lu, yilun.xu, yilun.xu,
zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <20260327160132.2946114-4-yilun.xu@linux.intel.com>
Xu Yilun wrote:
> Add struct tdx_page_array definition for new TDX Module object
> types - HPA_ARRAY_T and HPA_LIST_INFO. They are used as input/output
> parameters in newly defined SEAMCALLs. Also define some helpers to
> allocate, setup and free tdx_page_array.
>
> HPA_ARRAY_T and HPA_LIST_INFO are similar in most aspects. They both
> represent a list of pages for TDX Module accessing. There are several
> use cases for these 2 structures:
>
> - As SEAMCALL inputs. They are claimed by TDX Module as control pages.
> Control pages are private pages for TDX Module to hold its internal
> control structures or private data. TDR, TDCS, TDVPR... are existing
> control pages, just not added via tdx_page_array.
> - As SEAMCALL outputs. They were TDX Module control pages and now are
> released.
> - As SEAMCALL inputs. They are just temporary buffers for exchanging
> data blobs in one SEAMCALL. TDX Module will not hold them for long
> time.
>
> The 2 structures both need a 'root page' which contains a list of HPAs.
> They collapse the HPA of the root page and the number of valid HPAs
> into a 64 bit raw value for SEAMCALL parameters. The root page is
> always a medium for passing data pages, TDX Module never keeps the
> root page.
>
> A main difference is HPA_ARRAY_T requires singleton mode when
> containing just 1 functional page (page0). In this mode the root page is
> not needed and the HPA field of the raw value directly points to the
> page0. But in this patch, root page is always allocated for user
> friendly kAPIs.
I think this undersells the fact that "singleton mode" is a premature
optimization that requires complication to take advantage of the benefit
(sometimes save a single page allocation). The Linux implementation
forfeits that small benefit for the larger gain of cleaner kAPIs.
> Another small difference is HPA_LIST_INFO contains a 'first entry' field
> which could be filled by TDX Module. This simplifies host by providing
> the same structure when re-invoke the interrupted SEAMCALL. No need for
> host to touch this field.
>
> Typical usages of the tdx_page_array:
>
> 1. Add control pages:
> - struct tdx_page_array *array = tdx_page_array_create(nr_pages);
> - seamcall(TDH_XXX_CREATE, array, ...);
>
> 2. Release control pages:
> - seamcall(TDX_XXX_DELETE, array, &nr_released, &released_hpa);
It is simply a bug if TDH_XXX_DELETE does not return every resource
passed to TDH_XXX_CREATE. The only "leak" case to worry about is that
TDH_XXX_DELETE fails. In that case it should be "fatal" (TDX_BUG_ON,
system can keep hobbling along, but panic_on_warn() would not be
unreasonable). If TDH_XXX_DELETE fails it indicates some catastrophic
misunderstanding between Linux and the TDX Module.
So the seamcall in this case has no need for @nr_released or
@released_hpa, those should already be known to the kernel.
What is missing is an architectural guarantee that TDH_XXX_DELETE
success == "all resources you arranged at TDH_XXX_CREATE time are free".
I would hope that is already the case and AUX_PAGE_PA is only an
unfortunate distraction. If it can ever be the case that CREATE and
DELETE are asymmetric on success then that needs to be corrected and
Linux will wait for a future module that can make that guarantee.
I think that cleans up a bulk of the logic here to abandon caring that
the module tries to remind us what we are releasing.
^ permalink raw reply
* Re: [PATCH v2] KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()
From: Binbin Wu @ 2026-04-13 2:13 UTC (permalink / raw)
To: Rick Edgecombe
Cc: kas, kvm, linux-coco, linux-kernel, pbonzini, seanjc, dmaluka
In-Reply-To: <20260410232654.3864196-1-rick.p.edgecombe@intel.com>
On 4/11/2026 7:26 AM, Rick Edgecombe wrote:
> Rework tdx_has_emulated_msr() to explicitly enumerate the x2APIC MSRs
> that KVM can emulate, instead of trying to enumerate the MSRs that KVM
> cannot emulate. Drop the inner switch and list the emulatable x2APIC
> registers directly in the outer switch's "return true" block.
>
> The old code had multiple bugs in the x2APIC range handling.
> X2APIC_MSR(APIC_ISR + APIC_ISR_NR) was incorrect because APIC_ISR_NR is
> 0x8, not 0x80, so the X2APIC_MSR() shift lost the lower bits, collapsing
> each range to a single MSR. IA32_X2APIC_SELF_IPI was also missing from
> the non-emulatable list.
Is it better to describe that the bug is benign for a sane guest?
>
> KVM has no visibility into whether or not a guest has enabled #VE
> reduction, which changes which MSRs the TDX-Module handles itself versus
> triggering a #VE for the guest to make a TDVMCALL. So maintaining a list
> of non-emulatable MSRs is fragile. Listing only the MSRs KVM can always
> emulate sidesteps the problem.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Reported-by: Dmytro Maluka <dmaluka@chromium.org>
> Fixes: dd50294f3e3c ("KVM: TDX: Implement callbacks for MSR operations")
> Assisted-by: Claude:claude-opus-4-6
> [based on a diff from Sean, but added missed LVTCMCI case, log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
One nit below.
> ---
>
> Thanks to Dmytro for finding this. They said to feel free to take this
> over, so here is another version with Sean's suggestions. Tested in the
> TDX CI.
>
> In Sean's suggestion LVTCMCI was missed, so it's added here.
>
> arch/x86/kvm/vmx/tdx.c | 36 ++++++++++++++++++++----------------
> 1 file changed, 20 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1e47c194af53..76ab6805ab29 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2116,23 +2116,27 @@ bool tdx_has_emulated_msr(u32 index)
> case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
> case MSR_KVM_POLL_CONTROL:
> + /*
> + * x2APIC registers that are virtualized by the CPU can't be
> + * emulated, KVM doesn't have access to the virtual APIC page.
> + */
Nit:
The original comment explains why certain MSRs are not emulated due to
its implementation. After the change, it lists the allowed emulated MSRs,
a quick read might give the false impression that the listed MSRs are the
ones that cannot be emulated. Maybe slightly tweak the comment to clarify
that the cases listed below are the MSRs that KVM is responsible for
emulating.
> + case X2APIC_MSR(APIC_ID):
> + case X2APIC_MSR(APIC_LVR):
> + case X2APIC_MSR(APIC_LDR):
> + case X2APIC_MSR(APIC_SPIV):
> + case X2APIC_MSR(APIC_ESR):
> + case X2APIC_MSR(APIC_LVTCMCI):
> + case X2APIC_MSR(APIC_ICR):
> + case X2APIC_MSR(APIC_LVTT):
> + case X2APIC_MSR(APIC_LVTTHMR):
> + case X2APIC_MSR(APIC_LVTPC):
> + case X2APIC_MSR(APIC_LVT0):
> + case X2APIC_MSR(APIC_LVT1):
> + case X2APIC_MSR(APIC_LVTERR):
> + case X2APIC_MSR(APIC_TMICT):
> + case X2APIC_MSR(APIC_TMCCT):
> + case X2APIC_MSR(APIC_TDCR):
> return true;
> - case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
> - /*
> - * x2APIC registers that are virtualized by the CPU can't be
> - * emulated, KVM doesn't have access to the virtual APIC page.
> - */
> - switch (index) {
> - case X2APIC_MSR(APIC_TASKPRI):
> - case X2APIC_MSR(APIC_PROCPRI):
> - case X2APIC_MSR(APIC_EOI):
> - case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
> - case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
> - case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
> - return false;
> - default:
> - return true;
> - }
> default:
> return false;
> }
^ permalink raw reply
* Re: [PATCH v2 2/6] KVM: x86: Drop the "EX" part of "EXREG" to avoid collision with APX
From: Huang, Kai @ 2026-04-13 11:23 UTC (permalink / raw)
To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Bae, Chang Seok,
linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <20260409224236.2021562-3-seanjc@google.com>
On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> Now that NR_VCPU_REGS is no longer a thing, and now that now that RIP is
Nit: double "now that".
^ permalink raw reply
* Re: SVSM Development Call April 8th, 2026
From: Jörg Rödel @ 2026-04-13 11:17 UTC (permalink / raw)
To: Stefano Garzarella; +Cc: coconut-svsm, linux-coco
In-Reply-To: <CAGxU2F6OApB3K61_sPujnvK_gx_K8zFWyOSTPV9mzWOCyNkBJg@mail.gmail.com>
Meeting Minutes are here, please everyone check for accuracy and completeness:
https://github.com/coconut-svsm/governance/pull/103
-Joerg
^ permalink raw reply
* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Huang, Kai @ 2026-04-13 11:24 UTC (permalink / raw)
To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Bae, Chang Seok,
linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <20260409224236.2021562-6-seanjc@google.com>
On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> Convert regs_{avail,dirty} and all related masks to "unsigned long" values
> as an intermediate step towards declaring the fields as actual bitmaps, and
> as a step toward support APX, which will push the total number of registers
Nit: a step toward supporting APX.
^ permalink raw reply
* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Huang, Kai @ 2026-04-13 11:28 UTC (permalink / raw)
To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Bae, Chang Seok,
linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <20260409224236.2021562-6-seanjc@google.com>
On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> -#define TDX_REGS_AVAIL_SET (BIT_ULL(VCPU_REG_EXIT_INFO_1) | \
> - BIT_ULL(VCPU_REG_EXIT_INFO_2) | \
> - BIT_ULL(VCPU_REGS_RAX) | \
> - BIT_ULL(VCPU_REGS_RBX) | \
> - BIT_ULL(VCPU_REGS_RCX) | \
> - BIT_ULL(VCPU_REGS_RDX) | \
> - BIT_ULL(VCPU_REGS_RBP) | \
> - BIT_ULL(VCPU_REGS_RSI) | \
> - BIT_ULL(VCPU_REGS_RDI) | \
> - BIT_ULL(VCPU_REGS_R8) | \
> - BIT_ULL(VCPU_REGS_R9) | \
> - BIT_ULL(VCPU_REGS_R10) | \
> - BIT_ULL(VCPU_REGS_R11) | \
> - BIT_ULL(VCPU_REGS_R12) | \
> - BIT_ULL(VCPU_REGS_R13) | \
> - BIT_ULL(VCPU_REGS_R14) | \
> - BIT_ULL(VCPU_REGS_R15))
> +#define TDX_REGS_AVAIL_SET (BIT(VCPU_REG_EXIT_INFO_1) | \
> + BIT(VCPU_REG_EXIT_INFO_2) | \
> + BIT(VCPU_REGS_RAX) | \
> + BIT(VCPU_REGS_RBX) | \
> + BIT(VCPU_REGS_RCX) | \
> + BIT(VCPU_REGS_RDX) | \
> + BIT(VCPU_REGS_RBP) | \
> + BIT(VCPU_REGS_RSI) | \
> + BIT(VCPU_REGS_RDI) | \
> + BIT(VCPU_REGS_R8) | \
> + BIT(VCPU_REGS_R9) | \
> + BIT(VCPU_REGS_R10) | \
> + BIT(VCPU_REGS_R11) | \
> + BIT(VCPU_REGS_R12) | \
> + BIT(VCPU_REGS_R13) | \
> + BIT(VCPU_REGS_R14) | \
> + BIT(VCPU_REGS_R15))
>
Not related to this series, but this made me look into whether these
registers are truly needed to be set as available for TDX.
Firstly, all the listed registers are marked as available immediately after
exiting from tdh_vp_enter(), but except VCPU_REG_EXIT_INFO_1 and
VCPU_REG_EXIT_INFO_2 are immediately saved to the common 'struct vcpu_vt',
all other GPRs are not saved to vcpu->arch.regs[], which means marking GPRs
available immediately doesn't quite make sense.
In fact, IIUC other than when the TD exits with TDVMCALL on which TD shares
couple of GPRs with KVM, KVM has no way to get TD's GPRs. So perhaps it
makes more sense is to mark the shared GPRs available upon TDVMCALL.
But even that does not make sense from KVM's "GPR available" perspective,
because TDVMCALL has a different ABI from KVM's existing infrastructure for
e.g., CPUID/MSR emulation. E.g., KVM uses RCX/RAX/RDX for MSR emulation,
but TDVMCALL<MSR.WRITE> uses R12 and R13 to convey MSR index/value:
case EXIT_REASON_MSR_WRITE:
kvm_rcx_write(vcpu, tdx->vp_enter_args.r12);
kvm_rax_write(vcpu, tdx->vp_enter_args.r13 & -1u);
kvm_rdx_write(vcpu, tdx->vp_enter_args.r13 >> 32);
So I think the most accurate way is to explicitly mark the relevant GPRs
available for each type of TDVMCALL. I am not sure whether it's worth to do
though, because AFAICT there's no real bug in the existing code, other than
"marking GPRs not in vcpu->arch.regs[] as available looks wrong".
A less invasive way is to mark all possible GPRs that can be used in
TDVMCALL emulation available once after TD exits. AFAICT the KVM hypercall
uses most GPRs (RAX/RBX/RCX/RDX/RSI) and all other TDVMCALLs only use a
subset, so maybe we can remove other GPRs from the available list (the diff
in [*] passed my test of booting/destroying TD).
Bug again, not sure whether it's worth doing.
[*]:
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 85f28363e4cc..7b4c182c22cf 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1019,17 +1019,7 @@ static fastpath_t tdx_exit_handlers_fastpath(struct
kvm_vcpu *vcpu)
BIT(VCPU_REGS_RBX) | \
BIT(VCPU_REGS_RCX) | \
BIT(VCPU_REGS_RDX) | \
- BIT(VCPU_REGS_RBP) | \
- BIT(VCPU_REGS_RSI) | \
- BIT(VCPU_REGS_RDI) | \
- BIT(VCPU_REGS_R8) | \
- BIT(VCPU_REGS_R9) | \
- BIT(VCPU_REGS_R10) | \
- BIT(VCPU_REGS_R11) | \
- BIT(VCPU_REGS_R12) | \
- BIT(VCPU_REGS_R13) | \
- BIT(VCPU_REGS_R14) | \
- BIT(VCPU_REGS_R15))
+ BIT(VCPU_REGS_RSI))
^ permalink raw reply related
* Re: [PATCH v2 0/6] KVM: x86: Reg cleanups / prep work for APX
From: Huang, Kai @ 2026-04-13 11:31 UTC (permalink / raw)
To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Bae, Chang Seok,
linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <20260409224236.2021562-1-seanjc@google.com>
On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> Clean up KVM's register tracking and storage, primarily to prepare for landing
> APX, which expands the maximum number of GPRs from 16 to 32.
>
Tested booting/destroying both normal VMX guest and TD worked fine:
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
^ permalink raw reply
* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Sean Christopherson @ 2026-04-13 14:54 UTC (permalink / raw)
To: Kai Huang
Cc: pbonzini@redhat.com, kas@kernel.org, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Chang Seok Bae,
linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <e6e82905e6b9b1637ccb640097e67d21793f5895.camel@intel.com>
On Mon, Apr 13, 2026, Kai Huang wrote:
> On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> > -#define TDX_REGS_AVAIL_SET (BIT_ULL(VCPU_REG_EXIT_INFO_1) | \
> > - BIT_ULL(VCPU_REG_EXIT_INFO_2) | \
> > - BIT_ULL(VCPU_REGS_RAX) | \
> > - BIT_ULL(VCPU_REGS_RBX) | \
> > - BIT_ULL(VCPU_REGS_RCX) | \
> > - BIT_ULL(VCPU_REGS_RDX) | \
> > - BIT_ULL(VCPU_REGS_RBP) | \
> > - BIT_ULL(VCPU_REGS_RSI) | \
> > - BIT_ULL(VCPU_REGS_RDI) | \
> > - BIT_ULL(VCPU_REGS_R8) | \
> > - BIT_ULL(VCPU_REGS_R9) | \
> > - BIT_ULL(VCPU_REGS_R10) | \
> > - BIT_ULL(VCPU_REGS_R11) | \
> > - BIT_ULL(VCPU_REGS_R12) | \
> > - BIT_ULL(VCPU_REGS_R13) | \
> > - BIT_ULL(VCPU_REGS_R14) | \
> > - BIT_ULL(VCPU_REGS_R15))
> > +#define TDX_REGS_AVAIL_SET (BIT(VCPU_REG_EXIT_INFO_1) | \
> > + BIT(VCPU_REG_EXIT_INFO_2) | \
> > + BIT(VCPU_REGS_RAX) | \
> > + BIT(VCPU_REGS_RBX) | \
> > + BIT(VCPU_REGS_RCX) | \
> > + BIT(VCPU_REGS_RDX) | \
> > + BIT(VCPU_REGS_RBP) | \
> > + BIT(VCPU_REGS_RSI) | \
> > + BIT(VCPU_REGS_RDI) | \
> > + BIT(VCPU_REGS_R8) | \
> > + BIT(VCPU_REGS_R9) | \
> > + BIT(VCPU_REGS_R10) | \
> > + BIT(VCPU_REGS_R11) | \
> > + BIT(VCPU_REGS_R12) | \
> > + BIT(VCPU_REGS_R13) | \
> > + BIT(VCPU_REGS_R14) | \
> > + BIT(VCPU_REGS_R15))
> >
>
> Not related to this series, but this made me look into whether these
> registers are truly needed to be set as available for TDX.
>
> Firstly, all the listed registers are marked as available immediately after
> exiting from tdh_vp_enter(), but except VCPU_REG_EXIT_INFO_1 and
> VCPU_REG_EXIT_INFO_2 are immediately saved to the common 'struct vcpu_vt',
> all other GPRs are not saved to vcpu->arch.regs[], which means marking GPRs
> available immediately doesn't quite make sense.
>
> In fact, IIUC other than when the TD exits with TDVMCALL on which TD shares
> couple of GPRs with KVM, KVM has no way to get TD's GPRs. So perhaps it
> makes more sense is to mark the shared GPRs available upon TDVMCALL.
>
> But even that does not make sense from KVM's "GPR available" perspective,
> because TDVMCALL has a different ABI from KVM's existing infrastructure for
> e.g., CPUID/MSR emulation. E.g., KVM uses RCX/RAX/RDX for MSR emulation,
> but TDVMCALL<MSR.WRITE> uses R12 and R13 to convey MSR index/value:
>
> case EXIT_REASON_MSR_WRITE:
> kvm_rcx_write(vcpu, tdx->vp_enter_args.r12);
> kvm_rax_write(vcpu, tdx->vp_enter_args.r13 & -1u);
> kvm_rdx_write(vcpu, tdx->vp_enter_args.r13 >> 32);
>
> So I think the most accurate way is to explicitly mark the relevant GPRs
> available for each type of TDVMCALL. I am not sure whether it's worth to do
> though, because AFAICT there's no real bug in the existing code, other than
> "marking GPRs not in vcpu->arch.regs[] as available looks wrong".
>
> A less invasive way is to mark all possible GPRs that can be used in
> TDVMCALL emulation available once after TD exits. AFAICT the KVM hypercall
> uses most GPRs (RAX/RBX/RCX/RDX/RSI) and all other TDVMCALLs only use a
> subset, so maybe we can remove other GPRs from the available list (the diff
> in [*] passed my test of booting/destroying TD).
>
> Bug again, not sure whether it's worth doing.
Not worth doing. Because VMX and SVM make all GRPs available immediately, except
for RSP, KVM ignores avail/dirty for GPRs. I.e. "fixing" TDX will just shift the
"bugs" elsewhere.
More importantly, because the TDX-Module *requires* RCX (the GPR that holds the
mask of registers to expose to the VMM) to be hidden on TDVMCALL, KVM *can't*
do any kind of meaningful "available" tracking. Versus sev_es_validate_vmgexit(),
which can at least sanity check that the registers needed to service a hypercall
have valid data.
So unfortunately, since we need to rely on testing to verify KVM's implementation
no matter what, I don't think it'd be a net positive to overhaul KVM's handling
of GPRs to support SEV-ES+'s and TDX's "sometimes available" GPR set.
^ permalink raw reply
* Re: [PATCH v7 06/22] coco/tdx-host: Expose P-SEAMLDR information via sysfs
From: Edgecombe, Rick P @ 2026-04-13 19:08 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-7-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> TDX module updates require userspace to select the appropriate module
> to load. Expose necessary information to facilitate this decision. Two
> values are needed:
>
> - P-SEAMLDR version: for compatibility checks between TDX module and
> P-SEAMLDR
> - num_remaining_updates: indicates how many updates can be performed
Can you explain how all of these overlap?
- TDX module supports module update
- SEAMLDR supports NUM_REMAINING_UPDATES info
- SEAMLDR supports VERSION info
If the TDX module supports module update, do we know the SEAMLDR supports this
other stuff somehow? It might be worth a comment the reasoning.
>
> Expose them as tdx-host device attributes. Make seamldr attributes
> visible only when the update feature is supported, as that's their sole
> purpose.
>
> Unconditional exposure is also problematic because reading them
> triggers P-SEAMLDR calls that break KVM on CPUs with a specific erratum
> (to be enumerated and handled in a later patch).
Since this is later handled with the errata check, what is the point being made
here?
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 18/22] coco/tdx-host: Don't expose P-SEAMLDR features on CPUs with erratum
From: Edgecombe, Rick P @ 2026-04-13 19:28 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-19-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
>
> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
> SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
> instruction.
>
> Clearing the current VMCS behind KVM's back will break KVM.
>
> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
> bug bit for this erratum and refuse to expose P-SEAMLDR features (e.g.,
> TDX module updates) on affected CPUs.
>
> == Alternatives ==
> Two workarounds were considered but both were rejected:
>
> 1. Save/restore the current VMCS around P-SEAMLDR calls. This produces ugly
> assembly code [1] and doesn't play well with #MCE or #NMI if they
> need to use the current VMCS.
>
> 2. Move KVM's VMCS tracking logic to the TDX core code, which would break
> the boundary between KVM and the TDX core code [2].
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Link: https://lore.kernel.org/kvm/fedb3192-e68c-423c-93b2-a4dc2f964148@intel.com/ # [1]
> Link: https://lore.kernel.org/kvm/aYIXFmT-676oN6j0@google.com/ # [2]
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
> drivers/virt/coco/tdx-host/tdx-host.c | 8 ++++++++
> 4 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index dbe104df339b..377d009b7e2e 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -572,4 +572,5 @@
> #define X86_BUG_ITS_NATIVE_ONLY X86_BUG( 1*32+ 8) /* "its_native_only" CPU is affected by ITS, VMX is not affected */
> #define X86_BUG_TSA X86_BUG( 1*32+ 9) /* "tsa" CPU is affected by Transient Scheduler Attacks */
> #define X86_BUG_VMSCAPE X86_BUG( 1*32+10) /* "vmscape" CPU is affected by VMSCAPE attacks from guests */
> +#define X86_BUG_SEAMRET_INVD_VMCS X86_BUG( 1*32+11) /* "seamret_invd_vmcs" SEAMRET from P-SEAMLDR clears the current VMCS */
> #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index b92ff87e3560..a5a5b373ec42 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -136,6 +136,7 @@
> #define VMX_BASIC_INOUT BIT_ULL(54)
> #define VMX_BASIC_TRUE_CTLS BIT_ULL(55)
> #define VMX_BASIC_NO_HW_ERROR_CODE_CC BIT_ULL(56)
> +#define VMX_BASIC_NO_SEAMRET_INVD_VMCS BIT_ULL(60)
>
> static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
> {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index d144860e17c2..92ab1d98e1b8 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -39,6 +39,7 @@
> #include <asm/cpu_device_id.h>
> #include <asm/processor.h>
> #include <asm/mce.h>
> +#include <asm/vmx.h>
>
> #include "seamcall_internal.h"
> #include "tdx.h"
> @@ -1462,6 +1463,8 @@ static struct notifier_block tdx_memory_nb = {
>
> static void __init check_tdx_erratum(void)
> {
> + u64 basic_msr;
> +
> /*
> * These CPUs have an erratum. A partial write from non-TD
> * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> @@ -1473,6 +1476,14 @@ static void __init check_tdx_erratum(void)
> case INTEL_EMERALDRAPIDS_X:
> setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> }
> +
> + /*
> + * Some TDX-capable CPUs have an erratum where the current VMCS is
> + * cleared after calling into P-SEAMLDR.
> + */
> + rdmsrq(MSR_IA32_VMX_BASIC, basic_msr);
> + if (!(basic_msr & VMX_BASIC_NO_SEAMRET_INVD_VMCS))
> + setup_force_cpu_bug(X86_BUG_SEAMRET_INVD_VMCS);
> }
>
> void __init tdx_init(void)
> diff --git a/drivers/virt/coco/tdx-host/tdx-host.c b/drivers/virt/coco/tdx-host/tdx-host.c
> index 746a5eef004d..71ea94da8e22 100644
> --- a/drivers/virt/coco/tdx-host/tdx-host.c
> +++ b/drivers/virt/coco/tdx-host/tdx-host.c
> @@ -100,6 +100,14 @@ static bool can_expose_seamldr(void)
> if (!sysinfo)
> return false;
>
> + /*
> + * Calling P-SEAMLDR on CPUs with the seamret_invd_vmcs bug clears
> + * the current VMCS, which breaks KVM. Verify the erratum is not
> + * present before exposing P-SEAMLDR features.
> + */
> + if (boot_cpu_has_bug(X86_BUG_SEAMRET_INVD_VMCS))
> + return false;
> +
> return tdx_supports_runtime_update(sysinfo);
> }
>
^ permalink raw reply
* Re: [PATCH v7 19/22] x86/virt/tdx: Enable TDX module runtime updates
From: Edgecombe, Rick P @ 2026-04-13 19:40 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-20-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> All pieces of TDX module runtime updates are in place. Enable it if it
> is supported.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
> arch/x86/include/asm/tdx.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* [PATCH v4 0/7] Add RMPOPT support.
From: Ashish Kalra @ 2026-04-13 19:42 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
From: Ashish Kalra <ashish.kalra@amd.com>
In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.
The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.
RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.
RMPOPT instruction currently supports two functions. In case of the
verify and report status function the CPU will read the RMP contents,
verify the entire 1GB region starting at the provided SPA is HV-owned.
For the entire 1GB region it checks that all RMP entries in this region
are HV-owned (i.e, not in assigned state) and then accordingly updates
the RMPOPT table to indicate if optimization has been enabled and
provide indication to software if the optimization was successful.
In case of report status function, the CPU returns the optimization
status for the 1GB region.
The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned. Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed during RMPUPDATE instruction.
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
As SNP is enabled by default the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.
This patch-series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allow RMPUPDATE to disable
those optimizations as SNP guests are launched.
Support for RAM larger than 2 TB will be added in follow-on series.
This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.
RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.
Delaying work allows batching of multiple SNP guest terminations.
Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in follow-on series.
Additionally add debugfs interface to report per-CPU RMPOPT status
across all system RAM.
v4:
- Add new wrmsrq_on_cpus() helper to write same u64 value to a
per-CPU MSR across a cpumask without per-cpu struct allocation
overhead.
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
and instead setup and enable RMPOPT after SNP is enabled and
initialized.
v3:
- Drop all RMPOPT kthread support and introduce adding custom and
dedicated workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
re-enable RMP optimizations during guest shutdown using the
asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
rmpopt_report_status() wrappers on top which use rax and rcx
parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
system RAM upto 2TB and all CPUs, by optimizing each range on one CPU
first, then let other CPUs execute RMPOPT in parallel so they can skip
most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
as specified by RMPOPT specifications and not be dependent on PUD_SIZE
which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
all CPUs that removes all ugly casting to use on_each_cpu_mask().
- Fix inline commits and patch commit messages
v2:
- Drop all NUMA and Socket configuration and enablement support and
enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecuture specific
parent directory.
Ashish Kalra (7):
x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
x86/msr: add wrmsrq_on_cpus helper
x86/sev: Initialize RMPOPT configuration MSRs
x86/sev: Add support to perform RMP optimizations asynchronously
x86/sev: Add interface to re-enable RMP optimizations.
KVM: SEV: Perform RMP optimizations on SNP guest shutdown
x86/sev: Add debugfs support for RMPOPT
arch/x86/coco/core.c | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/msr.h | 5 +
arch/x86/include/asm/sev.h | 4 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kvm/svm/sev.c | 2 +
arch/x86/lib/msr-smp.c | 20 +++
arch/x86/virt/svm/sev.c | 271 ++++++++++++++++++++++++++++-
drivers/crypto/ccp/sev-dev.c | 3 +
10 files changed, 310 insertions(+), 2 deletions(-)
--
2.43.0
^ permalink raw reply
* [PATCH v4 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ashish Kalra @ 2026-04-13 19:42 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Add a flag indicating whether RMPOPT instruction is supported.
RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks on the hypervisor and on non-SNP guests by
allowing RMP checks to be skipped when 1G regions of memory are known
not to contain any SEV-SNP guest memory.
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dbe104df339b..bce1b2e2a35c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
#define X86_FEATURE_ZEN6 ( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT ( 3*32+ 7) /* Support for AMD RMPOPT instruction */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
#define X86_FEATURE_ART ( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 42c7eac0c387..7ac3818c4502 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -65,6 +65,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_PMC_FREEZE, CPUID_EAX, 2, 0x80000022, 0 },
+ { X86_FEATURE_RMPOPT, CPUID_EDX, 0, 0x80000025, 0 },
{ X86_FEATURE_AMD_HTR_CORES, CPUID_EAX, 30, 0x80000026, 0 },
{ 0, 0, 0, 0, 0 }
};
--
2.43.0
^ permalink raw reply related
* [PATCH v4 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Ashish Kalra @ 2026-04-13 19:42 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
callers to allocate and populate per-cpu storage even when every CPU
receives the same value. This is unnecessary overhead for the common
case of writing a single uniform u64 to a per-CPU MSR across multiple
CPUs.
Add wrmsrq_on_cpus() which writes the same u64 value to the specified
MSR on all CPUs in the given cpumask.
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/msr.h | 5 +++++
arch/x86/lib/msr-smp.c | 20 ++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 9c2ea29e12a9..f5f63b4115c8 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -260,6 +260,7 @@ int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
int rdmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q);
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
@@ -289,6 +290,10 @@ static inline int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
wrmsrq(msr_no, q);
return 0;
}
+static inline void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+ wrmsrq_on_cpu(0, msr_no, q);
+}
static inline void rdmsr_on_cpus(const struct cpumask *m, u32 msr_no,
struct msr __percpu *msrs)
{
diff --git a/arch/x86/lib/msr-smp.c b/arch/x86/lib/msr-smp.c
index b8f63419e6ae..d2c91c9bb47b 100644
--- a/arch/x86/lib/msr-smp.c
+++ b/arch/x86/lib/msr-smp.c
@@ -94,6 +94,26 @@ int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
}
EXPORT_SYMBOL(wrmsrq_on_cpu);
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+ struct msr_info rv;
+ int this_cpu;
+
+ memset(&rv, 0, sizeof(rv));
+
+ rv.msr_no = msr_no;
+ rv.reg.q = q;
+
+ this_cpu = get_cpu();
+
+ if (cpumask_test_cpu(this_cpu, mask))
+ __wrmsr_on_cpu(&rv);
+
+ smp_call_function_many(mask, __wrmsr_on_cpu, &rv, 1);
+ put_cpu();
+}
+EXPORT_SYMBOL(wrmsrq_on_cpus);
+
static void __rwmsr_on_cpus(const struct cpumask *mask, u32 msr_no,
struct msr __percpu *msrs,
void (*msr_func) (void *info))
--
2.43.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox