Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Michael Roth @ 2026-04-29 23:51 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

On Tue, Apr 28, 2026 at 04:24:55PM -0700, Ackerley Tng via B4 Relay wrote:
> [Some people who received this message don't often get email from devnull+ackerleytng.google.com@kernel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> This is RFC v5 of guest_memfd in-place conversion support.
> 
> Up till now, guest_memfd supports the entire inode worth of memory being
> used as all-shared, or all-private. CoCo VMs may request guest memory to be
> converted between private and shared states, and the only way to support
> that currently would be to have the userspace VMM provide two sources of
> backing memory from completely different areas of physical memory.
> 
> pKVM has a use case for in-place sharing: the guest and host may be
> cooperating on given data, and pKVM doesn't protect data through
> encryption, so copying that given data between different areas of physical
> memory as part of conversions would be unnecessary work.
> 
> This series also serves as a foundation for guest_memfd huge page
> support. Now, guest_memfd only supports PAGE_SIZE pages, so if two sources
> of backing memory are used, the userspace VMM could maintain a steady total
> memory utilized by punching out the pages that are not used. When huge
> pages are available in guest_memfd, even if the backing memory source
> supports hole punching within a huge page, punching out pages to maintain
> the total memory utilized by a VM would be introducing lots of
> fragmentation.
> 
> In-place conversion avoids fragmentation by allowing the same physical
> memory to be used for both shared and private memory, with guest_memfd
> tracks the shared/private status of all the pages at a per-page
> granularity.
> 
> The central principle, which guest_memfd continues to uphold, is that any
> guest-private page will not be mappable to host userspace. All pages will
> be mmap()-able in host userspace, but accesses to guest-private pages (as
> tracked by guest_memfd) will result in a SIGBUS.
> 
> This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
> guest_memfd ioctl) that allows userspace to set memory
> attributes (shared/private) directly through the guest_memfd. This is the
> appropriate interface because shared/private-ness is a property of memory
> and hence the request should be sent directly to the memory provider -
> guest_memfd.
> 
> Tested with both CONFIG_KVM_VM_MEMORY_ATTRIBUTES enabled and disabled:
> 
> + tools/testing/selftests/kvm/guest_memfd_test.c
> + tools/testing/selftests/kvm/pre_fault_memory_test.c
> + tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
> + tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
> 
> Updates for this revision:
> 
> + For TDX and SNP, PRESERVE supported only before VM is finalized only for
>   to_private conversions.
>     + This allows PRESERVE to be used as part of the VM memory
>       loading/encryption flow
>     + Only support PRESERVE for to_private conversions (to_shared on
>       populated memory on TDX would cause zeroing)
>     + Relaxed constraints for SNP and TDX to allow NULL to be passed as
>       source address.
> + Dropped KVM_CAP_MEMORY_ATTRIBUTES2. KVM_CAP_MEMORY_ATTRIBUTES reports
>   attributes supported by the KVM_SET_MEMORY_ATTRIBUTES VM ioctl, and
>   KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports attributes supported bt the
>   KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl.
>     + KVM_SET_MEMORY_ATTRIBUTES2 is not supported by the VM ioctl
> + Resolve locking issue when kvm_gmem_get_attribute() is called from
>   kvm_mmu_zap_collapsible_spte() by bugging the VM. guest_memfd memslots
>   don't support dirty tracking, so the locking issue is not on an
>   accessible code path.
> + Moved guest_memfd_conversions_test.c to only be compiled and tested for
>   x86, since it depends so heavily on KVM_X86_SW_PROTECTED_VM's as a
>   testing vehicle
> 
> TODOs
> 
> + Perhaps further clarify PRESERVE flag: [8]

I made a super-long-winded reply to that thread, but to summarize:

PRESERVE flag has different enumeration/behavior/enforcement for pre-launch
vs. post-launch, and similar considerations might come into play for
other flags, so to make it easier to enumerate what flags are available
for pre-launch/post-launch, maybe we could have 2 capabilities instead
of 1:

  KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
  KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS

where SNP/TDX would only advertise PRESERVE for PRE_LAUNCH, and pKVM I
guess would enumerate it for both (or maybe just POST_LAUNCH?)

That lets us keep the flags definitions more straightforward but still
allows userspace to easily enumerate what exactly should be available at
pre vs. post launch time, and give us some flexibility to detail
variations in behavior between the 2 phases without documenting
edge-cases in terms of VM types.

> + Resolve issue where guest_memfd_conversions_test, which uses the
>   kselftest framework, doesn't perform teardown on assertion
>   failure. Please see proposal at [9]
> + Test with TDX selftests. We're in the process of rebasing TDX selftests
>   on this series and will post updates when that's tested.
> 
> I would like feedback on:
> 
> + Content modes: 0 (MODE_UNSPECIFIED), ZERO, and PRESERVE. Is that all
>   good, or does anyone think there is a use case for something else?
> + Should the content modes apply even if no attribute changes are required?
>     + See notes added in "KVM: guest_memfd: Apply content modes while
>       setting memory attributes"

Looking at the example you have there:

  + Note: These content modes apply to the entire requested range, not
  + just the parts of the range that underwent conversion. For example, if
  + this was the initial state:
  + 
  +   * [0x0000, 0x1000): shared
  +   * [0x1000, 0x2000): private
  +   * [0x2000, 0x3000): shared
  + and range [0x0000, 0x3000) was set to shared, the content mode would
  + apply to all memory in [0x0000, 0x3000), not just the range that
  + underwent conversion [0x1000, 0x2000).

Userspace would be aware of whether the range contains pages that were
already set to private, so if it really wants to set the just the
[0x1000, 0x2000) range to shared with appropriate content mode, it is
fully able to do so by just issuing the ioctl for that specific range.
If it attempts to issue it for the entire range, it only seems like it
would defy normal expectations and cause confusion to skip ranges, and
I'm not sure it gains us anything useful in exchange for that potential
confusion.

>     + Possibly related: should setting attributes be allowed if some
>       sub-range requested already has the requested attribute?

As it is now, userspace has that capability (to use finer-grained ranges
if it doesn't want to re-issue unecessary/unwanted conversions), similar
to above. And KVM internally will just issue kvm_arch_gmem_prepare()
calls so that architecture-specific handling can deal with this case
(e.g. SNP's sev_gmem_prepare() already checks if the corresponding
attribute is set in the RMP table and just skips it otherwise). So I
don't think we really gain anything but added complexity if we try to
make gmem more selective about it.

-Mike

> + Structure of how various content modes are checked for support or
>   applied? I used overridable weak functions for architectures that haven't
>   defined support, and defined overrides for x86 to show how I think it would
>   work. For CoCo platforms, I only implemented TDX for illustration purposes
>   and might need help with the other platforms. Should I have used
>   kvm_x86_ops? I tried and found myself defining lots of boilerplate.
> + The use of private_mem_conversions_test.sh to run different options in
>   private_mem_conversions_test. If this makes sense, I'll adjust the
>   Makefile to have private_mem_conversions_test tested only via the script.
> 
> This series is based on kvm/next, and here's the tree for your convenience:
> 
> https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v5
> 
> Older series:
> 
> + RFCv4 is at [7]
> + RFCv3 is at [6]
> + RFCv2 is at [5]
> + RFCv1 is at [4]
> + Previous versions of this feature, part of other series, are available at
>   [1][2][3].
> 
> [1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
> [2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
> [3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
> [4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
> [5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/r/20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com/T/
> [7] https://lore.kernel.org/all/20260326-gmem-inplace-conversion-v4-0-e202fe950ffd@google.com/T/
> [8] https://lore.kernel.org/all/CAEvNRgGbMhkX310CkFY_M5x-zod=BDTiuznrZ0XvFPUK7weL1A@mail.gmail.com/
> [9] https://lore.kernel.org/all/20260414-selftest-global-metadata-v1-0-fd223922bc57@google.com/T/
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Ackerley Tng (34):
>       KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
>       KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
>       KVM: guest_memfd: Only prepare folios for private pages
>       KVM: Move kvm_supported_mem_attributes() to kvm_host.h
>       KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
>       KVM: guest_memfd: Ensure pages are not in use before conversion
>       KVM: guest_memfd: Call arch invalidate hooks on conversion
>       KVM: guest_memfd: Return early if range already has requested attributes
>       KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
>       KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
>       KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
>       KVM: guest_memfd: Determine invalidation filter from memory attributes
>       KVM: guest_memfd: Introduce default handlers for content modes
>       KVM: guest_memfd: Apply content modes while setting memory attributes
>       KVM: x86: Support SW_PROTECTED_VM in applying content modes
>       KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>       KVM: x86: Support SNP and TDX applying content modes
>       KVM: x86: Bug CoCo VM on page fault before finalizing
>       KVM: Add CAP to enumerate supported SET_MEMORY_ATTRIBUTES2 flags
>       KVM: selftests: Test basic single-page conversion flow
>       KVM: selftests: Test conversion flow when INIT_SHARED
>       KVM: selftests: Test conversion precision in guest_memfd
>       KVM: selftests: Test conversion before allocation
>       KVM: selftests: Convert with allocated folios in different layouts
>       KVM: selftests: Test that truncation does not change shared/private status
>       KVM: selftests: Test conversion with elevated page refcount
>       KVM: selftests: Test that conversion to private does not support ZERO
>       KVM: selftests: Support checking that data not equal expected
>       KVM: selftests: Test that not specifying a conversion flag scrambles memory contents
>       KVM: selftests: Reset shared memory after hole-punching
>       KVM: selftests: Provide function to look up guest_memfd details from gpa
>       KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
>       KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
>       KVM: selftests: Add script to exercise private_mem_conversions_test
> 
> Michael Roth (1):
>       KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
> 
> Sean Christopherson (18):
>       KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
>       KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
>       KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
>       KVM: Stub in ability to disable per-VM memory attribute tracking
>       KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
>       KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
>       KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
>       KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
>       KVM: selftests: Create gmem fd before "regular" fd when adding memslot
>       KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
>       KVM: selftests: Add support for mmap() on guest_memfd in core library
>       KVM: selftests: Add selftests global for guest memory attributes capability
>       KVM: selftests: Add helpers for calling ioctls on guest_memfd
>       KVM: selftests: Test that shared/private status is consistent across processes
>       KVM: selftests: Provide common function to set memory attributes
>       KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
>       KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes
>       KVM: selftests: Update private memory exits test to work with per-gmem attributes
> 
>  Documentation/virt/kvm/api.rst                     | 139 ++++-
>  .../virt/kvm/x86/amd-memory-encryption.rst         |  19 +-
>  Documentation/virt/kvm/x86/intel-tdx.rst           |   4 +
>  arch/x86/include/asm/kvm_host.h                    |   2 +-
>  arch/x86/kvm/Kconfig                               |  15 +-
>  arch/x86/kvm/mmu/mmu.c                             |  20 +-
>  arch/x86/kvm/svm/sev.c                             |  18 +-
>  arch/x86/kvm/vmx/tdx.c                             |   8 +-
>  arch/x86/kvm/x86.c                                 | 145 ++++-
>  include/linux/kvm_host.h                           |  74 ++-
>  include/trace/events/kvm.h                         |   4 +-
>  include/uapi/linux/kvm.h                           |  21 +
>  mm/swap.c                                          |   2 +
>  tools/testing/selftests/kvm/Makefile.kvm           |   5 +
>  tools/testing/selftests/kvm/include/kvm_util.h     | 141 ++++-
>  tools/testing/selftests/kvm/include/test_util.h    |  34 +-
>  .../selftests/kvm/kvm_has_gmem_attributes.c        |  17 +
>  tools/testing/selftests/kvm/lib/kvm_util.c         | 130 +++--
>  tools/testing/selftests/kvm/lib/test_util.c        |   7 -
>  tools/testing/selftests/kvm/lib/x86/sev.c          |   2 +-
>  .../testing/selftests/kvm/pre_fault_memory_test.c  |   4 +-
>  .../kvm/x86/guest_memfd_conversions_test.c         | 552 +++++++++++++++++++
>  .../kvm/x86/private_mem_conversions_test.c         |  55 +-
>  .../kvm/x86/private_mem_conversions_test.sh        | 128 +++++
>  .../selftests/kvm/x86/private_mem_kvm_exits_test.c |  38 +-
>  virt/kvm/Kconfig                                   |   3 +-
>  virt/kvm/guest_memfd.c                             | 591 ++++++++++++++++++++-
>  virt/kvm/kvm_main.c                                |  87 ++-
>  28 files changed, 2075 insertions(+), 190 deletions(-)
> ---
> base-commit: 39f1c201b93f4ff71631bac72cff6eb155f976a4
> change-id: 20260225-gmem-inplace-conversion-bd0dbd39753a
> 
> Best regards,
> --
> Ackerley Tng <ackerleytng@google.com>
> 
> 

^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-04-30  0:45 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-9-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> P-SEAMLDR uses the SEAMLDR_PARAMS structure to describe TDX module
> update requests. This structure contains physical addresses pointing to
> the module binary and its signature file (or sigstruct), along with an
> update scenario field.
> 
> TDX modules are distributed in the tdx_blob format defined in
> blob_structure.txt from the "Intel TDX module Binaries Repository". A
> tdx_blob contains a header, sigstruct, and module binary. This is also the
> format supplied by the userspace to the kernel.
> 
> Parse the tdx_blob format and populate a SEAMLDR_PARAMS structure. The
> header is consumed solely by the kernel to extract the sigstruct and
> module, so validate it before processing to protect the kernel ABI. The
> sigstruct and module are passed to and validated by P-SEAMLDR, so don't
> duplicate any validation in the kernel.
> 
> Note: the sigstruct_pa field in SEAMLDR_PARAMS has been extended to
> a 4-element array. The updated "SEAM Loader (SEAMLDR) Interface
> Specification" will be published separately.

These changelogs have all the right info, but I find them really hard to
parse. For instance, if you're going to have a 'struct seamldr_params',
then just stick with that name. Don't use the "SEAMLDR_PARAMS" name too.

Start with the data structures:

There are two important ABIs here:

'struct tdx_blob'       - the on-disk and in-memory format for a TDX
 		          module update image.
'struct seamldr_params' - The in-memory ABI passed to the TDX module
			  loader. Points to a single 'struct tdx_blob'

> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 650c0f097aac..f70be8e2a07b 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -7,6 +7,7 @@
>  #define pr_fmt(fmt)	"seamldr: " fmt
>  
>  #include <linux/mm.h>
> +#include <linux/slab.h>
>  #include <linux/spinlock.h>
>  
>  #include <asm/seamldr.h>
> @@ -16,6 +17,33 @@
>  /* P-SEAMLDR SEAMCALL leaf function */
>  #define P_SEAMLDR_INFO			0x8000000000000000
>  
> +#define SEAMLDR_MAX_NR_MODULE_PAGES	496
> +#define SEAMLDR_MAX_NR_SIG_PAGES	4

Gah. All this complexity for the variable-length sigstruct to save a
maximum of 4 pages. Wow.

This whole thing could have been:

struct tdx_image {
	u16	version; // This ABI is always 0x100
	u16	checksum;
	u8	signature[8];
	u32	length;
	u8	reserved[4076];
	u8	sigstruct[SIGSTRUCT_SIZE];
	u8	module[];
}

One variable array. No module offset calculations or munging.

Why do we do this to ourselves for 3 measly pages? ;)

> +/*
> + * The seamldr_params "scenario" field specifies the operation mode:
> + * 0: Install TDX module from scratch (not used by kernel)
> + * 1: Update existing TDX module to a compatible version
> + */
> +#define SEAMLDR_SCENARIO_UPDATE		1
> +
> +/*
> + * This is called the "SEAMLDR_PARAMS" data structure and is defined
> + * in "SEAM Loader (SEAMLDR) Interface Specification".
> + *
> + * It describes the TDX module that will be installed.
> + */
> +struct seamldr_params {
> +	u32	version;
> +	u32	scenario;
> +	u64	sigstruct_pa[SEAMLDR_MAX_NR_SIG_PAGES];
> +	u8	reserved[80];
> +	u64	num_module_pages;
> +	u64	mod_pages_pa_list[SEAMLDR_MAX_NR_MODULE_PAGES];
> +} __packed;
> +
> +static_assert(sizeof(struct seamldr_params) == 4096);
> +
>  /*
>   * Serialize P-SEAMLDR calls since the hardware only allows a single CPU to
>   * interact with P-SEAMLDR simultaneously. Use raw version as the calls can
> @@ -43,6 +71,128 @@ int seamldr_get_info(struct seamldr_info *seamldr_info)
>  }
>  EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
>  
> +/*
> + * Intel TDX module blob. Its format is defined at:
> + * https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt

Heh, so URLs are not OK in changelogs because they go stale, but they're
fine in the code?

> + * Note this structure differs from the reference above: the two variable-length
> + * fields "@sigstruct" and "@module" are represented as a single "@data" field
> + * here and split programmatically using the offset_of_module value.

This is good info. But, it's copied and pasted between the changelog and
here. I'd choose one, honestly.

> + * Note @offset_of_module is relative to the start of struct tdx_blob, not
> + * @data, and @length is the total length of the blob, not the length of
> + * @data.
> + */

Out of line comments aren't great. Do these in the data structure if at
all possible. Or, in the code. For instance:

> +struct tdx_blob {
> +	u16	version; // This ABI is always 0x100
> +	u16	checksum;
> +	u32	offset_of_module; // from start of tdx_blob
> +	u8	signature[8];
> +	u32	length;
> +	u32	reserved0;
> +	u64	reserved1[509];
> +	u8	data[]; // contains sigstruct[] and module[]
> +} __packed;

That's probably _better_ than the two duplicated comments that are there
now.

Also, why bother having two reserved arrays instead of:

	u8 reserved[4076];

?

> +/* Supported versions of the tdx_blob */
> +#define TDX_BLOB_VERSION_1	0x100

The comment here doesn't help much.

> +/*
> + * Blob fields are processed by the kernel and the payloads
> + * are passed to the TDX module. Do normal user input type
> + * check for any fields that don't get passed to the TDX module.
> + */

I made it this far, but I rather despise the 'blob' terminology. It's
just bad naming. We should really just call it 'tdx_update_image' or
'tdx_image' everywhere and stop saying 'blob'. Blob is one of those
names that people throw at things when they give up on naming.

> +static const struct tdx_blob *get_and_check_blob(const u8 *data, u32 size)
> +{
> +	const struct tdx_blob *blob = (const void *)data;
> +
> +	/*
> +	 * Ensure the size is valid otherwise reading any field from the
> +	 * blob may overflow.
> +	 */
> +	if (size <= sizeof(struct tdx_blob))
> +		return ERR_PTR(-EINVAL);

Couple of things here:

First, using sizeof() on a type with a variable-length array is a big
warning sign. It needs commenting. It's especially subtle because this
will go on and parse patently invalid 'data' images that don't even have
room for sigstruct[] or module[].

This is *specifically* about the pre-data[] fields that are going to be
read below.

> +	/*
> +	 * Don't care about user passing the wrong file, but protect
> +	 * kernel ABI by preventing accepting garbage.
> +	 */
> +	if (memcmp(blob->signature, "TDX-BLOB", 8))
> +		return ERR_PTR(-EINVAL);

Is there really no helper in the kernel anywhere that can safely do the
8-byte compare against two known-to-the-compiler 8-byte-wide fields
without hard-coding the 8?

> +	/*
> +	 * Ensure the offset of the module is within valid bounds and
> +	 * page-aligned.
> +	 */
> +	if (blob->offset_of_module >= size || blob->offset_of_module <= sizeof(struct tdx_blob))
> +		return ERR_PTR(-EINVAL);

Again, the sizeof(struct tdx_blob) is wonky. Why does this disallow
pointing blob->offset_of_module at reserved1[508] but not sigstruct[]?

> +	if (!IS_ALIGNED(blob->offset_of_module, PAGE_SIZE))
> +		return ERR_PTR(-EINVAL);

Wait a sec. Unless blob->offset_of_module==0, how could this check pass
and "blob->offset_of_module <= sizeof(struct tdx_blob)" fail?

> +	if (blob->version != TDX_BLOB_VERSION_1)
> +		return ERR_PTR(-EINVAL);

This should be the first check, IMNHO. If this doesn't pass then the
rest of the fields are invalid. No?

> +	if (blob->reserved0 || memchr_inv(blob->reserved1, 0, sizeof(blob->reserved1)))
> +		return ERR_PTR(-EINVAL);

There should not be two reserved, must-be-0 fields. There should be 1.
There must be 1.

Also I don't like the proposed data structure. It would make a lot more
sense to me if it were:

struct tdx_image_header {
	u16	version; // This ABI is always 0x100
	u16	checksum;
	u32	offset_of_module; // from start of the header
	u8	signature[8];
	u32	length;
	u8	reserved[4076];
}

struct p {
	u8[PAGE_SIZE];
};

struct tdx_image {
	struct tdx_image_header h;
	struct p pages[];
};

Then you can do things like check if sizeof(struct tdx_image_header) ==
PAGE_SIZE. Or whether offset_of_module points past the header.

That stuff only makes sense if you separate out the header structure
from the payloads which are the page-aligned sigstruct and module image
itself.

But exposing the double-variable-length arrays seems really wonky to me.

> +	return blob;
> +}
> +
> +static struct seamldr_params *alloc_seamldr_params(const struct tdx_blob *blob, unsigned int blob_size)
> +{

This does far more than "alloc" something.

> +	struct seamldr_params *params;
> +	int module_pg_cnt, sig_pg_cnt;
> +	const u8 *sig, *module;
> +	int i;
> +
> +	params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
> +	if (!params)
> +		return ERR_PTR(-ENOMEM);

kzmalloc(PAGE_SIZE, GFP_KERNEL) will save you a cast.

> +	/*
> +	 * Split the blob into a sigstruct and a module. Assume all
> +	 * size/offsets are within bounds of blob_size due to prior checks.
> +	 */
> +	sig		= blob->data;
> +	sig_pg_cnt	= (blob->offset_of_module - sizeof(struct tdx_blob)) >> PAGE_SHIFT;

Of course, the size of the first thing is defined by the offset of the
second thing.

This really should just be called ->end_of_sig.

> +	module		= (const u8 *)blob + blob->offset_of_module;
> +	module_pg_cnt	= (blob_size - blob->offset_of_module) >> PAGE_SHIFT;

This looks halfway sane:

	 /* adjust for size of the header: */
	sig_size    = blob->end_of_sig - PAGE_SIZE;
	module_size = module_image_size - blob->end_of_sig;

Then, page-adjust it later. One bit of magic at a time, please.

> +	/*
> +	 * Only use version 1 when required (sigstruct > 4KB) for backward
> +	 * compatibility with P-SEAMLDR that lacks version 1 support.
> +	 */
> +	params->version = sig_pg_cnt > 1;

Ewwww.

But what do we do if we're on an old P-SEAMLDR but get a big sigstruct?
It'll just fail?

How many old P-SEAMLDRs are there in the wild? Do we even care about this?

> +	params->scenario = SEAMLDR_SCENARIO_UPDATE;
> +
> +	for (i = 0; i < MIN(sig_pg_cnt, SEAMLDR_MAX_NR_SIG_PAGES); i++) {

Same for the MIN(). Do all the calculations separate from the loop.

> +		params->sigstruct_pa[i] = vmalloc_to_pfn(sig) << PAGE_SHIFT;
> +		sig += PAGE_SIZE;
> +	}
> +
> +	params->num_module_pages = MIN(module_pg_cnt, SEAMLDR_MAX_NR_MODULE_PAGES);
> +	for (i = 0; i < params->num_module_pages; i++) {
> +		params->mod_pages_pa_list[i] = vmalloc_to_pfn(module) << PAGE_SHIFT;
> +		module += PAGE_SIZE;
> +	}

Really what you want here is a helper. Have it take the module or
sigstruct pointer, a pointer to the pa_list[] and a maximum size.

Then call the helper twice.

> +	return params;
> +}
> +
> +static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
> +{
> +	const struct tdx_blob *blob;
> +
> +	blob = get_and_check_blob(data, size);
> +	if (IS_ERR(blob))
> +		return ERR_CAST(blob);
> +
> +	return alloc_seamldr_params(blob, size);
> +}
> +
> +DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
> +	    if (!IS_ERR_OR_NULL(_T)) free_page((unsigned long)_T))

Is this really worth it?

>  /**
>   * seamldr_install_module - Install a new TDX module.
>   * @data: Pointer to the TDX module update blob.
> @@ -52,6 +202,11 @@ EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
>   */
>  int seamldr_install_module(const u8 *data, u32 size)
>  {
> +	struct seamldr_params *params __free(free_seamldr_params) =
> +						init_seamldr_params(data, size);
> +	if (IS_ERR(params))
> +		return PTR_ERR(params);
> +
>  	/* TODO: Update TDX module here */
>  	return 0;
>  }

IMNHO, this patch has way too much going on. It took well over an hour
to go through it. That's problematic.

^ permalink raw reply

* [PATCH v2 0/4] struct page to PFN conversion for TDX guest private memory
From: Yan Zhao @ 2026-04-30  1:48 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata

Hi

This is v2 of the struct page to PFN conversion series, which converts TDX
guest private memory mapping/unmapping APIs from taking struct page to
taking PFN as input.

v2 is based on v7.1.0-rc1 + Sean's 4 cleanup patches (see details in
section "Base" below). The purpose is to get Dave's Ack, so Sean can take
it from the KVM x86 tree. The full stack of v2 is available at [14].

Compared to v1, v2:
- Rewrote commit messages of patches 1/2 (the conversion patches for
  mapping and unmapping) by specifically explaining the downside of
  assuming guest private memory must be backed by struct page, and
  incorporating Dave's rewording that also works for Sean.

- Updated patch 2 (which is for unmapping) to use tdx_quirk_reset_paddr()
  directly for unmapping guest private memory, and added patch 3 to drop
  the exported function tdx_quirk_reset_page() by having KVM invoke
  tdx_quirk_reset_paddr() in all scenarios, as suggested by Paolo and
  Xiaoyao.

- Split patch 4 (moving mk_keyed_paddr() to .c) out of patch 2, so patch 2
  can focus on the struct page to PFN conversion for unmapping.

Note: as agreed in v1, Kirill's concern about AUG "level" will be addressed
in a separate patch later.


Background
----------
TDX SEAMCALL wrappers take struct page as input, which provides:
1. Type safety
2. Make it harder to misuse and make it obvious that physical pages in RAM
   are expected from just looking at the API declaration [2][3][4][5].

This is appropriate for SEAMCALL wrappers for TDX control pages (e.g., TDR
page, TDCS pages, TDX SEPT pages), since KVM manages and allocates those
pages explicitly from core MM.

However, unlike TDX control pages, KVM guest memory is not necessarily
backed by refcounted struct page or even struct page (e.g., VM_PFNMAP
memory [6]). Taking struct page as input for SEAMCALL wrappers for
mapping/unmapping guest private memory imposes unnecessary assumptions on
how KVM and guest_memfd manage memory [7]. So, Sean suggested converting
from using struct page to PFN for SEAMCALL wrappers operating on guest
private memory [8].

This series therefore converts struct page to PFN for guest private memory
while keeping struct page for TDX control pages, and uses kvm_pfn_t for
type safety.


Sanity check
------------
Reasonable PFN sanity checks in the guest private memory mapping/unmapping
APIs are still agreed upon [9][10], such as checking TDX convertibility to
avoid SEAMCALL failure.

However, we decided not to provide any in-kernel sanity checks to avoid
introducing unnecessary overhead, both because those failures are supposed
to only occur when there are kernel bugs, and due to the lack of
satisfactory tiny checks to ensure convertibility. When unexpected
non-TDX-convertible PFNs are passed in, just let SEAMCALLs fail or have
#MCs or #PFs generated, which are obvious enough in themselves.


Base:
----
This v2 is rebased on top of v7.1.0-rc1 (kvm/next, commit 39f1c201b93f) +
the first 4 patches from Sean's v5 "TDX: Dynamic PAMT + S-EPT Hugepage"
series [11].

Note: due to the instability of v7.1.0-rc1, I also applied series [12] and
[13] to pass CI.


Changelogs:
-----------
v1 [1] --> v2:
    1. Updated patch logs of patches 1/2. (Dave).
    2. Added patch 3 to drop tdx_quirk_reset_page() and export
       tdx_quirk_reset_paddr() only. (Paolo, Xiaoyao)
    3. Split out patch 4 to move mk_keyed_paddr() from .h to .c.
    4. Rebased to v7.1.0-rc1 + Sean's 4 cleanup patches.

Sean's original patch [0] --> v1:
    1. Rebased to kvm-x86-next-2026.03.13.
    2. Split to 2 patches for easy review.  (Rick)
    3. Replaced "u64 pfn" with "kvm_pfn_t pfn"  (Rick)
    4. Dropped using PFN as input to tdx_reclaim_page(). (Rick)
    5. Move mk_keyed_paddr() from tdx.h to tdx.c. 

Thanks
Yan

[0] https://lore.kernel.org/kvm/20260129011517.3545883-26-seanjc@google.com
[1] https://lore.kernel.org/all/20260319005605.8965-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/all/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com
[3] https://lore.kernel.org/kvm/f4240495-120b-4124-b91a-b365e45bf50a@intel.com
[4] https://lore.kernel.org/kvm/435b8d81-b4de-4933-b0ae-357dea311488@intel.com
[5] https://lore.kernel.org/kvm/1b236a64-d511-49a2-9962-55f4b1eb08e3@intel.com
[6] https://lore.kernel.org/all/20241010182427.1434605-1-seanjc@google.com
[7] https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com
[8] https://lore.kernel.org/all/aWe1tKpFw-As6VKg@google.com
[9] https://lore.kernel.org/all/aWkVLViKBgiVGgaI@google.com
[10] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com
[11] https://lore.kernel.org/all/20260129011517.3545883-1-seanjc@google.com
[12] https://lore.kernel.org/all/20260423155611.216805954@infradead.org
[13] https://lore.kernel.org/all/20260428024746.1040531-1-binbin.wu@linux.intel.com
[14] https://github.com/intel-staging/tdx/tree/struct_page_to_pfn_v2


Sean Christopherson (2):
  x86/tdx: Use PFN directly for mapping guest private memory
  x86/tdx: Use PFN directly for unmapping guest private memory

Yan Zhao (2):
  x86/tdx: Drop exported function tdx_quirk_reset_page()
  x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users

 arch/x86/include/asm/tdx.h  | 21 ++++++-------------
 arch/x86/kvm/vmx/tdx.c      | 17 ++++++++--------
 arch/x86/virt/vmx/tdx/tdx.c | 40 +++++++++++++++++++++----------------
 3 files changed, 37 insertions(+), 41 deletions(-)

-- 
2.43.2


^ permalink raw reply

* [PATCH v2 1/4] x86/tdx: Use PFN directly for mapping guest private memory
From: Yan Zhao @ 2026-04-30  1:49 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>

From: Sean Christopherson <seanjc@google.com>

Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
mapping guest private memory and have them take PFN directly.

Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided. [1].

KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tied to the
regular CPU MMUs and the kernel code around them. Using 'struct page' for
TDX guest memory is not a good fit anywhere near the KVM MMU code [2].

Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
only.

[ Yan: Replace "u64 pfn" with "kvm_pfn_t pfn" ]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
---
 arch/x86/include/asm/tdx.h  |  6 ++++--
 arch/x86/kvm/vmx/tdx.c      |  7 +++----
 arch/x86/virt/vmx/tdx/tdx.c | 19 ++++++++++++-------
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0cb77ed4adc5..619aed134c83 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,6 +6,7 @@
 #include <linux/init.h>
 #include <linux/bits.h>
 #include <linux/mmzone.h>
+#include <linux/kvm_types.h>
 
 #include <asm/errno.h>
 #include <asm/ptrace.h>
@@ -189,11 +190,12 @@ static inline u64 mk_keyed_paddr(u16 hkid, struct page *page)
 
 u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
 u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
-u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, kvm_pfn_t pfn, struct page *source,
+		     u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, enum pg_level level, struct page *page,
 		     u64 *ext_err1, u64 *ext_err2);
 u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, enum pg_level level, struct page *page,
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, enum pg_level level, kvm_pfn_t pfn,
 		     u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, enum pg_level level, u64 *ext_err1,
 			u64 *ext_err2);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 77aea8920a4a..9b47dd257ff4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1624,8 +1624,8 @@ static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
 		return -EIO;
 
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       kvm_tdx->page_add_src, &entry, &level_state);
+	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn, kvm_tdx->page_add_src,
+			       &entry, &level_state);
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
@@ -1639,12 +1639,11 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
-	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, level, page, &entry, &level_state);
+	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, level, pfn, &entry, &level_state);
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a6e77afafa79..b24b81cea5ea 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -30,7 +30,6 @@
 #include <linux/suspend.h>
 #include <linux/syscore_ops.h>
 #include <linux/idr.h>
-#include <linux/kvm_types.h>
 #include <asm/page.h>
 #include <asm/special_insns.h>
 #include <asm/msr-index.h>
@@ -1568,6 +1567,11 @@ static void tdx_clflush_page(struct page *page)
 	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
 }
 
+static void tdx_clflush_pfn(kvm_pfn_t pfn)
+{
+	clflush_cache_range(__va(PFN_PHYS(pfn)), PAGE_SIZE);
+}
+
 static int pg_level_to_tdx_sept_level(enum pg_level level)
 {
 	WARN_ON_ONCE(level == PG_LEVEL_NONE);
@@ -1594,17 +1598,18 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_mng_addcx);
 
-u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, kvm_pfn_t pfn, struct page *source,
+		     u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
 		.rcx = gpa,
 		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(page),
+		.r8 = PFN_PHYS(pfn),
 		.r9 = page_to_phys(source),
 	};
 	u64 ret;
 
-	tdx_clflush_page(page);
+	tdx_clflush_pfn(pfn);
 	ret = seamcall_ret(TDH_MEM_PAGE_ADD, &args);
 
 	*ext_err1 = args.rcx;
@@ -1647,16 +1652,16 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
 EXPORT_SYMBOL_FOR_KVM(tdh_vp_addcx);
 
 u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, enum pg_level level,
-		     struct page *page, u64 *ext_err1, u64 *ext_err2)
+		     kvm_pfn_t pfn, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
 		.rcx = gpa | pg_level_to_tdx_sept_level(level),
 		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(page),
+		.r8 = PFN_PHYS(pfn),
 	};
 	u64 ret;
 
-	tdx_clflush_page(page);
+	tdx_clflush_pfn(pfn);
 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
 
 	*ext_err1 = args.rcx;
-- 
2.43.2


^ permalink raw reply related

* [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Yan Zhao @ 2026-04-30  1:49 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>

From: Sean Christopherson <seanjc@google.com>

Remove struct page assumptions/constraints in APIs for unmapping guest
private memory and have them take physical address directly.

Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided. [1].

KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using
"struct page" for TDX guest memory is not a good fit anywhere near the KVM
MMU code [2].

Therefore, for unmapping guest private memory: export
tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).

Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
tdx_sept_remove_private_spte() justifies using PAGE_SIZE.

Do not convert tdx_reclaim_page() to use PFN as input since it currently
does not remove guest private memory.

Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
exported to KVM only.

[Yan: Use kvm_pfn_t,exclude tdx_reclaim_page(),use tdx_quirk_reset_paddr()]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
---
 arch/x86/include/asm/tdx.h  | 14 +++++---------
 arch/x86/kvm/vmx/tdx.c      |  6 +++---
 arch/x86/virt/vmx/tdx/tdx.c |  9 +++++----
 3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 619aed134c83..65f7d874fb5a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -154,6 +154,7 @@ u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
 void tdx_quirk_reset_page(struct page *page);
+void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
 
 struct tdx_td {
 	/* TD root structure: */
@@ -177,15 +178,10 @@ struct tdx_vp {
 	struct page **tdcx_pages;
 };
 
-static inline u64 mk_keyed_paddr(u16 hkid, struct page *page)
+static inline u64 mk_keyed_paddr(u16 hkid, kvm_pfn_t pfn)
 {
-	u64 ret;
-
-	ret = page_to_phys(page);
-	/* KeyID bits are just above the physical address bits: */
-	ret |= (u64)hkid << boot_cpu_data.x86_phys_bits;
-
-	return ret;
+	/* KeyID bits are just above the physical address bits. */
+	return PFN_PHYS(pfn) | ((u64)hkid << boot_cpu_data.x86_phys_bits);
 }
 
 u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
@@ -218,7 +214,7 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, enum pg_level level,
 			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, kvm_pfn_t pfn);
 #else
 static inline void tdx_init(void) { }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9b47dd257ff4..a2aadc6d0174 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1774,8 +1774,8 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 					 enum pg_level level, u64 mirror_spte)
 {
-	struct page *page = pfn_to_page(spte_to_pfn(mirror_spte));
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
@@ -1814,11 +1814,11 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
 
-	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
+	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, pfn);
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
-	tdx_quirk_reset_page(page);
+	tdx_quirk_reset_paddr(PFN_PHYS(pfn), PAGE_SIZE);
 }
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b24b81cea5ea..e5a37ea2d4a0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
  * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
  * do the conversion explicitly via MOVDIR64B.
  */
-static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
+void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 {
 	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
 	unsigned long phys, end;
@@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 	 */
 	mb();
 }
+EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
 
 void tdx_quirk_reset_page(struct page *page)
 {
@@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
 {
 	struct tdx_module_args args = {};
 
-	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
+	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
 
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_tdr);
 
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, kvm_pfn_t pfn)
 {
 	struct tdx_module_args args = {};
 
-	args.rcx = mk_keyed_paddr(hkid, page);
+	args.rcx = mk_keyed_paddr(hkid, pfn);
 
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
-- 
2.43.2


^ permalink raw reply related

* [PATCH v2 3/4] x86/tdx: Drop exported function tdx_quirk_reset_page()
From: Yan Zhao @ 2026-04-30  1:50 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>

KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
and thus always have struct page.

However, it's also reasonable for KVM to reset those TDX control pages via
tdx_quirk_reset_paddr() directly, eliminating the need to export two
parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
header file is also unnecessary.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/tdx.h  | 1 -
 arch/x86/kvm/vmx/tdx.c      | 4 ++--
 arch/x86/virt/vmx/tdx/tdx.c | 6 ------
 3 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 65f7d874fb5a..9c63deaa0e8f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -153,7 +153,6 @@ int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
-void tdx_quirk_reset_page(struct page *page);
 void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
 
 struct tdx_td {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a2aadc6d0174..9bd4fd748e2a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -343,7 +343,7 @@ static int tdx_reclaim_page(struct page *page)
 
 	r = __tdx_reclaim_page(page);
 	if (!r)
-		tdx_quirk_reset_page(page);
+		tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
 	return r;
 }
 
@@ -597,7 +597,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
-	tdx_quirk_reset_page(kvm_tdx->td.tdr_page);
+	tdx_quirk_reset_paddr(page_to_phys(kvm_tdx->td.tdr_page), PAGE_SIZE);
 
 	__free_page(kvm_tdx->td.tdr_page);
 	kvm_tdx->td.tdr_page = NULL;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e5a37ea2d4a0..deb67e68f85f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -731,12 +731,6 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 }
 EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
 
-void tdx_quirk_reset_page(struct page *page)
-{
-	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
-}
-EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_page);
-
 static __init void tdmr_quirk_reset_pamt(struct tdmr_info *tdmr)
 
 {
-- 
2.43.2


^ permalink raw reply related

* [PATCH v2 4/4] x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users
From: Yan Zhao @ 2026-04-30  1:50 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>

Move mk_keyed_paddr() from tdx.h to tdx.c to avoid unnecessary header
inclusion and improve encapsulation since there are no users outside of
tdx.c.

No functional change intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/tdx.h  | 6 ------
 arch/x86/virt/vmx/tdx/tdx.c | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9c63deaa0e8f..503f9a3f46d6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -177,12 +177,6 @@ struct tdx_vp {
 	struct page **tdcx_pages;
 };
 
-static inline u64 mk_keyed_paddr(u16 hkid, kvm_pfn_t pfn)
-{
-	/* KeyID bits are just above the physical address bits. */
-	return PFN_PHYS(pfn) | ((u64)hkid << boot_cpu_data.x86_phys_bits);
-}
-
 u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
 u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, kvm_pfn_t pfn, struct page *source,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index deb67e68f85f..967482ae3c80 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1911,6 +1911,12 @@ u64 tdh_phymem_cache_wb(bool resume)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_phymem_cache_wb);
 
+static inline u64 mk_keyed_paddr(u16 hkid, kvm_pfn_t pfn)
+{
+	/* KeyID bits are just above the physical address bits. */
+	return PFN_PHYS(pfn) | ((u64)hkid << boot_cpu_data.x86_phys_bits);
+}
+
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
 {
 	struct tdx_module_args args = {};
-- 
2.43.2


^ permalink raw reply related

* Re: [PATCH kernel 4/9] dma/swiotlb: Stop forcing SWIOTLB for TDISP devices
From: Alexey Kardashevskiy @ 2026-04-30  3:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Robin Murphy, x86, linux-kernel, kvm, linux-pci,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Peter Zijlstra, Bjorn Helgaas, Marek Szyprowski,
	Andrew Morton, Catalin Marinas, Michael Ellerman, Mike Rapoport,
	Tom Lendacky, Ard Biesheuvel, Ashish Kalra, Stefano Garzarella,
	Melody Wang, Seongman Lee, Joerg Roedel, Nikunj A Dadhania,
	Michael Roth, Suravee Suthikulpanit, Andi Kleen,
	Kuppuswamy Sathyanarayanan, Tony Luck, David Woodhouse,
	Greg Kroah-Hartman, Denis Efremov, Geliang Tang, Piotr Gregor,
	Michael S. Tsirkin, Alex Williamson, Arnd Bergmann, Jesse Barnes,
	Jacob Pan, Yinghai Lu, Kevin Brodsky, Jonathan Cameron,
	Aneesh Kumar K.V (Arm), Xu Yilun, Herbert Xu, Kim Phillips,
	Konrad Rzeszutek Wilk, Stefano Stabellini, Claire Chang,
	linux-coco, iommu
In-Reply-To: <20260420235046.GA3199414@nvidia.com>



On 21/4/26 09:50, Jason Gunthorpe wrote:
> On Wed, Apr 15, 2026 at 04:32:14PM +1000, Alexey Kardashevskiy wrote:
>>>> So the DMA API should see the DMA_ATTR_CC_DECRYPTED and setup the
>>>> correct dma_dddr_t either by choosing the shared alias for the TDISP
>>>> device's vTOM, or setting the C bit in a vIOMMU S1.
>>>
>>> Something like that?
>>>
>>> https://github.com/AMDESE/linux-kvm/commit/266a41a1ea746557eb63debce886ce2c98820667
>>>
>>> With some little hacks I can make this tree do TDISP DMA to private or shared (swiotlb) memory by steering via this vTOM thing. Thanks,
>>
>> Ping? Thanks,
> 
> That seems approx right, it is broadl similar to what ARM is
> doing.. But the address map changes when switching to T=1 for AMD?

Well, at the moment, in my WIP tree, I map the guest memory in the host IOMMU twice -

1) from 0 offset (and this is shared when T=0 and private when T=1) and
2) from vTOM offset (say, 1TB) - and this half of the mapping is always shared.

Thanks,


-- 
Alexey


^ permalink raw reply

* [Invitation] bi-weekly guest_memfd upstream call on 2026-04-30 (today!)
From: David Hildenbrand (Arm) @ 2026-04-30 10:47 UTC (permalink / raw)
  To: linux-coco@lists.linux.dev, linux-mm@kvack.org, KVM

Hi,

Late reminder :(

Our next guest_memfd upstream call is scheduled for today, Thursday,
2026-04-30 at 8:00 - 9:00am (GMT-07:00) Pacific Time - Vancouver.

We'll be using the following Google meet:
http://meet.google.com/wxp-wtju-jzw

The meeting notes can be found at [1], where we also link recordings and
collect current guest_memfd upstream proposals. If you want an google
calendar invitation that also covers all future meetings, just write me
or Ackerley a mail.

In this meeting, we'll talk about progress on the in-place conversion front.

To put something to discuss onto the agenda, reply to this mail or add
them to the "Topics/questions for next meeting(s)" section in the
meeting notes as a comment.

[1]
https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?usp=sharing

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v2 1/4] x86/tdx: Use PFN directly for mapping guest private memory
From: Ackerley Tng @ 2026-04-30 17:43 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014929.24210-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
> mapping guest private memory and have them take PFN directly.
>

Thanks for this :)

> Having core TDX make assumptions that guest private memory must be backed
> by struct page (and/or folio) will create subtle dependencies on how
> KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
> allocated from core MM, if the memory is refcounted, or if the folio is
> split) that are easily avoided. [1].
>
> KVM's MMUs work with PFNs. This is very much an intentional design choice.
> It ensures that the KVM MMUs remain flexible and are not too tied to the
> regular CPU MMUs and the kernel code around them. Using 'struct page' for
> TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
>
> Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
> since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
> only.
>
> [ Yan: Replace "u64 pfn" with "kvm_pfn_t pfn" ]
>

Thanks for your help, Yan!

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
> Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-04-30 18:17 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014948.24226-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b24b81cea5ea..e5a37ea2d4a0 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>   * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
>   * do the conversion explicitly via MOVDIR64B.
>   */
> -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)

Could this be updated to use phys_addr_t base and size_t size instead of
generic unsigned long?

>  {
>  	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>  	unsigned long phys, end;
> @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>  	 */
>  	mb();
>  }
> +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
>
>  void tdx_quirk_reset_page(struct page *page)
>  {
> @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>  {
>  	struct tdx_module_args args = {};
>
> -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));

Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
I guess in this case since mk_keyed_paddr() is pretty much an internal
function, returning u64 also makes sense to indicate that it should only
be used to set 64 bit registers.

>
>  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>  }
>  EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_tdr);
>
> -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, kvm_pfn_t pfn)
>  {
>  	struct tdx_module_args args = {};
>
> -	args.rcx = mk_keyed_paddr(hkid, page);
> +	args.rcx = mk_keyed_paddr(hkid, pfn);
>
>  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>  }
> --
> 2.43.2

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v2 3/4] x86/tdx: Drop exported function tdx_quirk_reset_page()
From: Ackerley Tng @ 2026-04-30 18:29 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430015001.24242-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
> S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
> and thus always have struct page.
>
> However, it's also reasonable for KVM to reset those TDX control pages via
> tdx_quirk_reset_paddr() directly, eliminating the need to export two
> parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
> header file is also unnecessary.
>
> No functional change intended.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

Thanks for the cleanup!

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Edgecombe, Rick P @ 2026-04-30 18:38 UTC (permalink / raw)
  To: Tng, Ackerley, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y, dave.hansen@linux.intel.com
  Cc: Shahar, Sagi, Yamahata, Isaku, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Huang, Kai, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Li, Xiaoyao, tglx@kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal
In-Reply-To: <CAEvNRgGsMaioFvTWaLhnDiLbdx3tunZpo3Qzxdm1xeb4ef0J9Q@mail.gmail.com>

On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > 
> > [...snip...]
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index b24b81cea5ea..e5a37ea2d4a0 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> >    * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
> >    * do the conversion explicitly via MOVDIR64B.
> >    */
> > -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> > +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> 
> Could this be updated to use phys_addr_t base and size_t size instead of
> generic unsigned long?

A type is a really good idea. It could look like a virtual address, despite the
paddr in the name.

I added it to the cleanup list. But I would prefer to keep this series focused
on resolving the critical controversy around the struct page/pfn type, rather
than adding in more things to debate. If you don't mind.

> 
> >   {
> >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> >   	unsigned long phys, end;
> > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> >   	 */
> >   	mb();
> >   }
> > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
> > 
> >   void tdx_quirk_reset_page(struct page *page)
> >   {
> > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> >   {
> >   	struct tdx_module_args args = {};
> > 
> > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
> 
> Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
> I guess in this case since mk_keyed_paddr() is pretty much an internal
> function, returning u64 also makes sense to indicate that it should only
> be used to set 64 bit registers.

Yea, this is used to construct u64 inputs for seamcall args. So I think it
should keep returning u64s. Maybe instead it would be better to have a function
name pattern that denotes that purpose. We have some more helpers like this
coming.

But similarly, I'd like to not grow a snowballing cleanup series for this one.

^ permalink raw reply

* Re: [PATCH v8 10/21] x86/virt/seamldr: Shut down the current TDX module
From: Dave Hansen @ 2026-04-30 18:52 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-11-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
>  static int do_seamldr_install_module(void *seamldr_params)
>  {
>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
> +	int cpu = smp_processor_id();
> +	bool primary;
>  	int ret = 0;
>  
> +	primary = cpumask_first(cpu_online_mask) == cpu;

Isn't cpumask_first(cpu_online_mask)==0, always? I thought CPU 0 could
never be offlined.

^ permalink raw reply

* Re: [PATCH v8 11/21] x86/virt/tdx: Reset software states during TDX module shutdown
From: Dave Hansen @ 2026-04-30 18:58 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-12-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +	/*
> +	 * Clear global and per-CPU initialization flags so the new module
> +	 * can be fully re-initialized after a successful update.
> +	 *
> +	 * No locks needed as no concurrent accesses can occur here.
> +	 */
> +	tdx_module_initialized = false;
> +	sysinit_done = false;
> +	sysinit_ret = 0;
> +	for_each_possible_cpu(cpu)
> +		per_cpu(tdx_lp_initialized, cpu) = false;

This speaks to needing refactoring.

If there's global TDX state, it needs to be in a global TDX state
structure, not scattered across random global variables.

Imagine the structure is:

struct tdx_module_config foo;

That's 0's at boot. You have to init the TDX module to bring it out of
0's to valid state. It actually means something if you do:

	memset(&foo, 0, sizeof(foo));

Because it takes it right back to its bss state. That ^ is also handy
because it naturally just works if new state gets added.

Guess what will happen the next time someone adds:

	int sysinit_new_fancy_thing;

Someone will forget to add it to this code. Then the module gets updated
and breaks things in fun ways.

^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Dave Hansen @ 2026-04-30 19:00 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-13-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +			case MODULE_UPDATE_CPU_INSTALL:
> +				ret = seamldr_install(seamldr_params);
> +				break;

I really despise this naming. Could you please clarify with comments?

This reads like it is installing a seamldr, not telling a seamldr to
perform an install.

^ permalink raw reply

* Re: [PATCH v8 15/21] x86/virt/tdx: Refresh TDX module version after update
From: Dave Hansen @ 2026-04-30 19:14 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-16-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> The kernel exposes the TDX module version through sysfs so userspace can
> check update compatibility. That information needs to remain accurate
> across runtime updates.
> 
> A runtime update may change the module's update_version, so refresh the
> cached version after a successful update and emit a log message to show
> the version change.
> 
> Drop __ro_after_init from tdx_sysinfo because it is now updated at runtime.
> 
> Perform the refresh outside of stop_machine() since printk() within
> stop_machine() would add significant latency.
> 
> Do not refresh the rest of tdx_sysinfo. Refreshing them at runtime could
> disrupt running software that relies on the previously reported values.

This needs more explanation. I think the reasoning is quite nuanced.

tdx_sysinfo is just a cache of what the TDX module is and can do. If
that changes, it means the TDX module changed. So you somehow need to
argue why it's OK to hide those changes from the tdx_sysinfo users.

Why would they be confused by tdx_sysinfo changes but not by the TDX
module *itself* changing?

> Note that major and minor versions are not refreshed because runtime updates
> are supported only between releases with identical major and minor versions.

I'd rather have this in code than a changelog comment.

If they can't change then warn if they do.

> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 98a8d9d3ae25..c81b26c4bac1 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -306,6 +306,8 @@ DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
>   */
>  int seamldr_install_module(const u8 *data, u32 size)
>  {
> +	int ret;
> +
>  	struct seamldr_params *params __free(free_seamldr_params) =
>  						init_seamldr_params(data, size);
>  	if (IS_ERR(params))
> @@ -314,6 +316,10 @@ int seamldr_install_module(const u8 *data, u32 size)
>  	/* Ensure a stable set of online CPUs for the update process. */
>  	guard(cpus_read_lock)();
>  	set_target_state(MODULE_UPDATE_START + 1);
> -	return stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
> +	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
> +	if (ret)
> +		return ret;
> +
> +	return tdx_module_refresh_version();
>  }

This is one reason I rather dislike guard().

Does tdx_module_refresh_version() need to be guarded by 'cpus_read_lock'?

> +int tdx_module_refresh_version(void)
> +{
> +	struct tdx_sys_info_version *old, new;

Two variable definitions, please. IMNHO pointers and plain variables are
different types. Even if you can, they should not be defined together.

> +	int ret;
> +
> +	/* Shouldn't fail as the update has succeeded. */
> +	ret = get_tdx_sys_info_version(&new);
> +	WARN_ON_ONCE(ret);
> +
> +	old = &tdx_sysinfo.version;
> +	pr_info("version " TDX_VERSION_FMT " -> " TDX_VERSION_FMT "\n",
> +		old->major_version, old->minor_version, old->update_version,
> +		new.major_version, new.minor_version, new.update_version);

Having 'new' and 'old' be different types is really wonky. What's wrong
with:

	struct tdx_sys_info_version old = tdx_sysinfo.version;

?


> +	/* Major/minor versions should not change across updates. */
> +	tdx_sysinfo.version.update_version	= new.update_version;

					   ^ very odd tab

Also, how much of this do you *NEED*? You don't need to print the old
version. You don't really need to _print_ the new version either.

Isn't this arguably all fluff?

If the fluff went away would we even be talking about the funky pointer
versus plain type gunk?

Second, this is all open-coded and completely discrete from the code
that originally sets 'tdx_sysinfo.version'. Can it be unified?


^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Sean Christopherson @ 2026-04-30 19:20 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Ackerley Tng, pbonzini@redhat.com, Yan Y Zhao,
	dave.hansen@linux.intel.com, Sagi Shahar, Isaku Yamahata,
	x86@kernel.org, kas@kernel.org, yilun.xu@linux.intel.com,
	bp@alien8.de, mingo@redhat.com, linux-kernel@vger.kernel.org,
	Kai Huang, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Xiaoyao Li, tglx@kernel.org, binbin.wu@linux.intel.com,
	Vishal Annapurve
In-Reply-To: <231efd4e9267d3dbb8c63e57b3a43567c965e24a.camel@intel.com>

On Thu, Apr 30, 2026, Rick P Edgecombe wrote:
> On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
> > >   {
> > >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> > >   	unsigned long phys, end;
> > > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> > >   	 */
> > >   	mb();
> > >   }
> > > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
> > > 
> > >   void tdx_quirk_reset_page(struct page *page)
> > >   {
> > > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> > >   {
> > >   	struct tdx_module_args args = {};
> > > 
> > > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> > > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
> > 
> > Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
> > I guess in this case since mk_keyed_paddr() is pretty much an internal
> > function, returning u64 also makes sense to indicate that it should only
> > be used to set 64 bit registers.
> 
> Yea, this is used to construct u64 inputs for seamcall args. So I think it
> should keep returning u64s.

+1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
u64s where the spec says it takes a 64-bit value.


^ permalink raw reply

* Re: [PATCH v8 16/21] x86/virt/tdx: Reject updates during concurrent TD build
From: Dave Hansen @ 2026-04-30 19:25 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-17-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> tl;dr: A TDX module erratum can silently corrupt TD measurement state if a
> module update races with TD build. Handle that by rejecting the update,
> instead of introducing new TD-build ioctl failure paths.

The downside of this needs to be discussed. Namely that module updates
can be blocked forever.

> Long Version:
...

This explanation is confusing.

Focus on what the patch *does* and its features and downsides.

*Then* broach the alternatives. But, please, clearly separate out this
patch from other opining.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index de822ed9ef0b..b063aabe2554 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -26,11 +26,18 @@
>  #define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
>  #define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
>  
> +#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
> +
>  /*
>   * TDX module SEAMCALL leaf function error codes
>   */
> -#define TDX_SUCCESS		0ULL
> -#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
> +#define TDX_SUCCESS			0ULL
> +#define TDX_RND_NO_ENTROPY		0x8000020300000000ULL
> +#define TDX_UPDATE_COMPAT_SENSITIVE	0x8000051200000000ULL
> +
> +/* Bit definitions of TDX_FEATURES0 metadata field */
> +#define TDX_FEATURES0_NO_RBP_MOD	BIT_ULL(18)
> +#define TDX_FEATURES0_UPDATE_COMPAT	BIT_ULL(47)

Refactor first. Add new features second.

>  #ifndef __ASSEMBLER__
>  
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index 6ff4672c4181..215c00d76a94 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -4,8 +4,6 @@
>  #ifndef __KVM_X86_TDX_ERRNO_H
>  #define __KVM_X86_TDX_ERRNO_H
>  
> -#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
> -
>  /*
>   * TDX SEAMCALL Status Codes (returned in RAX)
>   */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index a7dfa4ee8813..7864ab68f4e3 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1234,10 +1234,13 @@ static __init int tdx_enable(void)
>  }
>  subsys_initcall(tdx_enable);
>  
> +#define TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE BIT(16)
> +
>  int tdx_module_shutdown(void)
>  {
>  	struct tdx_sys_info_handoff handoff = {};
>  	struct tdx_module_args args = {};
> +	u64 err;
>  	int ret, cpu;
>  
>  	ret = get_tdx_sys_info_handoff(&handoff);
> @@ -1248,9 +1251,26 @@ int tdx_module_shutdown(void)
>  	 * module can produce and most likely supported by newer modules.
>  	 */
>  	args.rcx = handoff.module_hv;
> -	ret = seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
> -	if (ret)
> -		return ret;
> +
> +	/*
> +	 * Mitigate the erratum where updates can break concurrent TD
> +	 * build. Do not pre-check support for this flag. If unsupported,
> +	 * rely on the TDX module to reject shutdown requests.
> +	 */
> +	args.rcx |= TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE;

"Mitigate the erratum..." is a strange way to start this.

This would be a much better format I think:

	/*
	 * This flag will <say what it does> if <triggering event>
	 * happens. That eliminates exposure to a TDX erratum which
	 * can <explain bad things here>.
	 *
	 * This flag is not supported by all TDX modules and may cause
	 * the shutdown (and subsequent update procedure) to fail.
	 */

> +	err = seamcall(TDH_SYS_SHUTDOWN, &args);
> +
> +	/*
> +	 * Return -EBUSY to signal that some ongoing flows are incompatible
> +	 * with updates so that userspace can retry.
> +	 */

	/*
	 * The shutdown ran into a "sensitive" ongoing operation, like
	 * TD build. Signal to userspace that it can retry.
	 */

> +	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_UPDATE_COMPAT_SENSITIVE)
> +		return -EBUSY;
> +	if (err) {
> +		seamcall_err(TDH_SYS_SHUTDOWN, err, &args);
> +		return -EIO;
> +	}

Whitespace between the if()s please.



^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Dave Hansen @ 2026-04-30 19:37 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: Ackerley Tng, pbonzini@redhat.com, Yan Y Zhao,
	dave.hansen@linux.intel.com, Sagi Shahar, Isaku Yamahata,
	x86@kernel.org, kas@kernel.org, yilun.xu@linux.intel.com,
	bp@alien8.de, mingo@redhat.com, linux-kernel@vger.kernel.org,
	Kai Huang, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Xiaoyao Li, tglx@kernel.org, binbin.wu@linux.intel.com,
	Vishal Annapurve
In-Reply-To: <afOrd7JYkUfe7wcZ@google.com>

On 4/30/26 12:20, Sean Christopherson wrote:
>> Yea, this is used to construct u64 inputs for seamcall args. So I think it
>> should keep returning u64s.
> +1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
> u64s where the spec says it takes a 64-bit value.

+2

In this very specific case 'phys_addr_t' is 100% the *WRONG* type for
mk_keyed_paddr(). Why? Because the thing being returned is *NOT* *A*
*PHYSICAL* *ADDRESS*. It's a composite type that contains a special
physical address plus some metadata in a special "hardware" format. It's
as much of a 'phys_addr_t' as a PTE is a 'phys_addr_t'. Yeah, they
contain and are constructed partly from physical addresses, but they are
not physical addresses themselves.

At the same time, if the kernel has a type-safe way of representing
something that's also a 64-bit value, we should use it. Just because the
TDX spec takes a 64-bit virtual address doesn't mean we should use a u64
and not a "struct foo *".

Let's please just not go back to a sea of u64's everywhere. Use common
sense. Use the kernel's type system where appropriate, but don't
over-apply it.

IOW, I agree with Sean. But please don't take Sean's advise too far here.

^ permalink raw reply

* Re: [PATCH v8 09/21] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Dave Hansen @ 2026-04-30 20:03 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-10-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +static struct {
> +	enum module_update_state state;
> +	int thread_ack;

multi_stop_data has an atomic_t for this.

You have an int.

Which one is right?

^ permalink raw reply

* Re: [PATCH v8 17/21] x86/virt/seamldr: Abort updates on failure
From: Dave Hansen @ 2026-04-30 20:06 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-18-chao.gao@intel.com>

I don't like how this is being done.

 1. Introduce this do{}while() loop
 2. Do 20 other patches
 3. Introduce a thing that can make it change
 4. Change the fundamental flow of the loop, to fix #3

I'd much rather have:

 1. Introduce this do{}while() loop
 2. Tweak fundamental flow of the loop from the last patch when I can
    remember it. Allude to future failures.
 3. Do 20 other patches
 4. Introduce a thing that uses #2


> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index c81b26c4bac1..9b8f571eb03f 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -220,6 +220,7 @@ enum module_update_state {
>  static struct {
>  	enum module_update_state state;
>  	int thread_ack;
> +	bool failed;
>  	/*
>  	 * Protect update_data. Raw spinlock as it will be acquired from
>  	 * interrupt-disabled contexts.
> @@ -284,12 +285,15 @@ static int do_seamldr_install_module(void *seamldr_params)
>  				break;
>  			}
>  
> -			ack_state();
> +			if (ret)
> +				WRITE_ONCE(update_data.failed, true);
> +			else
> +				ack_state();
>  		} else {
>  			touch_nmi_watchdog();
>  			rcu_momentary_eqs();
>  		}

I don't like how this is turning out either. I don't like all the nested
conditions or ack_state() that hides its mucking with update data while
its caller mucks with it directly. It's just all hacked together.

Defer all of the acking, and *failed* acking to the ack_state() helper.

Also, I'm kinda peeved that you copied and pasted the  		
touch_nmi_watchdog()/rcu_momentary_eqs() bits and none of the comments.
This is a rather subtle use of both. If you want this to be a normal
"spinning in stop machine" idiom, then create a helper and put the
comments there.

Also, this is a case where:

	do {
		cpu_relax();
		newstate = READ_ONCE(update_data.state);

		if (newstate == curstate) {
			// can cpu_relax() just go in here??
			touch_nmi_watchdog();
			rcu_momentary_eqs();
			continue;
		}
	
		switch() {
			// state changing here
		}
	} while (...);

is a much more sane setup. You're not paying the if() indentation cost
for the entire state transition block. You're also putting the "shut up
the warnings" code out of the way where you can forget about it.


> -	} while (curstate != MODULE_UPDATE_DONE);
> +	} while (curstate != MODULE_UPDATE_DONE && !READ_ONCE(update_data.failed));
>  
>  	return ret;
>  }
> @@ -315,6 +319,7 @@ int seamldr_install_module(const u8 *data, u32 size)
>  
>  	/* Ensure a stable set of online CPUs for the update process. */
>  	guard(cpus_read_lock)();
> +	WRITE_ONCE(update_data.failed, false);
>  	set_target_state(MODULE_UPDATE_START + 1);
>  	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
>  	if (ret)
I kinda wish this 'update_data.failed' set was named. This is trying to
bring 'update_data' into some initial state. Let's _call_ it that.
Honestly, I wouldn't hate if that function just also did the spinlock
init since it's so ugly do to statically.

^ permalink raw reply

* Re: [PATCH v8 18/21] coco/tdx-host: Don't expose P-SEAMLDR features on CPUs with erratum
From: Dave Hansen @ 2026-04-30 20:09 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-19-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
> 
>   SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>   to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
>   SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
>   instruction.
> 
> Clearing the current VMCS behind KVM's back will break KVM.
> 
> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
> bug bit for this erratum and refuse to expose P-SEAMLDR features (e.g.,
> TDX module updates) on affected CPUs.

This seems totally random.

Shouldn't this be way back when can_expose_seamldr() got defined in the
first place?
> +#define X86_BUG_SEAMRET_INVD_VMCS	X86_BUG( 1*32+11) /* "seamret_invd_vmcs" SEAMRET from P-SEAMLDR clears the current VMCS */

I find myself wondering if this is worth a bug bit.


^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-30 21:23 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <a52c4701-c99d-48d5-9b63-8eb1c0e589f0@intel.com>

On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
> > +/*
> > + * Intel TDX module blob. Its format is defined at:
> > + *
> > https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
> 
> Heh, so URLs are not OK in changelogs because they go stale, but they're
> fine in the code?

Hmm, we usually could refer to a document by name. But this is just a txt in a
github repo... Maybe just:

Intel TDX module blob is a format for distributing TDX module updates. It is
parsed by the host before internal bits are passed to the TDX module. For
details, see the latest TDX module update repository.


To me the confusing part in all of this is that there is a format that the host
has to parse and then pass different bits into the TDX module. Usually something
would just take the format it defines. So I think that is probably good context
to have around it.

^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-04-30 21:31 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <e4a12820bdb7fa8c2e97e028fc9a4b51e1baff1e.camel@intel.com>

On 4/30/26 14:23, Edgecombe, Rick P wrote:
> On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
>>> +/*
>>> + * Intel TDX module blob. Its format is defined at:
>>> + *
>>> https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
>> Heh, so URLs are not OK in changelogs because they go stale, but they're
>> fine in the code?
> Hmm, we usually could refer to a document by name. But this is just a txt in a
> github repo... Maybe just:
> 
> Intel TDX module blob is a format for distributing TDX module updates. It is
> parsed by the host before internal bits are passed to the TDX module. For
> details, see the latest TDX module update repository.
> 
> To me the confusing part in all of this is that there is a format that the host
> has to parse and then pass different bits into the TDX module. Usually something
> would just take the format it defines. So I think that is probably good context
> to have around it.

Just say:

/* Intel TDX module update ABI structure. aka. "TDX module blob" */

We don't need to tell folks how to use Google. Just give them the right
keywords to dump in there.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox