Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-05-01  0:58 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y, dave.hansen@linux.intel.com
  Cc: Shahar, Sagi, Yamahata, Isaku, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Huang, Kai, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Li, Xiaoyao, tglx@kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal
In-Reply-To: <231efd4e9267d3dbb8c63e57b3a43567c965e24a.camel@intel.com>

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> > index b24b81cea5ea..e5a37ea2d4a0 100644
>> > --- a/arch/x86/virt/vmx/tdx/tdx.c
>> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> > @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>> >    * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
>> >    * do the conversion explicitly via MOVDIR64B.
>> >    */
>> > -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> > +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>>
>> Could this be updated to use phys_addr_t base and size_t size instead of
>> generic unsigned long?
>
> A type is a really good idea. It could look like a virtual address, despite the
> paddr in the name.
>
> I added it to the cleanup list. But I would prefer to keep this series focused
> on resolving the critical controversy around the struct page/pfn type, rather
> than adding in more things to debate. If you don't mind.
>

No worries, please go ahead :)

>>
>> >   {
>> >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>> >   	unsigned long phys, end;
>> > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> >   	 */
>> >   	mb();
>> >   }
>> > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
>> >
>> >   void tdx_quirk_reset_page(struct page *page)
>> >   {
>> > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>> >   {
>> >   	struct tdx_module_args args = {};
>> >
>> > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>> > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
>>
>> Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
>> I guess in this case since mk_keyed_paddr() is pretty much an internal
>> function, returning u64 also makes sense to indicate that it should only
>> be used to set 64 bit registers.
>
> Yea, this is used to construct u64 inputs for seamcall args. So I think it
> should keep returning u64s. Maybe instead it would be better to have a function
> name pattern that denotes that purpose. We have some more helpers like this
> coming.
>
> But similarly, I'd like to not grow a snowballing cleanup series for this one.

Makes sense, let's go :)

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-05-01  0:57 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson, Rick P Edgecombe
  Cc: pbonzini@redhat.com, Yan Y Zhao, dave.hansen@linux.intel.com,
	Sagi Shahar, Isaku Yamahata, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Kai Huang, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Xiaoyao Li, tglx@kernel.org,
	binbin.wu@linux.intel.com, Vishal Annapurve
In-Reply-To: <3067bbb0-b528-41ae-8739-c49432f9a62c@intel.com>

Dave Hansen <dave.hansen@intel.com> writes:

> On 4/30/26 12:20, Sean Christopherson wrote:
>>> Yea, this is used to construct u64 inputs for seamcall args. So I think it
>>> should keep returning u64s.
>> +1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
>> u64s where the spec says it takes a 64-bit value.
>
> +2
>
> In this very specific case 'phys_addr_t' is 100% the *WRONG* type for
> mk_keyed_paddr(). Why? Because the thing being returned is *NOT* *A*
> *PHYSICAL* *ADDRESS*. It's a composite type that contains a special
> physical address plus some metadata in a special "hardware" format. It's
> as much of a 'phys_addr_t' as a PTE is a 'phys_addr_t'. Yeah, they
> contain and are constructed partly from physical addresses, but they are
> not physical addresses themselves.
>

Got it, thanks!

> At the same time, if the kernel has a type-safe way of representing
> something that's also a 64-bit value, we should use it. Just because the
> TDX spec takes a 64-bit virtual address doesn't mean we should use a u64
> and not a "struct foo *".
>
> Let's please just not go back to a sea of u64's everywhere. Use common
> sense. Use the kernel's type system where appropriate, but don't
> over-apply it.
>
> IOW, I agree with Sean. But please don't take Sean's advise too far here.

^ permalink raw reply

* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-04-30 23:51 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <x7n77snnvvukofo3slopl74c5tmlb2t2un5qy5fq6eb3d2xt7e@6a6tixg5mdfc>

Michael Roth <michael.roth@amd.com> writes:

>
> [...snip...]
>
> I made a super-long-winded reply to that thread, but to summarize:
>
> PRESERVE flag has different enumeration/behavior/enforcement for pre-launch
> vs. post-launch, and similar considerations might come into play for
> other flags, so to make it easier to enumerate what flags are available
> for pre-launch/post-launch, maybe we could have 2 capabilities instead
> of 1:
>
>   KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
>   KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS
>
> where SNP/TDX would only advertise PRESERVE for PRE_LAUNCH, and pKVM I
> guess would enumerate it for both (or maybe just POST_LAUNCH?)
>
> That lets us keep the flags definitions more straightforward but still
> allows userspace to easily enumerate what exactly should be available at
> pre vs. post launch time, and give us some flexibility to detail
> variations in behavior between the 2 phases without documenting
> edge-cases in terms of VM types.
>

Oops Michael I only read this after the meeting today.

Sean, today at guest_memfd biweekly we also discussed this topic. I
brought up this topic because IMO the interface is starting to get
a little awkward, I'm struggling to put the awkwardness into words.

Here are some awkward points:

For PRESERVE, even though it is defined (now) as that what the host
writes will be readable in the guest, it only works for both to-private
and to-shared conversions for KVM_X86_SW_PROTECTED_VMs and pKVM. That's
because guest_memfd doesn't actually invoke encryption during the
conversion. For TDX and SNP, the encryption can only be done before the
VM is finalized, through vendor-specific ioctls that go through
kvm_gmem_populate() to load memory into the guest.

For ZERO, it is defined in api.rst that ZERO is not supported for
to-private conversions, and the rationale there was that when ZEROing,
guest_memfd/KVM can zero, but it's really the contract between the guest
and the vendor trusted firmware whether the guest sees zeros later.

Another awkward point is that ZERO was meant to enable an optimization
for TDX since the firmware zeroes memory, but it actually only zeroes
memory when the page is unmapped from Secure EPTs. guest_memfd (for now)
doesn't track whether the page was unmapped from Secure EPTs as part of
the conversion, so guest_memfd can't assume it was mapped before the
conversion request. To uphold the ZERO contract with userspace,
guest_memfd applies zeroing for TDX anyway.

Summarizing from guest_memfd biweekly today:

David suggested enumerating the combinations, something like
`SHARED_ZERO` and friends (since to-private and ZERO is not supported)
and Michael then brought up the other axis of pre/post launch. IIRC
there might be another axis since pKVM would need to determine
dynamically if a to_shared conversion can be permitted for the range
being converted, based on whether the guest had requested a to_shared
conversion.

I think this might just result in too many flags, and could paint us
into a corner if more options get supported later.


I spent even more time thinking about this today. I get that we want a
consistent contract to userspace, can we scope the contract differently?

What if we scope as "what KVM guarantees the content will look like
after guest_memfd updates attributes"? This is a smaller contract, since
it doesn't promise anything about what the guest sees. Running this
through a few examples:

+ Pre-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. The contents are then ready for populate. What populate does
  to the memory is another contract between SNP and the guest that is
  out of scope of guest_memfd's contract.

+ Post-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. SNP's contract with the guest does not, though. After the page
  gets faulted in, the guest sees scrambled data. This may be a
  meaningless operation now, but it leaves the door open so perhaps we
  could have an SNP-specific ioctl in future where step 1 is to set
  memory attributes within guest_memfd to private and step 2 is to
  encrypt in place.

+ pKVM, to-private, PRESERVE: guest_memfd guarantees that after setting
  memory attributes, the contents of the pages don't change. Separately,
  pKVM doesn't do encryption, so the pKVM guest reads the same contents
  the host wrote. The distinction here from the current state is that
  guest_memfd didn't guarantee that the pKVM guest will see the same
  content the host wrote since that's a separate contract between the
  pKVM guest and pKVM.

+ Post-finalize, TDX, to-shared, ZERO: guest_memfd guarantees that
  contents of the pages will be zeroed in the process of updating
  guest_memfd attributes. Host userspace reads zeros after faulting it
  in, which is because guest_memfd did zero the pages after conversion
  to shared. A future optimization is possible, where guest_memfd only
  zeroes the pages that were unmapped from Secure EPTs, since (this
  version of) TDX zeros memory when unmapping from Secure EPTs.

+ Post-finalize, TDX, to-shared, PRESERVE: -EOPNOTSUPP. guest_memfd is
  unable to guarantee that the process of setting memory attributes will
  not change memory contents. The process of setting memory attributes
  requires unmapping from Secure EPTs, which will zero the memory. (In
  future, if we want to relax this, we could permit this if nothing in
  the requested range was mapped in Secure EPTs)

+ Post-finalize, SNP, to-shared, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. For SNP, unmapping doesn't change memory contents? The guest
  reads garbage, and that's a separate contract between SNP and the
  guest. In the guest_memfd contract, guest_memfd PRESERVEs the memory
  contents in the process of setting memory attributes, and can fulfil
  that.

+ Post-finalize, TDX, to-private, ZERO: guest_memfd zeroes the shared
  memory before updating the attributes to be private, because it
  promised to. If this memory gets faulted in to Secure EPTs, TDX
  firmware zeros it again, because that's TDX's contract with the
  guest. I can't see any benefit to userspace in using this combination,
  but the guest_memfd contract and implementation are simple.

TLDR:

+ PRESERVE == guarantee that the process of setting memory attributes
  doesn't change memory contents.
    + implementation == do nothing in most cases, except -EOPNOTSUPP for
      to-shared on TDX, since unmapping is a required part of setting
      memory attributes to private, and a TDX side effect of unmapping
      is zeroing memory,
+ ZERO == guarantee that the process of setting memory attributes zeroes
  memory contents.
    + implementation == memset(zero) in most cases. For TDX, a future
      optimization exists, where memset() can be skipped for pages that
      were mapped in Secure EPTs before conversion
+ UNSPECIFIED == no guarantees
    + implementation == guest_memfd does nothing explicitly about memory
      contents. The implementation is pretty much the same as PRESERVE
      except guest_memfd won't take into account vendor-specific side
      effects of the process of conversion. Except for the test vehicle
      KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.

>>
>> [...snip...]
>>
>
> Looking at the example you have there:
>
>   + Note: These content modes apply to the entire requested range, not
>   + just the parts of the range that underwent conversion. For example, if
>   + this was the initial state:
>   +
>   +   * [0x0000, 0x1000): shared
>   +   * [0x1000, 0x2000): private
>   +   * [0x2000, 0x3000): shared
>   + and range [0x0000, 0x3000) was set to shared, the content mode would
>   + apply to all memory in [0x0000, 0x3000), not just the range that
>   + underwent conversion [0x1000, 0x2000).
>
> Userspace would be aware of whether the range contains pages that were
> already set to private, so if it really wants to set the just the
> [0x1000, 0x2000) range to shared with appropriate content mode, it is
> fully able to do so by just issuing the ioctl for that specific range.
> If it attempts to issue it for the entire range, it only seems like it
> would defy normal expectations and cause confusion to skip ranges, and
> I'm not sure it gains us anything useful in exchange for that potential
> confusion.
>

Great that we're aligned here :) No complaints from guest_memfd biweekly
today as well :)

>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Dave Hansen @ 2026-04-30 22:29 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <ab11b3c922c93db045c4c0dfa8f191eb6fdb0b9f.camel@intel.com>

On 4/30/26 14:48, Edgecombe, Rick P wrote:
> On Thu, 2026-04-30 at 12:00 -0700, Dave Hansen wrote:
>> On 4/27/26 08:28, Chao Gao wrote:
>>> +			case MODULE_UPDATE_CPU_INSTALL:
>>> +				ret = seamldr_install(seamldr_params);
>>> +				break;
>> I really despise this naming. Could you please clarify with comments?
>>
>> This reads like it is installing a seamldr, not telling a seamldr to
>> perform an install.
> Yea it does read that way. We kind of have a convention around the seamcall
> wrappers matching the tdx docs name. tdh_foo_bar() is pretty clear.
> Unfortunately the seamldr calls are defined like SEAMLDR.INSTALL.

The naming convention is fine, really. It does make thing easier to look
up in the documentation.

In this case, just comment it, please. It's only a single site.

^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Edgecombe, Rick P @ 2026-04-30 21:48 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <f92d114c-c0f1-4eaa-a828-a4d9ac4054d4@intel.com>

On Thu, 2026-04-30 at 12:00 -0700, Dave Hansen wrote:
> On 4/27/26 08:28, Chao Gao wrote:
> > +			case MODULE_UPDATE_CPU_INSTALL:
> > +				ret = seamldr_install(seamldr_params);
> > +				break;
> 
> I really despise this naming. Could you please clarify with comments?
> 
> This reads like it is installing a seamldr, not telling a seamldr to
> perform an install.

Yea it does read that way. We kind of have a convention around the seamcall
wrappers matching the tdx docs name. tdh_foo_bar() is pretty clear.
Unfortunately the seamldr calls are defined like SEAMLDR.INSTALL.

We don't need to make matching the wrappers name match the tdx docs be a rule. I
just considered trying to make a seamldr name schema to make it clear that these
are seamldr call names. But it is probably better to just special case this one
with a clearer name we give it, like seamldr_tdx_module_install(). Which is
slightly long, but both clear on the purpose and that it is in the family of
seamldr_() functions.

^ permalink raw reply

* Re: [PATCH v8 15/21] x86/virt/tdx: Refresh TDX module version after update
From: Edgecombe, Rick P @ 2026-04-30 21:35 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <28be5180-ff74-4e4d-b392-7ba7a9b4c1c0@intel.com>

On Thu, 2026-04-30 at 12:14 -0700, Dave Hansen wrote:
> > Do not refresh the rest of tdx_sysinfo. Refreshing them at runtime could
> > disrupt running software that relies on the previously reported values.
> 
> This needs more explanation. I think the reasoning is quite nuanced.
> 
> tdx_sysinfo is just a cache of what the TDX module is and can do. If
> that changes, it means the TDX module changed. So you somehow need to
> argue why it's OK to hide those changes from the tdx_sysinfo users.
> 
> Why would they be confused by tdx_sysinfo changes but not by the TDX
> module *itself* changing?

We shouldn't refresh them because they don't change, right Chao? It's not
because we are hiding things?

> 
> > Note that major and minor versions are not refreshed because runtime updates
> > are supported only between releases with identical major and minor versions.
> 
> I'd rather have this in code than a changelog comment.
> 
> If they can't change then warn if they do.


^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-04-30 21:31 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <e4a12820bdb7fa8c2e97e028fc9a4b51e1baff1e.camel@intel.com>

On 4/30/26 14:23, Edgecombe, Rick P wrote:
> On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
>>> +/*
>>> + * Intel TDX module blob. Its format is defined at:
>>> + *
>>> https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
>> Heh, so URLs are not OK in changelogs because they go stale, but they're
>> fine in the code?
> Hmm, we usually could refer to a document by name. But this is just a txt in a
> github repo... Maybe just:
> 
> Intel TDX module blob is a format for distributing TDX module updates. It is
> parsed by the host before internal bits are passed to the TDX module. For
> details, see the latest TDX module update repository.
> 
> To me the confusing part in all of this is that there is a format that the host
> has to parse and then pass different bits into the TDX module. Usually something
> would just take the format it defines. So I think that is probably good context
> to have around it.

Just say:

/* Intel TDX module update ABI structure. aka. "TDX module blob" */

We don't need to tell folks how to use Google. Just give them the right
keywords to dump in there.

^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-30 21:23 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <a52c4701-c99d-48d5-9b63-8eb1c0e589f0@intel.com>

On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
> > +/*
> > + * Intel TDX module blob. Its format is defined at:
> > + *
> > https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
> 
> Heh, so URLs are not OK in changelogs because they go stale, but they're
> fine in the code?

Hmm, we usually could refer to a document by name. But this is just a txt in a
github repo... Maybe just:

Intel TDX module blob is a format for distributing TDX module updates. It is
parsed by the host before internal bits are passed to the TDX module. For
details, see the latest TDX module update repository.


To me the confusing part in all of this is that there is a format that the host
has to parse and then pass different bits into the TDX module. Usually something
would just take the format it defines. So I think that is probably good context
to have around it.

^ permalink raw reply

* Re: [PATCH v8 18/21] coco/tdx-host: Don't expose P-SEAMLDR features on CPUs with erratum
From: Dave Hansen @ 2026-04-30 20:09 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-19-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
> 
>   SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>   to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
>   SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
>   instruction.
> 
> Clearing the current VMCS behind KVM's back will break KVM.
> 
> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
> bug bit for this erratum and refuse to expose P-SEAMLDR features (e.g.,
> TDX module updates) on affected CPUs.

This seems totally random.

Shouldn't this be way back when can_expose_seamldr() got defined in the
first place?
> +#define X86_BUG_SEAMRET_INVD_VMCS	X86_BUG( 1*32+11) /* "seamret_invd_vmcs" SEAMRET from P-SEAMLDR clears the current VMCS */

I find myself wondering if this is worth a bug bit.


^ permalink raw reply

* Re: [PATCH v8 17/21] x86/virt/seamldr: Abort updates on failure
From: Dave Hansen @ 2026-04-30 20:06 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-18-chao.gao@intel.com>

I don't like how this is being done.

 1. Introduce this do{}while() loop
 2. Do 20 other patches
 3. Introduce a thing that can make it change
 4. Change the fundamental flow of the loop, to fix #3

I'd much rather have:

 1. Introduce this do{}while() loop
 2. Tweak fundamental flow of the loop from the last patch when I can
    remember it. Allude to future failures.
 3. Do 20 other patches
 4. Introduce a thing that uses #2


> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index c81b26c4bac1..9b8f571eb03f 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -220,6 +220,7 @@ enum module_update_state {
>  static struct {
>  	enum module_update_state state;
>  	int thread_ack;
> +	bool failed;
>  	/*
>  	 * Protect update_data. Raw spinlock as it will be acquired from
>  	 * interrupt-disabled contexts.
> @@ -284,12 +285,15 @@ static int do_seamldr_install_module(void *seamldr_params)
>  				break;
>  			}
>  
> -			ack_state();
> +			if (ret)
> +				WRITE_ONCE(update_data.failed, true);
> +			else
> +				ack_state();
>  		} else {
>  			touch_nmi_watchdog();
>  			rcu_momentary_eqs();
>  		}

I don't like how this is turning out either. I don't like all the nested
conditions or ack_state() that hides its mucking with update data while
its caller mucks with it directly. It's just all hacked together.

Defer all of the acking, and *failed* acking to the ack_state() helper.

Also, I'm kinda peeved that you copied and pasted the  		
touch_nmi_watchdog()/rcu_momentary_eqs() bits and none of the comments.
This is a rather subtle use of both. If you want this to be a normal
"spinning in stop machine" idiom, then create a helper and put the
comments there.

Also, this is a case where:

	do {
		cpu_relax();
		newstate = READ_ONCE(update_data.state);

		if (newstate == curstate) {
			// can cpu_relax() just go in here??
			touch_nmi_watchdog();
			rcu_momentary_eqs();
			continue;
		}
	
		switch() {
			// state changing here
		}
	} while (...);

is a much more sane setup. You're not paying the if() indentation cost
for the entire state transition block. You're also putting the "shut up
the warnings" code out of the way where you can forget about it.


> -	} while (curstate != MODULE_UPDATE_DONE);
> +	} while (curstate != MODULE_UPDATE_DONE && !READ_ONCE(update_data.failed));
>  
>  	return ret;
>  }
> @@ -315,6 +319,7 @@ int seamldr_install_module(const u8 *data, u32 size)
>  
>  	/* Ensure a stable set of online CPUs for the update process. */
>  	guard(cpus_read_lock)();
> +	WRITE_ONCE(update_data.failed, false);
>  	set_target_state(MODULE_UPDATE_START + 1);
>  	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
>  	if (ret)
I kinda wish this 'update_data.failed' set was named. This is trying to
bring 'update_data' into some initial state. Let's _call_ it that.
Honestly, I wouldn't hate if that function just also did the spinlock
init since it's so ugly do to statically.

^ permalink raw reply

* Re: [PATCH v8 09/21] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Dave Hansen @ 2026-04-30 20:03 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-10-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +static struct {
> +	enum module_update_state state;
> +	int thread_ack;

multi_stop_data has an atomic_t for this.

You have an int.

Which one is right?

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Dave Hansen @ 2026-04-30 19:37 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: Ackerley Tng, pbonzini@redhat.com, Yan Y Zhao,
	dave.hansen@linux.intel.com, Sagi Shahar, Isaku Yamahata,
	x86@kernel.org, kas@kernel.org, yilun.xu@linux.intel.com,
	bp@alien8.de, mingo@redhat.com, linux-kernel@vger.kernel.org,
	Kai Huang, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Xiaoyao Li, tglx@kernel.org, binbin.wu@linux.intel.com,
	Vishal Annapurve
In-Reply-To: <afOrd7JYkUfe7wcZ@google.com>

On 4/30/26 12:20, Sean Christopherson wrote:
>> Yea, this is used to construct u64 inputs for seamcall args. So I think it
>> should keep returning u64s.
> +1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
> u64s where the spec says it takes a 64-bit value.

+2

In this very specific case 'phys_addr_t' is 100% the *WRONG* type for
mk_keyed_paddr(). Why? Because the thing being returned is *NOT* *A*
*PHYSICAL* *ADDRESS*. It's a composite type that contains a special
physical address plus some metadata in a special "hardware" format. It's
as much of a 'phys_addr_t' as a PTE is a 'phys_addr_t'. Yeah, they
contain and are constructed partly from physical addresses, but they are
not physical addresses themselves.

At the same time, if the kernel has a type-safe way of representing
something that's also a 64-bit value, we should use it. Just because the
TDX spec takes a 64-bit virtual address doesn't mean we should use a u64
and not a "struct foo *".

Let's please just not go back to a sea of u64's everywhere. Use common
sense. Use the kernel's type system where appropriate, but don't
over-apply it.

IOW, I agree with Sean. But please don't take Sean's advise too far here.

^ permalink raw reply

* Re: [PATCH v8 16/21] x86/virt/tdx: Reject updates during concurrent TD build
From: Dave Hansen @ 2026-04-30 19:25 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-17-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> tl;dr: A TDX module erratum can silently corrupt TD measurement state if a
> module update races with TD build. Handle that by rejecting the update,
> instead of introducing new TD-build ioctl failure paths.

The downside of this needs to be discussed. Namely that module updates
can be blocked forever.

> Long Version:
...

This explanation is confusing.

Focus on what the patch *does* and its features and downsides.

*Then* broach the alternatives. But, please, clearly separate out this
patch from other opining.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index de822ed9ef0b..b063aabe2554 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -26,11 +26,18 @@
>  #define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
>  #define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
>  
> +#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
> +
>  /*
>   * TDX module SEAMCALL leaf function error codes
>   */
> -#define TDX_SUCCESS		0ULL
> -#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
> +#define TDX_SUCCESS			0ULL
> +#define TDX_RND_NO_ENTROPY		0x8000020300000000ULL
> +#define TDX_UPDATE_COMPAT_SENSITIVE	0x8000051200000000ULL
> +
> +/* Bit definitions of TDX_FEATURES0 metadata field */
> +#define TDX_FEATURES0_NO_RBP_MOD	BIT_ULL(18)
> +#define TDX_FEATURES0_UPDATE_COMPAT	BIT_ULL(47)

Refactor first. Add new features second.

>  #ifndef __ASSEMBLER__
>  
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index 6ff4672c4181..215c00d76a94 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -4,8 +4,6 @@
>  #ifndef __KVM_X86_TDX_ERRNO_H
>  #define __KVM_X86_TDX_ERRNO_H
>  
> -#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
> -
>  /*
>   * TDX SEAMCALL Status Codes (returned in RAX)
>   */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index a7dfa4ee8813..7864ab68f4e3 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1234,10 +1234,13 @@ static __init int tdx_enable(void)
>  }
>  subsys_initcall(tdx_enable);
>  
> +#define TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE BIT(16)
> +
>  int tdx_module_shutdown(void)
>  {
>  	struct tdx_sys_info_handoff handoff = {};
>  	struct tdx_module_args args = {};
> +	u64 err;
>  	int ret, cpu;
>  
>  	ret = get_tdx_sys_info_handoff(&handoff);
> @@ -1248,9 +1251,26 @@ int tdx_module_shutdown(void)
>  	 * module can produce and most likely supported by newer modules.
>  	 */
>  	args.rcx = handoff.module_hv;
> -	ret = seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
> -	if (ret)
> -		return ret;
> +
> +	/*
> +	 * Mitigate the erratum where updates can break concurrent TD
> +	 * build. Do not pre-check support for this flag. If unsupported,
> +	 * rely on the TDX module to reject shutdown requests.
> +	 */
> +	args.rcx |= TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE;

"Mitigate the erratum..." is a strange way to start this.

This would be a much better format I think:

	/*
	 * This flag will <say what it does> if <triggering event>
	 * happens. That eliminates exposure to a TDX erratum which
	 * can <explain bad things here>.
	 *
	 * This flag is not supported by all TDX modules and may cause
	 * the shutdown (and subsequent update procedure) to fail.
	 */

> +	err = seamcall(TDH_SYS_SHUTDOWN, &args);
> +
> +	/*
> +	 * Return -EBUSY to signal that some ongoing flows are incompatible
> +	 * with updates so that userspace can retry.
> +	 */

	/*
	 * The shutdown ran into a "sensitive" ongoing operation, like
	 * TD build. Signal to userspace that it can retry.
	 */

> +	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_UPDATE_COMPAT_SENSITIVE)
> +		return -EBUSY;
> +	if (err) {
> +		seamcall_err(TDH_SYS_SHUTDOWN, err, &args);
> +		return -EIO;
> +	}

Whitespace between the if()s please.



^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Sean Christopherson @ 2026-04-30 19:20 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Ackerley Tng, pbonzini@redhat.com, Yan Y Zhao,
	dave.hansen@linux.intel.com, Sagi Shahar, Isaku Yamahata,
	x86@kernel.org, kas@kernel.org, yilun.xu@linux.intel.com,
	bp@alien8.de, mingo@redhat.com, linux-kernel@vger.kernel.org,
	Kai Huang, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Xiaoyao Li, tglx@kernel.org, binbin.wu@linux.intel.com,
	Vishal Annapurve
In-Reply-To: <231efd4e9267d3dbb8c63e57b3a43567c965e24a.camel@intel.com>

On Thu, Apr 30, 2026, Rick P Edgecombe wrote:
> On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
> > >   {
> > >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> > >   	unsigned long phys, end;
> > > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> > >   	 */
> > >   	mb();
> > >   }
> > > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
> > > 
> > >   void tdx_quirk_reset_page(struct page *page)
> > >   {
> > > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> > >   {
> > >   	struct tdx_module_args args = {};
> > > 
> > > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> > > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
> > 
> > Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
> > I guess in this case since mk_keyed_paddr() is pretty much an internal
> > function, returning u64 also makes sense to indicate that it should only
> > be used to set 64 bit registers.
> 
> Yea, this is used to construct u64 inputs for seamcall args. So I think it
> should keep returning u64s.

+1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
u64s where the spec says it takes a 64-bit value.


^ permalink raw reply

* Re: [PATCH v8 15/21] x86/virt/tdx: Refresh TDX module version after update
From: Dave Hansen @ 2026-04-30 19:14 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-16-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> The kernel exposes the TDX module version through sysfs so userspace can
> check update compatibility. That information needs to remain accurate
> across runtime updates.
> 
> A runtime update may change the module's update_version, so refresh the
> cached version after a successful update and emit a log message to show
> the version change.
> 
> Drop __ro_after_init from tdx_sysinfo because it is now updated at runtime.
> 
> Perform the refresh outside of stop_machine() since printk() within
> stop_machine() would add significant latency.
> 
> Do not refresh the rest of tdx_sysinfo. Refreshing them at runtime could
> disrupt running software that relies on the previously reported values.

This needs more explanation. I think the reasoning is quite nuanced.

tdx_sysinfo is just a cache of what the TDX module is and can do. If
that changes, it means the TDX module changed. So you somehow need to
argue why it's OK to hide those changes from the tdx_sysinfo users.

Why would they be confused by tdx_sysinfo changes but not by the TDX
module *itself* changing?

> Note that major and minor versions are not refreshed because runtime updates
> are supported only between releases with identical major and minor versions.

I'd rather have this in code than a changelog comment.

If they can't change then warn if they do.

> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 98a8d9d3ae25..c81b26c4bac1 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -306,6 +306,8 @@ DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
>   */
>  int seamldr_install_module(const u8 *data, u32 size)
>  {
> +	int ret;
> +
>  	struct seamldr_params *params __free(free_seamldr_params) =
>  						init_seamldr_params(data, size);
>  	if (IS_ERR(params))
> @@ -314,6 +316,10 @@ int seamldr_install_module(const u8 *data, u32 size)
>  	/* Ensure a stable set of online CPUs for the update process. */
>  	guard(cpus_read_lock)();
>  	set_target_state(MODULE_UPDATE_START + 1);
> -	return stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
> +	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
> +	if (ret)
> +		return ret;
> +
> +	return tdx_module_refresh_version();
>  }

This is one reason I rather dislike guard().

Does tdx_module_refresh_version() need to be guarded by 'cpus_read_lock'?

> +int tdx_module_refresh_version(void)
> +{
> +	struct tdx_sys_info_version *old, new;

Two variable definitions, please. IMNHO pointers and plain variables are
different types. Even if you can, they should not be defined together.

> +	int ret;
> +
> +	/* Shouldn't fail as the update has succeeded. */
> +	ret = get_tdx_sys_info_version(&new);
> +	WARN_ON_ONCE(ret);
> +
> +	old = &tdx_sysinfo.version;
> +	pr_info("version " TDX_VERSION_FMT " -> " TDX_VERSION_FMT "\n",
> +		old->major_version, old->minor_version, old->update_version,
> +		new.major_version, new.minor_version, new.update_version);

Having 'new' and 'old' be different types is really wonky. What's wrong
with:

	struct tdx_sys_info_version old = tdx_sysinfo.version;

?


> +	/* Major/minor versions should not change across updates. */
> +	tdx_sysinfo.version.update_version	= new.update_version;

					   ^ very odd tab

Also, how much of this do you *NEED*? You don't need to print the old
version. You don't really need to _print_ the new version either.

Isn't this arguably all fluff?

If the fluff went away would we even be talking about the funky pointer
versus plain type gunk?

Second, this is all open-coded and completely discrete from the code
that originally sets 'tdx_sysinfo.version'. Can it be unified?


^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Dave Hansen @ 2026-04-30 19:00 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-13-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +			case MODULE_UPDATE_CPU_INSTALL:
> +				ret = seamldr_install(seamldr_params);
> +				break;

I really despise this naming. Could you please clarify with comments?

This reads like it is installing a seamldr, not telling a seamldr to
perform an install.

^ permalink raw reply

* Re: [PATCH v8 11/21] x86/virt/tdx: Reset software states during TDX module shutdown
From: Dave Hansen @ 2026-04-30 18:58 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-12-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +	/*
> +	 * Clear global and per-CPU initialization flags so the new module
> +	 * can be fully re-initialized after a successful update.
> +	 *
> +	 * No locks needed as no concurrent accesses can occur here.
> +	 */
> +	tdx_module_initialized = false;
> +	sysinit_done = false;
> +	sysinit_ret = 0;
> +	for_each_possible_cpu(cpu)
> +		per_cpu(tdx_lp_initialized, cpu) = false;

This speaks to needing refactoring.

If there's global TDX state, it needs to be in a global TDX state
structure, not scattered across random global variables.

Imagine the structure is:

struct tdx_module_config foo;

That's 0's at boot. You have to init the TDX module to bring it out of
0's to valid state. It actually means something if you do:

	memset(&foo, 0, sizeof(foo));

Because it takes it right back to its bss state. That ^ is also handy
because it naturally just works if new state gets added.

Guess what will happen the next time someone adds:

	int sysinit_new_fancy_thing;

Someone will forget to add it to this code. Then the module gets updated
and breaks things in fun ways.

^ permalink raw reply

* Re: [PATCH v8 10/21] x86/virt/seamldr: Shut down the current TDX module
From: Dave Hansen @ 2026-04-30 18:52 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-11-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
>  static int do_seamldr_install_module(void *seamldr_params)
>  {
>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
> +	int cpu = smp_processor_id();
> +	bool primary;
>  	int ret = 0;
>  
> +	primary = cpumask_first(cpu_online_mask) == cpu;

Isn't cpumask_first(cpu_online_mask)==0, always? I thought CPU 0 could
never be offlined.

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Edgecombe, Rick P @ 2026-04-30 18:38 UTC (permalink / raw)
  To: Tng, Ackerley, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y, dave.hansen@linux.intel.com
  Cc: Shahar, Sagi, Yamahata, Isaku, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Huang, Kai, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Li, Xiaoyao, tglx@kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal
In-Reply-To: <CAEvNRgGsMaioFvTWaLhnDiLbdx3tunZpo3Qzxdm1xeb4ef0J9Q@mail.gmail.com>

On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > 
> > [...snip...]
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index b24b81cea5ea..e5a37ea2d4a0 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> >    * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
> >    * do the conversion explicitly via MOVDIR64B.
> >    */
> > -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> > +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> 
> Could this be updated to use phys_addr_t base and size_t size instead of
> generic unsigned long?

A type is a really good idea. It could look like a virtual address, despite the
paddr in the name.

I added it to the cleanup list. But I would prefer to keep this series focused
on resolving the critical controversy around the struct page/pfn type, rather
than adding in more things to debate. If you don't mind.

> 
> >   {
> >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> >   	unsigned long phys, end;
> > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> >   	 */
> >   	mb();
> >   }
> > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
> > 
> >   void tdx_quirk_reset_page(struct page *page)
> >   {
> > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> >   {
> >   	struct tdx_module_args args = {};
> > 
> > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
> 
> Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
> I guess in this case since mk_keyed_paddr() is pretty much an internal
> function, returning u64 also makes sense to indicate that it should only
> be used to set 64 bit registers.

Yea, this is used to construct u64 inputs for seamcall args. So I think it
should keep returning u64s. Maybe instead it would be better to have a function
name pattern that denotes that purpose. We have some more helpers like this
coming.

But similarly, I'd like to not grow a snowballing cleanup series for this one.

^ permalink raw reply

* Re: [PATCH v2 3/4] x86/tdx: Drop exported function tdx_quirk_reset_page()
From: Ackerley Tng @ 2026-04-30 18:29 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430015001.24242-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
> S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
> and thus always have struct page.
>
> However, it's also reasonable for KVM to reset those TDX control pages via
> tdx_quirk_reset_paddr() directly, eliminating the need to export two
> parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
> header file is also unnecessary.
>
> No functional change intended.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

Thanks for the cleanup!

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-04-30 18:17 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014948.24226-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b24b81cea5ea..e5a37ea2d4a0 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>   * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
>   * do the conversion explicitly via MOVDIR64B.
>   */
> -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)

Could this be updated to use phys_addr_t base and size_t size instead of
generic unsigned long?

>  {
>  	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>  	unsigned long phys, end;
> @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>  	 */
>  	mb();
>  }
> +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
>
>  void tdx_quirk_reset_page(struct page *page)
>  {
> @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>  {
>  	struct tdx_module_args args = {};
>
> -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));

Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
I guess in this case since mk_keyed_paddr() is pretty much an internal
function, returning u64 also makes sense to indicate that it should only
be used to set 64 bit registers.

>
>  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>  }
>  EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_tdr);
>
> -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, kvm_pfn_t pfn)
>  {
>  	struct tdx_module_args args = {};
>
> -	args.rcx = mk_keyed_paddr(hkid, page);
> +	args.rcx = mk_keyed_paddr(hkid, pfn);
>
>  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>  }
> --
> 2.43.2

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v2 1/4] x86/tdx: Use PFN directly for mapping guest private memory
From: Ackerley Tng @ 2026-04-30 17:43 UTC (permalink / raw)
  To: Yan Zhao, dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, sagis,
	binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014929.24210-1-yan.y.zhao@intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
> mapping guest private memory and have them take PFN directly.
>

Thanks for this :)

> Having core TDX make assumptions that guest private memory must be backed
> by struct page (and/or folio) will create subtle dependencies on how
> KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
> allocated from core MM, if the memory is refcounted, or if the folio is
> split) that are easily avoided. [1].
>
> KVM's MMUs work with PFNs. This is very much an intentional design choice.
> It ensures that the KVM MMUs remain flexible and are not too tied to the
> regular CPU MMUs and the kernel code around them. Using 'struct page' for
> TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
>
> Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
> since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
> only.
>
> [ Yan: Replace "u64 pfn" with "kvm_pfn_t pfn" ]
>

Thanks for your help, Yan!

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
> Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
>
> [...snip...]
>

^ permalink raw reply

* [Invitation] bi-weekly guest_memfd upstream call on 2026-04-30 (today!)
From: David Hildenbrand (Arm) @ 2026-04-30 10:47 UTC (permalink / raw)
  To: linux-coco@lists.linux.dev, linux-mm@kvack.org, KVM

Hi,

Late reminder :(

Our next guest_memfd upstream call is scheduled for today, Thursday,
2026-04-30 at 8:00 - 9:00am (GMT-07:00) Pacific Time - Vancouver.

We'll be using the following Google meet:
http://meet.google.com/wxp-wtju-jzw

The meeting notes can be found at [1], where we also link recordings and
collect current guest_memfd upstream proposals. If you want an google
calendar invitation that also covers all future meetings, just write me
or Ackerley a mail.

In this meeting, we'll talk about progress on the in-place conversion front.

To put something to discuss onto the agenda, reply to this mail or add
them to the "Topics/questions for next meeting(s)" section in the
meeting notes as a comment.

[1]
https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?usp=sharing

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH kernel 4/9] dma/swiotlb: Stop forcing SWIOTLB for TDISP devices
From: Alexey Kardashevskiy @ 2026-04-30  3:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Robin Murphy, x86, linux-kernel, kvm, linux-pci,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Sean Christopherson, Paolo Bonzini,
	Andy Lutomirski, Peter Zijlstra, Bjorn Helgaas, Marek Szyprowski,
	Andrew Morton, Catalin Marinas, Michael Ellerman, Mike Rapoport,
	Tom Lendacky, Ard Biesheuvel, Ashish Kalra, Stefano Garzarella,
	Melody Wang, Seongman Lee, Joerg Roedel, Nikunj A Dadhania,
	Michael Roth, Suravee Suthikulpanit, Andi Kleen,
	Kuppuswamy Sathyanarayanan, Tony Luck, David Woodhouse,
	Greg Kroah-Hartman, Denis Efremov, Geliang Tang, Piotr Gregor,
	Michael S. Tsirkin, Alex Williamson, Arnd Bergmann, Jesse Barnes,
	Jacob Pan, Yinghai Lu, Kevin Brodsky, Jonathan Cameron,
	Aneesh Kumar K.V (Arm), Xu Yilun, Herbert Xu, Kim Phillips,
	Konrad Rzeszutek Wilk, Stefano Stabellini, Claire Chang,
	linux-coco, iommu
In-Reply-To: <20260420235046.GA3199414@nvidia.com>



On 21/4/26 09:50, Jason Gunthorpe wrote:
> On Wed, Apr 15, 2026 at 04:32:14PM +1000, Alexey Kardashevskiy wrote:
>>>> So the DMA API should see the DMA_ATTR_CC_DECRYPTED and setup the
>>>> correct dma_dddr_t either by choosing the shared alias for the TDISP
>>>> device's vTOM, or setting the C bit in a vIOMMU S1.
>>>
>>> Something like that?
>>>
>>> https://github.com/AMDESE/linux-kvm/commit/266a41a1ea746557eb63debce886ce2c98820667
>>>
>>> With some little hacks I can make this tree do TDISP DMA to private or shared (swiotlb) memory by steering via this vTOM thing. Thanks,
>>
>> Ping? Thanks,
> 
> That seems approx right, it is broadl similar to what ARM is
> doing.. But the address map changes when switching to T=1 for AMD?

Well, at the moment, in my WIP tree, I map the guest memory in the host IOMMU twice -

1) from 0 offset (and this is shared when T=0 and private when T=1) and
2) from vTOM offset (say, 1TB) - and this half of the mapping is always shared.

Thanks,


-- 
Alexey


^ permalink raw reply

* [PATCH v2 4/4] x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users
From: Yan Zhao @ 2026-04-30  1:50 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc
  Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
	kai.huang, rick.p.edgecombe, yan.y.zhao, yilun.xu, vannapurve,
	ackerleytng, sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>

Move mk_keyed_paddr() from tdx.h to tdx.c to avoid unnecessary header
inclusion and improve encapsulation since there are no users outside of
tdx.c.

No functional change intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/tdx.h  | 6 ------
 arch/x86/virt/vmx/tdx/tdx.c | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9c63deaa0e8f..503f9a3f46d6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -177,12 +177,6 @@ struct tdx_vp {
 	struct page **tdcx_pages;
 };
 
-static inline u64 mk_keyed_paddr(u16 hkid, kvm_pfn_t pfn)
-{
-	/* KeyID bits are just above the physical address bits. */
-	return PFN_PHYS(pfn) | ((u64)hkid << boot_cpu_data.x86_phys_bits);
-}
-
 u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
 u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, kvm_pfn_t pfn, struct page *source,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index deb67e68f85f..967482ae3c80 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1911,6 +1911,12 @@ u64 tdh_phymem_cache_wb(bool resume)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_phymem_cache_wb);
 
+static inline u64 mk_keyed_paddr(u16 hkid, kvm_pfn_t pfn)
+{
+	/* KeyID bits are just above the physical address bits. */
+	return PFN_PHYS(pfn) | ((u64)hkid << boot_cpu_data.x86_phys_bits);
+}
+
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
 {
 	struct tdx_module_args args = {};
-- 
2.43.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox