Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Jürgen Groß @ 2026-05-28 13:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <ahg-bEiwyqYTdWOD@google.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 832 bytes --]

On 28.05.26 15:09, Sean Christopherson wrote:
> On Thu, May 28, 2026, Juergen Gross wrote:
>> Please disregard this series, there is one complication sashiko made me
>> aware of.
> 
> Sashiko beat me to the punch. :-)
> 
> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
> for a real world example of how things can and will go wrong.

Yeah, with Sashiko's pointer it was easy to spot.

Question now is whether the already existing cases of -errno passed as return
value are wrong or on purpose. If the latter, there should be a comment for
that, otherwise they need to be fixed..

Disentangling the MSR emulation return values from the "normal" ones ("return
to guest"/"return to user mode") will be quite interesting with the overloaded
semantics of "1".


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Sean Christopherson @ 2026-05-28 13:21 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <476964db-837a-45d8-a647-c5a2ea8b0cb6@suse.com>

On Thu, May 28, 2026, Jürgen Groß wrote:
> On 28.05.26 15:09, Sean Christopherson wrote:
> > On Thu, May 28, 2026, Juergen Gross wrote:
> > > Please disregard this series, there is one complication sashiko made me
> > > aware of.
> > 
> > Sashiko beat me to the punch. :-)
> > 
> > See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
> > for a real world example of how things can and will go wrong.
> 
> Yeah, with Sashiko's pointer it was easy to spot.
> 
> Question now is whether the already existing cases of -errno passed as return
> value are wrong or on purpose. 

What are the existing cases?

> If the latter, there should be a comment for
> that, otherwise they need to be fixed..
> 
> Disentangling the MSR emulation return values from the "normal" ones ("return
> to guest"/"return to user mode") will be quite interesting with the overloaded
> semantics of "1".

LOL, "interesting".

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Jürgen Groß @ 2026-05-28 14:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <ahhBQDvyzHQfTCBD@google.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 834 bytes --]

On 28.05.26 15:21, Sean Christopherson wrote:
> On Thu, May 28, 2026, Jürgen Groß wrote:
>> On 28.05.26 15:09, Sean Christopherson wrote:
>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>> Please disregard this series, there is one complication sashiko made me
>>>> aware of.
>>>
>>> Sashiko beat me to the punch. :-)
>>>
>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
>>> for a real world example of how things can and will go wrong.
>>
>> Yeah, with Sashiko's pointer it was easy to spot.
>>
>> Question now is whether the already existing cases of -errno passed as return
>> value are wrong or on purpose.
> 
> What are the existing cases?

Have a look at:

kvm_hv_msr_get_crash_data()
kvm_hv_msr_set_crash_data()
svm_get_msr()
svm_set_msr()


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Jürgen Groß @ 2026-05-28 14:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <ahhBQDvyzHQfTCBD@google.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 814 bytes --]

On 28.05.26 15:21, Sean Christopherson wrote:
> On Thu, May 28, 2026, Jürgen Groß wrote:
>> On 28.05.26 15:09, Sean Christopherson wrote:
>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>> Please disregard this series, there is one complication sashiko made me
>>>> aware of.
>>>
>>> Sashiko beat me to the punch. :-)
>>>
>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
>>> for a real world example of how things can and will go wrong.
>>
>> Yeah, with Sashiko's pointer it was easy to spot.
>>
>> Question now is whether the already existing cases of -errno passed as return
>> value are wrong or on purpose.
> 
> What are the existing cases?

Found another one:

kvm_xen_write_hypercall_page() (called by kvm_set_msr_common())


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH v5 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-05-28 14:45 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6f1ec3d8ebcf3aaceccc099c07d0deb545dd4ab9.1779133590.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

Thank you Ashish!

> From: Ashish Kalra <ashish.kalra@amd.com>
>
> When SEV-SNP is enabled, all writes to memory are checked to ensure
> integrity of SNP guest memory. This imposes performance overhead on the
> whole system.
>
> RMPOPT is a new instruction that minimizes the performance overhead of
> RMP checks on the hypervisor and on non-SNP guests by allowing RMP
> checks to be skipped for 1GB regions of memory that are known not to
> contain any SEV-SNP guest memory.
>
> Add support for performing RMP optimizations asynchronously using a
> dedicated workqueue.
>
> Enable RMPOPT optimizations globally for all system RAM up to 2TB at

This should also be updated to say "Enable RMPOPT optimizations for up
to 2TB worth of system RAM at..."

The current phrasing sounds like only addresses [0, 2TB) are allowed to
be optimized, but actually any address [start, start + 2TB) can be
optimized?

> RMP initialization time. RMP checks can initially be skipped for 1GB
> memory ranges that do not contain SEV-SNP guest memory (excluding
> preassigned pages such as the RMP table and firmware pages). As SNP
> guests are launched, RMPUPDATE will disable the corresponding RMPOPT
> optimizations.
>
> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  arch/x86/virt/svm/sev.c | 167 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 164 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 82f9dc7a57c3..8876cac052d5 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -19,6 +19,7 @@
>  #include <linux/iommu.h>
>  #include <linux/amd-iommu.h>
>  #include <linux/nospec.h>
> +#include <linux/workqueue.h>
>
>  #include <asm/sev.h>
>  #include <asm/processor.h>
> @@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
>  static u64 probed_rmp_base, probed_rmp_size;
>
>  static cpumask_t rmpopt_cpumask;
> -static phys_addr_t rmpopt_pa_start;
> +static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
> +
> +enum rmpopt_function {
> +	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
> +	RMPOPT_FUNC_REPORT_STATUS
> +};
> +
> +#define RMPOPT_WORK_TIMEOUT	10000
> +
> +static struct workqueue_struct *rmpopt_wq;
> +static struct delayed_work rmpopt_delayed_work;
> +static DEFINE_MUTEX(rmpopt_wq_mutex);
>
>  static LIST_HEAD(snp_leaked_pages_list);
>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
> @@ -564,12 +576,21 @@ EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>
>  static void rmpopt_cleanup(void)
>  {
> +	guard(mutex)(&rmpopt_wq_mutex);
> +
> +	if (!rmpopt_wq)
> +		return;
> +
> +	cancel_delayed_work_sync(&rmpopt_delayed_work);
> +	destroy_workqueue(rmpopt_wq);
> +
>  	cpus_read_lock();
>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
>  	cpus_read_unlock();
>
>  	cpumask_clear(&rmpopt_cpumask);
> -	rmpopt_pa_start = 0;
> +	rmpopt_pa_start = rmpopt_pa_end = 0;
> +	rmpopt_wq = NULL;
>  }
>
>  void snp_shutdown(void)
> @@ -587,6 +608,105 @@ void snp_shutdown(void)
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
>
> +static inline bool __rmpopt(u64 rax, u64 rcx)

Perhaps use pa_start instead of rax and op_type for rcx?

> +{
> +	bool optimized;
> +
> +	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
> +		     : "=@ccc" (optimized)
> +		     : "a" (rax), "c" (rcx)
> +		     : "memory", "cc");
> +
> +	return optimized;
> +}
> +
> +static void rmpopt(u64 pa)
> +{
> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
> +

And pa_start and op_type here too.

> +	__rmpopt(rax, rcx);
> +}
> +
> +/*
> + * 'val' is a system physical address.
> + */
> +static void rmpopt_smp(void *val)
> +{
> +	rmpopt((u64)val);
> +}
> +
> +/*
> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
> + * range of memory does not contain any SNP guest memory.
> + */
> +static void rmpopt_work_handler(struct work_struct *work)
> +{
> +	bool current_cpu_cleared = false;
> +	phys_addr_t pa;
> +	int this_cpu;
> +
> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
> +		rmpopt_pa_start, rmpopt_pa_end);
> +
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. This potentially allows the

I like the leader and follower comments below, thanks! With this
leader/follower setup, will the followers definitely see the cached scan
results, or might the followers still potentially not benefit from the
caching? If it's still only "potentially", why?

> +	 * followers to use the "cached" scan results to avoid repeating
> +	 * full scans.
> +	 */
> +
> +	/*
> +	 * Pin the worker to the current CPU for the leader loop so that
> +	 * this_cpu remains valid and the RMPOPT instruction executes on
> +	 * the CPU that was cleared from the cpumask.  The workqueue is
> +	 * WQ_UNBOUND, so without pinning, the scheduler could migrate
> +	 * the worker between the cpumask manipulation and the leader
> +	 * loop, causing the leader to run on a different CPU while
> +	 * this_cpu's core is skipped entirely.
> +	 *
> +	 * Use migrate_disable() rather than get_cpu() to prevent
> +	 * migration while still allowing preemption.
> +	 *
> +	 * Note: rmpopt_cpumask is modified here without holding
> +	 * rmpopt_wq_mutex.  This is safe because the delayed_work
> +	 * mechanism guarantees single-threaded execution of this
> +	 * handler, and rmpopt_cleanup() calls cancel_delayed_work_sync()
> +	 * to ensure handler completion before tearing down the cpumask.
> +	 */
> +	migrate_disable();
> +	this_cpu = smp_processor_id();
> +	if (cpumask_test_cpu(this_cpu, &rmpopt_cpumask)) {
> +		cpumask_clear_cpu(this_cpu, &rmpopt_cpumask);
> +		current_cpu_cleared = true;
> +	}
> +

Instead of reusing the global rmpopt_cpumask, why not make a copy of
rmpopt_cpumask for this function? Then this function won't have to
figure out current_cpu_cleared or restore rmpopt_cpumask at the end.

I'm thinking to also drop the test and clear, this function can just
always clear, like

  cpumask_clear_cpu(smp_processor_id(), followers_cpumask);

and later

  on_each_cpu_mask(&followers_cpumask, ...);

Actually, if for whatever reason cpumask_test_cpu(this_cpu,
&rmpopt_cpumask) above returns false, would that mean somehow some cpu
exists that wasn't enabled right when rmpopt was initialized? If yes,
what happens if we call rmpopt() on a cpu where it wasn't initialized?

> +	/* Leader: prime the RMPOPT cache on this CPU */
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
> +		rmpopt(pa);
> +
> +	migrate_enable();
> +
> +	/* Followers: run RMPOPT on all other cores */
> +	cpus_read_lock();
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
> +				 (void *)pa, true);
> +
> +		 /* Give a chance for other threads to run */
> +		cond_resched();
> +	}
> +	cpus_read_unlock();
> +
> +	if (current_cpu_cleared)
> +		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
> +}
> +
>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: David Woodhouse @ 2026-05-28 15:32 UTC (permalink / raw)
  To: Jürgen Groß, Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	Paul Durrant
In-Reply-To: <8d3f3cbe-89ca-44df-b35f-90c2724e9154@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1168 bytes --]

On Thu, 2026-05-28 at 16:33 +0200, Jürgen Groß wrote:
> On 28.05.26 15:21, Sean Christopherson wrote:
> > On Thu, May 28, 2026, Jürgen Groß wrote:
> > > On 28.05.26 15:09, Sean Christopherson wrote:
> > > > On Thu, May 28, 2026, Juergen Gross wrote:
> > > > > Please disregard this series, there is one complication sashiko made me
> > > > > aware of.
> > > > 
> > > > Sashiko beat me to the punch. :-)
> > > > 
> > > > See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
> > > > for a real world example of how things can and will go wrong.
> > > 
> > > Yeah, with Sashiko's pointer it was easy to spot.
> > > 
> > > Question now is whether the already existing cases of -errno passed as return
> > > value are wrong or on purpose.
> > 
> > What are the existing cases?
> 
> Found another one:
> 
> kvm_xen_write_hypercall_page() (called by kvm_set_msr_common())

You mean in the case where it's using the user-provided hypercall page,
and can't copy from the buffer that the VMM provided?

I think that's correct to return -errno via PTR_ERR() and let the guest
die?

The rest return 0 or 1.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Jürgen Groß @ 2026-05-28 15:36 UTC (permalink / raw)
  To: David Woodhouse, Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	Paul Durrant
In-Reply-To: <a6fdbcc655ea24209ed03e377f7e4a2a4d43bd9c.camel@infradead.org>


[-- Attachment #1.1.1: Type: text/plain, Size: 1337 bytes --]

On 28.05.26 17:32, David Woodhouse wrote:
> On Thu, 2026-05-28 at 16:33 +0200, Jürgen Groß wrote:
>> On 28.05.26 15:21, Sean Christopherson wrote:
>>> On Thu, May 28, 2026, Jürgen Groß wrote:
>>>> On 28.05.26 15:09, Sean Christopherson wrote:
>>>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>>>> Please disregard this series, there is one complication sashiko made me
>>>>>> aware of.
>>>>>
>>>>> Sashiko beat me to the punch. :-)
>>>>>
>>>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
>>>>> for a real world example of how things can and will go wrong.
>>>>
>>>> Yeah, with Sashiko's pointer it was easy to spot.
>>>>
>>>> Question now is whether the already existing cases of -errno passed as return
>>>> value are wrong or on purpose.
>>>
>>> What are the existing cases?
>>
>> Found another one:
>>
>> kvm_xen_write_hypercall_page() (called by kvm_set_msr_common())
> 
> You mean in the case where it's using the user-provided hypercall page,
> and can't copy from the buffer that the VMM provided?

Yes.

> 
> I think that's correct to return -errno via PTR_ERR() and let the guest
> die?

In this case I think a comment in this regard would be nice, as it would
prevent others stumbling over it asking the same question again.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Jürgen Groß @ 2026-05-28 15:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <ahhBQDvyzHQfTCBD@google.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 1487 bytes --]

On 28.05.26 15:21, Sean Christopherson wrote:
> On Thu, May 28, 2026, Jürgen Groß wrote:
>> On 28.05.26 15:09, Sean Christopherson wrote:
>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>> Please disregard this series, there is one complication sashiko made me
>>>> aware of.
>>>
>>> Sashiko beat me to the punch. :-)
>>>
>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)")
>>> for a real world example of how things can and will go wrong.
>>
>> Yeah, with Sashiko's pointer it was easy to spot.
>>
>> Question now is whether the already existing cases of -errno passed as return
>> value are wrong or on purpose.
> 
> What are the existing cases?
> 
>> If the latter, there should be a comment for
>> that, otherwise they need to be fixed..
>>
>> Disentangling the MSR emulation return values from the "normal" ones ("return
>> to guest"/"return to user mode") will be quite interesting with the overloaded
>> semantics of "1".
> 
> LOL, "interesting".

What do you think about the following idea:

Lets pass struct msr_info * down to all functions which get their return
value passed up. Then extend msr_info with a bool "return_to_guest" (valid
only if !host_initiated), which should be set instead of passing "1" up to
the caller (probably using an inline helper). Then the return value could
be 0 or -errno, and after MSR emulation the return_to_guest indicator can
be tested if needed.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Dave Hansen @ 2026-05-28 16:43 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
	linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On 5/28/26 03:14, Kiryl Shutsemau wrote:
> +	switch (size) {
> +	case 1:
> +		*(u8 *)&regs->ax = (u8)val;
> +		break;
> +	case 2:
> +		*(u16 *)&regs->ax = (u16)val;
> +		break;

Is this intentionally clever to only work on little endian, or
accidentally clever? This seems like a great IOCCC thing to do, but it's
far too clever for my taste.

I mean, it's making a pointer to a 64-bit value with an 8-bit name and
casting that to an 8-bit pointer and then assigning that to a 32-bit
value cast to a u8.

Is it just my tiny brain that thinks this will be unintelligible on Monday?

How about we just make the CPU do the thinking for us? IN[BWL] and
MOV[BWL] have the same semantics here, right? So even if 'rax' and 'val'
are 64-bit values here, the following should have all the right
behaviors, I think.

I generally loathe inline assembly. But we have a CPU that kinda knows
the rules already. No need for us to laboriously reimplement it. Right?

Thanks to the friendly LLM that knows inline assembly better than I do.
The resulting compiled assembly looks right to me.

/*
 * Use MOV[BWL] to/from registers to match the IN[BWL] behavior
 * including the fact that INL zeros the upper 64-bits while
 * IN[BW] don't zero anything.
 */

    switch (size) {
    case 1:
	// Just write 1 byte of RAX:
        __asm__ volatile ("movb %b1, %b0" : "+q"(rax)
                                          : "q"(val));
        break;
    case 2:
	// Write 2 bytes of RAX:
        __asm__ volatile ("movw %w1, %w0" : "+r"(rax)
                                          : "r"(val));
        break;
    case 4:
	// Write 'val' into lower 32 bits. Zero the upper 32 bits:
        __asm__ volatile ("movl %k1, %k0" : "=r"(rax)
                                          : "r"(val));
        break;
    default:
	// WARN
    }

Thoughts?

^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Dave Hansen @ 2026-05-28 17:25 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
	linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On 5/28/26 03:14, Kiryl Shutsemau wrote:
> What about the patch below. Inspired by kvm's assign_register().

I think I could stand this if it consolidated this site with kvm's
assign_register(). The copy/paste is too much to bear.



^ permalink raw reply

* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-05-28 18:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <20260522042815.370873-6-aneesh.kumar@kernel.org>

From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>Sent: Thursday, May 21, 2026 9:28 PM
> 
> Teach the atomic DMA pool code to distinguish between encrypted and
> unencrypted pools, and make pool allocation select the matching pool based
> on DMA attributes.
> 
> Introduce a dma_gen_pool wrapper that records whether a pool is
> unencrypted, initialize that state when the atomic pools are created, and
> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> to take attrs and skip pools whose encrypted state does not match
> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
> 
> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
> decrypted swiotlb allocations are taken from the correct atomic pool.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  drivers/iommu/dma-iommu.c   |   2 +-
>  include/linux/dma-map-ops.h |   2 +-
>  kernel/dma/direct.c         |  11 ++-
>  kernel/dma/pool.c           | 167 +++++++++++++++++++++++-------------
>  kernel/dma/swiotlb.c        |   7 +-
>  5 files changed, 123 insertions(+), 66 deletions(-)
>

[snip]
 
> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> +		size_t pool_size, gfp_t gfp)
>  {
> -	struct gen_pool *pool;
>  	int ret;
> 
> -	pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> -	if (!pool)
> +	dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> +	if (!dma_pool->pool)
>  		return NULL;
> 
> -	gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> +	gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> +
> +	/* if platform is using memory encryption atomic pools are by default decrypted. */
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> +		dma_pool->unencrypted = true;
> +	else
> +		dma_pool->unencrypted = false;

I'm curious about the name of the "unencrypted" field in struct dma_gen_pool,
and similarly in Patch 7 of the series for the swiotlb struct io_tlb_pool and
struct io_tlb_mem. Up through v3 of this series, you used "decrypted", but
starting in v4 switched to "unencrypted".

To me, the above "if" statement has some cognitive dissonance in that if
CC_ATTR_MEM_ENCRYPT is false (i.e., a normal VM), "unencrypted" is set
to false. But I think of memory in a normal VM as "unencrypted" since it
was never encrypted. A similar "if" statement occurs in your swiotlb changes.

Two related concepts are captured by the field:
1) Is some action needed to put the memory into the unencrypted state,
and to remove it from that state? This applies when assigning memory to the
pool, or freeing the memory in the pool.
2) Is the memory currently in the unencrypted state? This applies when
allocating memory from the pool to a caller.

It's hard to capture all that in a short field name. But I think I prefer "decrypted"
over "unencrypted".  The former implies that some action was taken. It's a
little easier to think of a normal VM as *not* having decrypted memory. The
memory was never encrypted in the first place, so no decryption action was taken.

Throughout the kernel, "decrypted" occurs much more frequently than
"unencrypted".  We have set_memory_encrypted() and set_memory_decrypted()
that are "take action" names.  But we also have force_dma_unencrypted(),
phys_to_dma_unencrypted(), and dma_addr_unencrypted(). So it's a bit
of a mess.

But maybe there's more background here that led to the change
between your v3 and v4.

Michael

^ permalink raw reply

* Re: [PATCH v13 07/22] KVM: selftests: Introduce structures for TDX guest boot parameters
From: Yosry Ahmed @ 2026-05-28 19:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAO9r8zMaiGL8v=f72EAwWbwofoUHOkH8r6Se22k2TVxnUCQLOQ@mail.gmail.com>

On Fri, May 22, 2026 at 04:50:07PM -0700, Yosry Ahmed wrote:
> > > Sean, is this the preferred way to expose offsets to asm files (or asm
> > > code blocks) -- as opposed to say using .equ [*]?
> >
> > For actual .S assembly, yes.  For inline asm, maybe?  If it looks prettier, go
> > for it.
> >
> > > If yes, I can rework my nVMX GPR fixes to use the same approach for
> > > register offsets. I wonder if the non-TDX part of this patch (i.e.
> > > Makefile stuff) can be split, then patch 6 and the Makefile stuff can
> > > land independently and allow development on top.
> > >
> > > I can also split them out and include them in the next version of my
> > > series, then whichever series lands first will land the offsets
> > > support.
> > >
> > > WDYT?
> >
> > Hmm, I'd say keep your series as-is for now.  The OFFSET() infrastructure really
> > shines for proper assembly.  For what you're doing, AFAICT it's only marginally
> > better.  So I don't think it's worth juggling dependencies to use it right away,
> > we can always convert if/when the TDX series lands the fancy stuff.
> 
> Ack. We can do the switch later like you say.

I take this back. My series builds with the internal toolchain, but not
when I just use make with LLVM. Probably different compiler versions or
build options, but the fact the .equ thing doesn't always work means I
can't use it.

I would paste the error here, but the compiler literally spits out
incomprehensible garbage.

Lisa, if you will send a new version of this series for other reasons,
do you mind splitting out the non-TDX parts of this patch? Ideally we'd
have 1-2 patches that introduce the OFFSET() infrastructure without any
TDX parts, which should make it easier to pick up separately or include
with other series.

If a new version won't be needed anyway, I will just wait for this to
land before refreshing my series on top.

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-28 19:37 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260528004332.GDahePtGqVp2boiEJL@fat_crate.local>

Hello Boris and Dave,

On 5/27/2026 7:43 PM, Borislav Petkov wrote:
> On Wed, May 27, 2026 at 02:38:05PM -0700, Dave Hansen wrote:
>> This one is my doing.
> 
> I know.
> 
> But hey, maybe we should not disagree on the public ML because the submitter
> might disappear like the last one. :-P
> 
>> wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
>> also not very flexible because it needs a 'struct msr __percpu *msrs'
>> argument where each MSR has a value in memory.
> 
> Right, we did that a looong time ago.
> 
> The only reason I'd have for per-CPU MSR structs is reading different MSR
> values on different cores, modifying only the bits you need and then *keeping*
> the remaining values as they were. And that interface allows you to do that
> while this new thing won't.
> 
> And I'm going to venture a guess here that adding a simpler interface which
> simply forces a new value ontop of a whole MSR could cause a lot of subtle
> bugs when people don't pay attention to keep the old values.
> 
>> The use case for RMPOPT is that all CPUs get the same value. It'd be a
>> little awkward to go create a percpu data structure to duplcate the same
>> value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
>> performance sensitive since it's done during boot. It should do the IPIs
>> in parallel.
> 
> Oh sure, my meaning was to create something that serves both purposes.
> 
>> toggle_ecc_err_reporting(), on the other hand, is done at module init
>> time. It's not really performance sensitive. It's probably pretty easy
>> to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
>> something slightly less efficient.
> 
> Sure. That's fine.
> 
>> Yeah, the
>>
>> 	wrmsr_on_cpus()
>> 	wrmsrq_on_cpus()
>>
>> naming pain is real. There's little chance of bugs coming from it
>> because the function signatures are *SO* different. But, it certainly
>> could confuse humans for a minute.
> 
> Yap.
> 
>> But the real solution to this is axing wrmsr_on_cpus(). 
> 
> Yap, for example. Basically reingeneering the whole
> write-MSRs-on-multiple-CPUs functionality is what I meant.
> 
>> Which I think we could do after killing its one user which the attached
>> (completely untested) patch does. The only downside of the patch is that it
>> does RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not
>> sure anyone would care. If anyone did, I _think_ all those MSRs have the
>> same value and the code could be simplified further. But that would take
>> more than 3 minutes.
>>
>> It's also possible that my grepping was bad or I'm completely
>> misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
> 
> Looks ok to me, we can surely do that. I even hw to test it. I think...
> 
>> BTW, I also don't feel the need to make Ashish go do any of this edac
>> cleanup. I think it can just be done in parallel. But I wouldn't stop
>> him if he volunteered.
> 
> Why not?
> 
> It has always been the case: cleanups and bug fixes first, new features ontop.
> 
> So yeah, modulo figuring out how to redefine the *msr_on_cpus() interface,
> I think this all makes sense.

snp_setup_rmpopt() runs once during init and rmpopt_cleanup() runs once during shutdown. The batch IPI optimization
is irrelevant here. This RMPOPT_BASE MSR setup/programming is not in a performance critical path.

A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:

  for_each_cpu(cpu, &rmpopt_cpumask)
      wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

Calling wrmsrq_on_cpus() here for programming RMPOPT_BASE MSR:

-       wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+       for_each_cpu(cpu, &rmpopt_cpumask)
+               wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

So i will drop this helper patch.

Thanks,
Ashish

> 
> Thx.
> 

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-28 19:50 UTC (permalink / raw)
  To: Kalra, Ashish, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <2d164e19-5cc6-47ca-9150-f4d432dd10c4@amd.com>

On 5/28/26 12:37, Kalra, Ashish wrote:
> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
> 
>   for_each_cpu(cpu, &rmpopt_cpumask)
>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

I'm glad we're on the same page finally. I just hope we can get to this
point more quickly next time. I started off with exactly this
suggestion, but someone chimed in to the thread and said it was "slower":

> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/


^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Sohil Mehta @ 2026-05-28 19:50 UTC (permalink / raw)
  To: Xu Yilun
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <ahfJ9MW+kJp+kE6A@yilunxu-OptiPlex-7050>

On 5/27/2026 9:52 PM, Xu Yilun wrote:

> No the memory needed varies depends on the feature or the number of
> features. But currently I see the total requirement is ~50MB.
> 
This is important consideration when defining the default policy. Could
you please elaborate on how this will scale in the future?

How are the memory requirements expected to grow with additional features?

Let's say a future platform has a lot more features and needs
significantly more memory. Wouldn't loading a legacy kernel with this
default policy lead to excessive wastage?

Maybe I am missing something obvious. The struct in patch 1,
memory_pool_required_pages is u16. So, will the Extensions support never
require more than 256MB?

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-28 19:55 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c40dcb8c-5706-4c0f-ac85-c22957b9e192@intel.com>

Hello Dave,

On 5/28/2026 2:50 PM, Dave Hansen wrote:
> On 5/28/26 12:37, Kalra, Ashish wrote:
>> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
>>
>>   for_each_cpu(cpu, &rmpopt_cpumask)
>>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> 
> I'm glad we're on the same page finally. I just hope we can get to this
> point more quickly next time. I started off with exactly this
> suggestion, but someone chimed in to the thread and said it was "slower":
> 
>> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
> 

Yes, actually i should have made it explicitly clear that we need to do it in
parallel especially for issuing the RMPOPT instruction itself, as that is
in a performance critical path (and for that we are using on_each_cpu_mask()).

Thanks,
Ashish

^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: David Laight @ 2026-05-28 19:58 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Rick Edgecombe,
	Kuppuswamy Sathyanarayanan, Kai Huang, Sean Christopherson,
	Borys Tsyrulnikov, linux-kernel, linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On Thu, 28 May 2026 11:14:38 +0100
Kiryl Shutsemau <kas@kernel.org> wrote:

> On Wed, May 27, 2026 at 10:45:28AM -0700, Dave Hansen wrote:
> > On 5/27/26 05:05, Kiryl Shutsemau (Meta) wrote:
> > ...  
> > > -	/* Update part of the register affected by the emulated instruction */
> > > -	regs->ax &= ~mask;
> > > +	/*
> > > +	 * IN writes the result into a sub-register of RAX. Only the
> > > +	 * 32-bit form zero-extends; the smaller forms leave the upper
> > > +	 * bits untouched:
> > > +	 *
> > > +	 *   insn  dest  size  bits written     bits preserved
> > > +	 *   inb   AL    1     RAX[ 7: 0]       RAX[63: 8]
> > > +	 *   inw   AX    2     RAX[15: 0]       RAX[63:16]
> > > +	 *   inl   EAX   4     RAX[63: 0]       (none, zero-extended)
> > > +	 *
> > > +	 * 'mask' only covers the low 'size' bytes, which is exactly the
> > > +	 * range affected for size 1 and 2. For size 4 the write also
> > > +	 * clears RAX[63:32], so widen the clear-mask.
> > > +	 */
> > > +	if (size == 4)
> > > +		regs->ax = 0;
> > > +	else
> > > +		regs->ax &= ~mask;
> > > +  
> > 
> > Is there any way we could do this with fewer comments and more code?
> > 
> > I mean, there's only three cases. Why have;
> > 
> > 	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
> > 
> > When there are only 3 possible cases:
> > 
> > 	1 => 0xf
> > 	2 => 0xff
> > 	4 => 0xffff
> > 
> > and one of those cases needs a special case on top of it.
> > 
> > Maybe something like this?
> > 
> > 	/* Clear out part of RAX so part of args.r11 can be OR'd in: */
> > 	switch (size) {
> > 	case 1:
> > 		/* inb consumes lower 8 bits of r11: */
> > 		regs->ax &= ~GENMASK_ULL(7, 0);
> > 		args.r11 &=  GENMASK_ULL(7, 0);
> > 		break;
> > 	case 2:
> > 		/* inw consumes lower 16 bits of r11: */
> > 		regs->ax &= ~GENMASK_ULL(15, 0);
> > 		args.r11 &=  GENMASK_ULL(15, 0);
> > 		break;
> > 	case 4:
> > 		/* inl is weird and zeros the whole register: */
> > 		regs->ax &= ~GENMASK_ULL(63, 0);
> > 		/* But only consumes 32-bits from r11: */
> > 		args.r11 &=  GENMASK_ULL(31, 0);
> > 		break;
> > 	default:
> > 		/* Probable TDX module bug. Illegal in[bwl] size: */
> > 		WARN_ON_ONCE(1);
> > 		success = 0;
> > 	}
> > 
> > 	if (success)
> > 		regs->ax |= args.r11;
> > 
> > It might need a temporary variable for args.r11, but you get the point.
> > That's basically the data from the comment but written as code.  
> 
> I hate how verbose it is. All these GENMASK_ULL() make it hard to
> follow.
> 
> What about the patch below. Inspired by kvm's assign_register().
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 65119362f9a2..460b9fbabf14 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
> +	u32 val;
>  
>  	/*
>  	 * Emulate the I/O read via hypercall. More info about ABI can be found
> @@ -703,10 +703,33 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  	 */
>  	success = !__tdx_hypercall(&args);
>  
> -	/* Update part of the register affected by the emulated instruction */
> -	regs->ax &= ~mask;
>  	if (success)
> -		regs->ax |= args.r11 & mask;
> +		val = args.r11;
> +	else
> +		val = 0;
> +
> +	/*
> +	 * IN writes the result into a sub-register of RAX.
> +	 *
> +	 * Only the 32-bit form zero-extends; the smaller forms leave
> +	 * the upper bits untouched.
> +	 */
> +	switch (size) {
> +	case 1:
> +		*(u8 *)&regs->ax = (u8)val;
> +		break;
> +	case 2:
> +		*(u16 *)&regs->ax = (u16)val;
> +		break;
> +	case 4:
> +		/* zero-extended */
> +		regs->ax = val;
> +		break;
> +	default:
> +		/* Probable TDX module bug. Illegal in[bwl] size. */
> +		WARN_ON_ONCE(1);
> +		break;
> +	}

Just write it as normal arithmetic code:

	/* IN writes the result into a sub-register of RAX. */
	switch (size) {
	case 1:
		regs->ax = (regs->ax & ~0xfful) | (val & 0xff);
		break;
	case 2:
		regs->ax = (regs->ax & ~0xfffful) | (val & 0xffff);
		break;
	case 4:
	default:
		/* 32bit 'INB' will zero the high bits. */
		regs->ax = val
		break;
	}

Succinct, obvious and readable.

-- David


>  
>  	return success;
>  }


^ permalink raw reply

* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Edgecombe, Rick P @ 2026-05-28 21:00 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-2-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> +struct tdx_sys_info_ext {
> +	u16 memory_pool_required_pages;

> +	u8 ext_required;

The docs say this is a bool.

> +};
> +


^ permalink raw reply

* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Edgecombe, Rick P @ 2026-05-28 21:17 UTC (permalink / raw)
  To: kas@kernel.org, yilun.xu@linux.intel.com
  Cc: Xu, Yilun, x86@kernel.org, baolu.lu@linux.intel.com, Li, Xiaoyao,
	djbw@kernel.org, linux-kernel@vger.kernel.org, Duan, Zhenzhong,
	Mehta, Sohil, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Fang, Peter
In-Reply-To: <ahfD1KMYnTrXJziq@yilunxu-OptiPlex-7050>

On Thu, 2026-05-28 at 12:25 +0800, Xu Yilun wrote:
> > > 
> > > If I read the TDX module base spec correctly, the amount of memory for
> > > extensions and EXT_REQUIRED field depends on the enabled features, which
> > > is
> > > determined by TDH.SYS.CONFIG/TDH.SYS.UPDATE ?
> 
> Yes.
> 
> > 
> > This is my read too. Looks like we need a separate step after
> > config_tdx_module() to readout config-dependatant metadata.
> 
> 
> The timing for when metadata becomes valid is now variable, e.g., the
> TDX QUOTING metadata is only valid after TDH.QUOTE.INIT [1].
> 
> Based on recent discussion, I think we should introduce runtime metadata
> reading interfaces for specific metadata sets as needed, rather than
> another catch-all step right after config_tdx_module(). See [2] for the
> proposed approach for Extensions metadata.
> 
> [1]:
> https://lore.kernel.org/all/20260522034128.3144354-7-yilun.xu@linux.intel.com/
> [2]: https://lore.kernel.org/all/ahXAL41ZmIDHmgfu@yilunxu-OptiPlex-7050/

Yea It is going to get confusing as to which metadata is populated at which
step. And if anything updates it.

I'm not sure we need to have all the metadata stored permanently. Some of the
metadata is needed for KVM and someday TSM. But a lot of it is onetime internal
use. There is some handiness in referring to a global var, but also those
reference add confusion as to when it got populated.

We only use ext_required, max_quote_size and memory_pool_required_pages each
once. So why not just read them to the stack and leave them out of struct
tdx_sys_info? Making it so there is not confusion of when it was read. And also
saving a global var that is never used again is a bit wrong.

How about for struct tdx_sys_info_ext read it to the stack in init_tdx_ext() and
pass it into init_tdx_ext_features(). For max_quote_size read it where it is
already read, but not into the global struct.

Do you see a problem?

^ permalink raw reply

* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Edgecombe, Rick P @ 2026-05-28 21:32 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-5-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> The detailed initialization flow for TDX Module Extensions has been
> fully implemented.
> 

I'm not sure what this means exactly. Why "detailed". Is that important?

>  Enable the flow after basic TDX Module
> initialization.
> 
> Theoretically, the Extensions doesn't need to be enabled right after
> basic TDX initialization. It could be enabled right before the first
> Extension SEAMCALL is issued. That would save or postpone memory usage.
> But it isn't worth the complexity, the needs for the Extensions are vast
> but the savings are little for a typical TDX capable system (about
> 0.001% of memory). So the Linux decision is to just enable it along with
> the basic TDX.

The Linux decision is whatever this patch turns out to be after community
review. So for the patch log we just need to justify why it's a good idea, not
not make an argument to defer to authority.

> 
> Note that the Extensions initialization flow will still not start if no
> add-on features require Extensions. The enabling of add-on features will
> be in later patches. Until then, the system hasn't consumed extra memory.

Hmm, this patch reads like we are finally doing the initialization up until this
point. Then it turns out we don't actually light up the new code yet... 

A lot of this diff is adding __init to the function added in the earlier
patches. Do we need to do this? Why not add them as __init in the original
patches?


I think we maybe want to say instead that we are setting up to enable extensions
at TDX module init time, and do the explanation of why. Then without the __init
stuff, the patch is just about the init time decision. Which seems about right
sized.

^ permalink raw reply

* Re: [RFC PATCH 05/15] x86/virt/tdx: Move tdx_tdr_pa() up in the file
From: Edgecombe, Rick P @ 2026-05-28 21:32 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-6-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Move the tdx_tdr_pa() in preparation for upcoming changes to use them
> during TDX bringup.
> 
> No functional change intended.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply

* Re: [RFC PATCH 06/15] x86/virt/tdx: Initialize Quoting extension during bringup
From: Edgecombe, Rick P @ 2026-05-28 21:35 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-7-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Initialize the Quoting extension and fetch its metadata during TDX
> bringup.
> 
> Because Quoting is an optional TDX feature, do not let its
> initialization failures cause TDX bringup to fail.
> 
> This patch
> 

Don't say "this patch" in tip logs. The patch is a temporary format, and some
x86 maintainers hate the term in logs.

>  does not include the opt-in portion of the initialization.
> It mainly lays the groundwork for TDX Quoting support. Opt-in will be
> added in a follow-up patch once the feature can be properly used by the
> system.

This could be imperative mood.

> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>


^ permalink raw reply

* Re: [RFC PATCH 07/15] x86/virt/tdx: Prepare Quote buffer during extension bringup
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-8-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> The host uses a Quote buffer to communicate with the TDX module when
> generating Quotes.
> 

Can this be put in common terms. This is going to mean nothing to someone
reading this that doesn't already know the feature.

>  Because the Quote buffer is shared with TDX guests,

Why capitalize "Quote"?

> prepare the required metadata during Quoting extension bringup.

What does prepare the required metadata mean?

How does it being shared with TDX guest suggest this? Just that TDX guests will
need them? Is the reason just that only one is needed, so do it during global
init? 

> 
> This mostly involves determining the physical addresses of the Quote
> buffer pages and arranging them in the HPA_LINKED_LIST format defined by
> the Intel TDX Module ABI specification.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 85 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fb84fb6d952b..9d04293394d7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -32,6 +32,7 @@
>  #include <linux/idr.h>
>  #include <linux/kvm_types.h>
>  #include <linux/bitfield.h>
> +#include <linux/vmalloc.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -61,6 +62,13 @@ static LIST_HEAD(tdx_memlist);
>  static struct tdx_sys_info tdx_sysinfo __ro_after_init;
>  static bool tdx_module_initialized __ro_after_init;
>  
> +static struct quote_data {
> +	void *buf;
> +	u64 buf_len;
> +	u64 *hpa_list;
> +	phys_addr_t hpa_list_pa;
> +} quote_data;

Hmm, I think this should separate the type and variable declaration. It's not a
common pattern. I don't think there is an official rule.

> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1205,9 +1213,78 @@ static inline u64 tdx_tdr_pa(struct tdx_td *td)
>  	return page_to_phys(td->tdr_page);
>  }
>  
> +#define HPAS_PER_PAGE			(PAGE_SIZE / sizeof(u64))
> +
> +static int tdx_quote_create_buf(unsigned int nr_pages, struct quote_data *qdata)
> +{
> +	unsigned long pfn;
> +	u64 qlist_npages;
> +	int err, i, j;
> +	u64 *qlist;
> +	void *qbuf;
> +
> +	if (!nr_pages)
> +		return -EINVAL;
> +
> +	/* The last entry of a linked list page points to the next page	*/
> +	qlist_npages = (u64)DIV_ROUND_UP(nr_pages, HPAS_PER_PAGE - 1);
> +
> +	qlist = vmalloc_array(qlist_npages, PAGE_SIZE);
> +	if (!qlist) {
> +		err = -ENOMEM;
> +		goto out_err;

Just return ENOMEM here. vfree() doesn't do any work if passed NULL, but it's
weird flow.

> +	}
> +
> +	/*
> +	 * Make sure unfilled entries are always -1, which means NULL in TDX.

Huh?

> +	 * Only the last page needs to be filled. All the other pages will be
> +	 * fully populated.
> +	 */
> +	memset((u8 *)qlist + (qlist_npages - 1) * PAGE_SIZE, 0xff, PAGE_SIZE);

What are the entries? And what is a -1 in u8? Or is it supposed to be u64?
Please make this a lot clearer.

> +
> +	qbuf = vcalloc(nr_pages, PAGE_SIZE);
> +	if (!qbuf) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	/* Populate HPA_LINKED_LIST as per TDX ABI spec */
> +	for (i = 0, j = 0; j < nr_pages; i++) {
> +		if ((i % HPAS_PER_PAGE) == HPAS_PER_PAGE - 1) {
> +			/*
> +			 * The last entry always points to the next page. The
> +			 * address of the following entry must be on next page's
> +			 * boundary.
> +			 */

Can you maybe just explain this format that you are building in like one
sentence at the beginning of the function? "The quote buffer is passed to the
tdx module in a format that like... (some common terms that have no TDX
jargon)."

> +			pfn = vmalloc_to_pfn(&qlist[i + 1]);
> +			qlist[i] = PFN_PHYS(pfn);
> +			continue;
> +		}
> +
> +		pfn = vmalloc_to_pfn((u8 *)qbuf + j * PAGE_SIZE);
> +		qlist[i] = PFN_PHYS(pfn);
> +		j++;
> +	}
> +
> +	qdata->buf = qbuf;
> +	qdata->buf_len = (u64)nr_pages * PAGE_SIZE;
> +	qdata->hpa_list = qlist;
> +
> +	pfn = vmalloc_to_pfn(qlist);

Do we need a vmalloc_to_pa() helper? Maybe put it in terms of tdx format. Like
vmalloc_pfn_to_tdxpa() and keep it here? The tdx update stuff does this a bunch
too.

> +	qdata->hpa_list_pa = PFN_PHYS(pfn);
> +
> +	return 0;
> +
> +out_err:
> +	vfree(qlist);
> +
> +	return err;

It only returns -ENOMEM, so do we need the err var?

> +}
> +
>  static void tdx_quote_init(void)
>  {
>  	struct tdx_module_args args = {};
> +	unsigned int nr_quote_pages;
>  	u64 r;
>  
>  	do {
> @@ -1218,7 +1295,13 @@ static void tdx_quote_init(void)
>  		return;
>  
>  	/* Quoting metadata is valid only after initialization */
> -	get_tdx_sys_info_quote(&tdx_sysinfo.quote);
> +	if (get_tdx_sys_info_quote(&tdx_sysinfo.quote))
> +		return;

How come this patch gets error handling? Why is it needed now when it wasn't
before?

> +
> +	nr_quote_pages = PAGE_ALIGN(tdx_sysinfo.quote.max_quote_size) /
> +			 PAGE_SIZE;
> +	if (tdx_quote_create_buf(nr_quote_pages, &quote_data))
> +		pr_err("Failed to create quote buffer\n");

Err... what happens in ENOMEM scenario? NULL pointer later?

>  }
>  
>  /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */


^ permalink raw reply

* Re: [RFC PATCH 09/15] x86/virt/tdx: Add interface to generate a Quote
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-10-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> +void *tdx_quote_generate(struct tdx_td *td, void *in_data, u32 in_data_len,
> +			 u32 *quote_len)
> +{
> +	void *quote_dup = NULL;
> +	u64 r, out_len;
> +
> +	if (!tdx_quote_enabled())
> +		return NULL;
> +
> +	/* TDH.QUOTE.GET expects the input data to fit in a page */
> +	if (in_data_len > PAGE_SIZE)
> +		return NULL;

Do we really need this check? We can't trust the caller to pass the right size?

> +
> +	mutex_lock(&tdx_quote_lock);
> +
> +	/*
> +	 * Use the first page of the quote buffer for input data. The buffer
> +	 * must be at least one page in size. @in_data may not be page-aligned,
> +	 * but TDH.QUOTE.GET expects page-aligned addresses.
> +	 */
> +	memcpy(quote_data.buf, in_data, (size_t)in_data_len);
> +
> +	r = tdx_quote_get(td, quote_data.hpa_list[0], (u64)in_data_len,
> +			  quote_data.hpa_list_pa, quote_data.buf_len, &out_len);
> +	if (r || !out_len || out_len > quote_data.buf_len)


How do these various error conditions happen?

> +		goto out;
> +
> +	/*
> +	 * The quote buffer is a shared resource, so use it only for the
> +	 * SEAMCALL and copy the data out as soon as possible.
> +	 */
> +	quote_dup = kvmemdup(quote_data.buf, out_len, GFP_KERNEL);

So at init time we allocate a vmalloc for the quote and pre-populate the
hpa_list. Then we use it every time and copy the contents to a new vmalloc.
Would it really be that hard to keep the hpa list allocation around, do a
vmalloc here and update the pfn list. Then do get quote on that and pass back
the vmalloc we just allocated? Just feels like global reuse way has extra pieces
in it. Compared to the whole quoting operation, this vmalloc_to_pfn() loop is
probably not very expensive.

> +
> +out:
> +	mutex_unlock(&tdx_quote_lock);
> +
> +	*quote_len = (u32)out_len;
> +
> +	return quote_dup;
> +}
> +EXPORT_SYMBOL_FOR_KVM(tdx_quote_generate);
> +


^ permalink raw reply

* Re: [PATCH v5 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-05-28 23:52 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgGfyb7zvZ1u1j7YLomD+JdAxnVW36gtvNG9gxgZ80vMyQ@mail.gmail.com>

Hello Ackerley,

On 5/28/2026 9:45 AM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
> Thank you Ashish!
> 
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> When SEV-SNP is enabled, all writes to memory are checked to ensure
>> integrity of SNP guest memory. This imposes performance overhead on the
>> whole system.
>>
>> RMPOPT is a new instruction that minimizes the performance overhead of
>> RMP checks on the hypervisor and on non-SNP guests by allowing RMP
>> checks to be skipped for 1GB regions of memory that are known not to
>> contain any SEV-SNP guest memory.
>>
>> Add support for performing RMP optimizations asynchronously using a
>> dedicated workqueue.
>>
>> Enable RMPOPT optimizations globally for all system RAM up to 2TB at
> 
> This should also be updated to say "Enable RMPOPT optimizations for up
> to 2TB worth of system RAM at..."
> 
> The current phrasing sounds like only addresses [0, 2TB) are allowed to
> be optimized, but actually any address [start, start + 2TB) can be
> optimized?

Yes, i will update it.

> 
>> RMP initialization time. RMP checks can initially be skipped for 1GB
>> memory ranges that do not contain SEV-SNP guest memory (excluding
>> preassigned pages such as the RMP table and firmware pages). As SNP
>> guests are launched, RMPUPDATE will disable the corresponding RMPOPT
>> optimizations.
>>
>> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> ---
>>  arch/x86/virt/svm/sev.c | 167 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 164 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
>> index 82f9dc7a57c3..8876cac052d5 100644
>> --- a/arch/x86/virt/svm/sev.c
>> +++ b/arch/x86/virt/svm/sev.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/iommu.h>
>>  #include <linux/amd-iommu.h>
>>  #include <linux/nospec.h>
>> +#include <linux/workqueue.h>
>>
>>  #include <asm/sev.h>
>>  #include <asm/processor.h>
>> @@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
>>  static u64 probed_rmp_base, probed_rmp_size;
>>
>>  static cpumask_t rmpopt_cpumask;
>> -static phys_addr_t rmpopt_pa_start;
>> +static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
>> +
>> +enum rmpopt_function {
>> +	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
>> +	RMPOPT_FUNC_REPORT_STATUS
>> +};
>> +
>> +#define RMPOPT_WORK_TIMEOUT	10000
>> +
>> +static struct workqueue_struct *rmpopt_wq;
>> +static struct delayed_work rmpopt_delayed_work;
>> +static DEFINE_MUTEX(rmpopt_wq_mutex);
>>
>>  static LIST_HEAD(snp_leaked_pages_list);
>>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>> @@ -564,12 +576,21 @@ EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>>
>>  static void rmpopt_cleanup(void)
>>  {
>> +	guard(mutex)(&rmpopt_wq_mutex);
>> +
>> +	if (!rmpopt_wq)
>> +		return;
>> +
>> +	cancel_delayed_work_sync(&rmpopt_delayed_work);
>> +	destroy_workqueue(rmpopt_wq);
>> +
>>  	cpus_read_lock();
>>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
>>  	cpus_read_unlock();
>>
>>  	cpumask_clear(&rmpopt_cpumask);
>> -	rmpopt_pa_start = 0;
>> +	rmpopt_pa_start = rmpopt_pa_end = 0;
>> +	rmpopt_wq = NULL;
>>  }
>>
>>  void snp_shutdown(void)
>> @@ -587,6 +608,105 @@ void snp_shutdown(void)
>>  }
>>  EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
>>
>> +static inline bool __rmpopt(u64 rax, u64 rcx)
> 
> Perhaps use pa_start instead of rax and op_type for rcx?
> 

I used these parameters to align with the RMPOPT specifications (rax and rcx)
which i think makes more sense.

>> +{
>> +	bool optimized;
>> +
>> +	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
>> +		     : "=@ccc" (optimized)
>> +		     : "a" (rax), "c" (rcx)
>> +		     : "memory", "cc");
>> +
>> +	return optimized;
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
> 
> And pa_start and op_type here too.
> 
>> +	__rmpopt(rax, rcx);
>> +}
>> +
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> +	rmpopt((u64)val);
>> +}
>> +
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> +	bool current_cpu_cleared = false;
>> +	phys_addr_t pa;
>> +	int this_cpu;
>> +
>> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> +		rmpopt_pa_start, rmpopt_pa_end);
>> +
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. This potentially allows the
> 
> I like the leader and follower comments below, thanks! With this
> leader/follower setup, will the followers definitely see the cached scan
> results, or might the followers still potentially not benefit from the
> caching? If it's still only "potentially", why?

I am verifying with the H/W architects if this is always going to be true or not,
will the followers always benefit from the scan results cached by the leader (first CPU)
or there is a possibility that the followers cannot see/access/get the cached results
and instead do full RMP scanning ?

> 
>> +	 * followers to use the "cached" scan results to avoid repeating
>> +	 * full scans.
>> +	 */
>> +
>> +	/*
>> +	 * Pin the worker to the current CPU for the leader loop so that
>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>> +	 * the CPU that was cleared from the cpumask.  The workqueue is
>> +	 * WQ_UNBOUND, so without pinning, the scheduler could migrate
>> +	 * the worker between the cpumask manipulation and the leader
>> +	 * loop, causing the leader to run on a different CPU while
>> +	 * this_cpu's core is skipped entirely.
>> +	 *
>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>> +	 * migration while still allowing preemption.
>> +	 *
>> +	 * Note: rmpopt_cpumask is modified here without holding
>> +	 * rmpopt_wq_mutex.  This is safe because the delayed_work
>> +	 * mechanism guarantees single-threaded execution of this
>> +	 * handler, and rmpopt_cleanup() calls cancel_delayed_work_sync()
>> +	 * to ensure handler completion before tearing down the cpumask.
>> +	 */
>> +	migrate_disable();
>> +	this_cpu = smp_processor_id();
>> +	if (cpumask_test_cpu(this_cpu, &rmpopt_cpumask)) {
>> +		cpumask_clear_cpu(this_cpu, &rmpopt_cpumask);
>> +		current_cpu_cleared = true;
>> +	}
>> +
> 
> Instead of reusing the global rmpopt_cpumask, why not make a copy of
> rmpopt_cpumask for this function? Then this function won't have to
> figure out current_cpu_cleared or restore rmpopt_cpumask at the end.
> 
> I'm thinking to also drop the test and clear, this function can just
> always clear, like
> 
>   cpumask_clear_cpu(smp_processor_id(), followers_cpumask);
> 
> and later
> 
>   on_each_cpu_mask(&followers_cpumask, ...);

That's surely a much cleaner approach. Instead of modifying global
rmpopt_cpumask and using a local copy:

  cpumask_var_t follower_mask;
  alloc_cpumask_var(&follower_mask, GFP_KERNEL);
  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();
  cpumask_clear_cpu(this_cpu, follower_mask);  // modify local only

  // leader loop on this_cpu...
  migrate_enable();

  // follower loop with follower_mask...
  on_each_cpu_mask(follower_mask, rmpopt_smp, ...);

  free_cpumask_var(follower_mask);

This eliminates:
- current_cpu_cleared variable
- The restore at the end
  
Additionally, the global rmpopt_cpumask is never modified, so no concurrency concerns with debugfs or other readers.

> 
> Actually, if for whatever reason cpumask_test_cpu(this_cpu,
> &rmpopt_cpumask) above returns false, would that mean somehow some cpu
> exists that wasn't enabled right when rmpopt was initialized? 

The work handler can always execute on a cpu which is not in the rmpopt_cpumask, so i believe the
cpumask_test_cpu() needs to be there.

The leader loop must only run on a CPU that has RMPOPT_BASE MSR programmed. If the WQ_UNBOUND scheduler puts the
handler on a CPU not in rmpopt_cpumask, that CPU's core never had RMPOPT enabled -> RMPOPT instruction causes #UD.

  So the leader should be conditional:

  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();

  if (cpumask_test_cpu(this_cpu, follower_mask)) {
      cpumask_clear_cpu(this_cpu, follower_mask);

      /* Leader: prime the RMPOPT cache on this CPU */
      for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
          rmpopt(pa);
  }

  migrate_enable();

  /* Followers: run RMPOPT on remaining cores */
  cpus_read_lock();
  for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
      on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
      cond_resched();
  }
  cpus_read_unlock();

If the current CPU isn't in rmpopt_cpumask, the leader is skipped and all cores run as followers — they lose the caching
optimization from a leader priming pass, but correctness is maintained.

Alternatively, i could pick the first CPU from rmpopt_cpumask as the explicit leader instead of relying on whichever CPU the
scheduler chose.

 	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /* Fast path: leader runs locally, no IPIs */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU does not have RMPOPT_BASE MSR programmed.
                 * Pick an explicit leader from the cpumask to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

> If yes, what happens if we call rmpopt() on a cpu where it wasn't initialized?

That will cause a #UD exception, if RMPOPT instruction is issued on 
a CPU where RMPOPT is not enabled (RMPOPT_BASE.RMPOPT_EN==0), so
it is essential to issue RMPOPT instruction only on the cpumask (covers both
primary and secondary threads) which was setup initially when rmpopt was
initialized and on which the RMPOPT_BASE MSR was setup and RMPOPT enabled.

I believe, there are actually three cases to be considered here: 

  1. Current CPU is in rmpopt_cpumask -> primary thread, run leader locally, remove from followers
  2. Current CPU's sibling is in rmpopt_cpumask -> sibling thread, RMPOPT_BASE per-core is programmed, run leader locally, 
     remove the sibling's primary from the follower mask
  3. Neither -> new/unknown CPU, RMPOPT_BASE never programmed on this core, fall back to explicit leader via IPI.

So this seems to the *correct* implementation of the RMPOPT loop: 

   	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /*
                 * Current CPU is a primary thread in rmpopt_cpumask.
                 * Run leader locally and remove from follower mask.
                 */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
                                      follower_mask)) {
                /*
                 * Current CPU is a sibling thread whose primary is in
                 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
                 * is safe to run the leader locally.  Remove the sibling's
                 * primary from the follower mask as this core is already
                 * covered by the leader.
                 */
                cpumask_andnot(follower_mask, follower_mask,
                               topology_sibling_cpumask(this_cpu));

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU's core does not have RMPOPT_BASE MSR
                 * programmed.  Pick an explicit leader from the cpumask
                 * to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

Thanks,
Ashish

> 
>> +	/* Leader: prime the RMPOPT cache on this CPU */
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> +		rmpopt(pa);
>> +
>> +	migrate_enable();
>> +
>> +	/* Followers: run RMPOPT on all other cores */
>> +	cpus_read_lock();
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> +				 (void *)pa, true);
>> +
>> +		 /* Give a chance for other threads to run */
>> +		cond_resched();
>> +	}
>> +	cpus_read_unlock();
>> +
>> +	if (current_cpu_cleared)
>> +		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
>> +}
>> +
>>
>> [...snip...]
>>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox