Linux Confidential Computing Development


* Re: [PATCH v10 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Kalra, Ashish @ 2026-07-20 22:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260720221725.GDal6edYI3uHrDruHj@fat_crate.local>

Hello Boris,

On 7/20/2026 5:17 PM, Borislav Petkov wrote:
> On Tue, Jun 30, 2026 at 06:10:13PM +0000, Ashish Kalra wrote:
>> @@ -490,6 +494,11 @@ static bool __init setup_rmptable(void)
>>  	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
>>  		if (!setup_segmented_rmptable())
>>  			return false;
>> +		/*
>> +		 * RMPOPT requires a segmented RMP, so indicate that the
>> +		 * system is capable of configuring and running RMPOPT.
>> +		 */
>> +		rmpopt_capable = true;
> 
> So we're capable of doing RMPOPT when setup_segmented_rmptable() has
> succeeded. Which means, when rmp_segment_table is not NULL, i.e., when we have
> a segmented table.
> 
> Which means that instead of testing rmpopt_capable, you need to test
> CC_ATTR_HOST_SEV_SNP and rmp_segment_table != NULL and you can put that in
> a helper local to arch/x86/virt/svm/sev.c
> 
> Which means, you don't need that bool.
> 

Agreed on dropping the bool — I'll derive it in a local helper.

One issue though: rmp_segment_table != NULL isn't segmented-only.
setup_contiguous_rmptable() also allocates rmp_segment_table (the contiguous
RMP is stored as a single segment in the same table), so it's non-NULL for
the contiguous case too.

To keep it segmented-only, I'll also need to gate on the segmented-RMP mode,
something like:

  static bool rmpopt_capable(void)
  {
        return cpu_feature_enabled(X86_FEATURE_RMPOPT) &&
               cc_platform_has(CC_ATTR_HOST_SEV_SNP) &&
               (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) &&
               rmp_segment_table;
  }

The CC_ATTR_HOST_SEV_SNP check also handles SNP being disabled at runtime,
so snp_clear_rmpopt_capable() and its caller go away as well.
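
To illustrate why the pointer test alone is not enough, here is a minimal
userspace sketch, not kernel code. struct rmp_state, SEG_RMP_ENABLED and both
function names are hypothetical stand-ins for the kernel state and checks
named above, and the bit position used for the flag is an assumption made
only for this model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MSR_AMD64_SEG_RMP_ENABLED (bit value assumed). */
#define SEG_RMP_ENABLED (1ULL << 0)

/* Hypothetical snapshot of the state the helper would look at. */
struct rmp_state {
	bool has_rmpopt;           /* models cpu_feature_enabled(X86_FEATURE_RMPOPT) */
	bool host_snp;             /* models cc_platform_has(CC_ATTR_HOST_SEV_SNP) */
	uint64_t rmp_cfg;          /* models the cached RMP_CFG MSR value */
	const void *segment_table; /* models rmp_segment_table */
};

/* Pointer-only test: also true for a contiguous RMP stored as one segment. */
static bool capable_pointer_only(const struct rmp_state *s)
{
	return s->host_snp && s->segment_table != NULL;
}

/* Gated test: additionally requires the segmented-RMP mode bit. */
static bool capable_gated(const struct rmp_state *s)
{
	return s->has_rmpopt && s->host_snp &&
	       (s->rmp_cfg & SEG_RMP_ENABLED) && s->segment_table != NULL;
}
```

For a state with host_snp set, a non-NULL segment_table and rmp_cfg == 0
(the contiguous case), capable_pointer_only() returns true while
capable_gated() returns false, which is the distinction described above.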

Thanks,
Ashish


* Re: [PATCH v10 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Borislav Petkov @ 2026-07-20 22:17 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <8518e02c46d6edf1f37a180569c708a3cfa7c413.1782841284.git.ashish.kalra@amd.com>

On Tue, Jun 30, 2026 at 06:10:13PM +0000, Ashish Kalra wrote:
> @@ -490,6 +494,11 @@ static bool __init setup_rmptable(void)
>  	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
>  		if (!setup_segmented_rmptable())
>  			return false;
> +		/*
> +		 * RMPOPT requires a segmented RMP, so indicate that the
> +		 * system is capable of configuring and running RMPOPT.
> +		 */
> +		rmpopt_capable = true;

So we're capable of doing RMPOPT when setup_segmented_rmptable() has
succeeded. Which means, when rmp_segment_table is not NULL, i.e., when we have
a segmented table.

Which means that instead of testing rmpopt_capable, you need to test
CC_ATTR_HOST_SEV_SNP and rmp_segment_table != NULL and you can put that in
a helper local to arch/x86/virt/svm/sev.c

Which means, you don't need that bool.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v10 1/6] x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
From: Borislav Petkov @ 2026-07-20 21:12 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <39e9ee269a572c516a3f4e937bfe12d00697d5e6.1782841284.git.ashish.kalra@amd.com>

On Tue, Jun 30, 2026 at 06:09:44PM +0000, Ashish Kalra wrote:
>  arch/x86/include/asm/cpufeatures.h       | 2 +-
>  arch/x86/kernel/cpu/scattered.c          | 1 +
>  tools/arch/x86/include/asm/cpufeatures.h | 2 +-

Yeah, apparently we don't do the tools/ change anymore:

https://lore.kernel.org/r/0af5122c-20df-4aea-8ab4-cba63f71dc3b@amd.com

Final version:

Author: Ashish Kalra <ashish.kalra@amd.com>
Date:   Tue Jun 30 18:09:44 2026 +0000

    x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
    
    Add a flag indicating whether RMPOPT instruction is supported.
    
    RMPOPT is a new instruction that reduces the performance overhead of RMP
    checks for the hypervisor and non-SNP guests by allowing those checks to be
    skipped when 1-GB memory regions are known to contain no SEV-SNP guest memory.
    
    For more information on the RMPOPT instruction, see the AMD64 RMPOPT
    technical documentation.
    
      [ bp: Zap respective tools/ change. ]
    
    Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
    Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Ackerley Tng <ackerleytng@google.com>
    Link: https://patch.msgid.link/39e9ee269a572c516a3f4e937bfe12d00697d5e6.1782841284.git.ashish.kalra@amd.com

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 3d0940a3b9f3..dbccde9ee5cd 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 /* free: was #define X86_FEATURE_UP	( 3*32+ 9) * "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
+	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
 	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
 	{ 0, 0, 0, 0, 0 }
 };

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v10 0/6] Add RMPOPT support.
From: Kalra, Ashish @ 2026-07-20 20:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260720202845.GAal6E_dQjTa_vEO3y@fat_crate.local>

Hello Boris,

On 7/20/2026 3:28 PM, Borislav Petkov wrote:
> On Mon, Jul 20, 2026 at 03:17:23PM -0500, Kalra, Ashish wrote:
>> Just a gentle ping on this series. I believe v10 addresses the feedback from the earlier rounds. 
>> Please let me know if it looks good to merge, or if there's anything else you'd like me to change.
> 
> What about those Sashiko comments to it:
> 
> https://sashiko.dev/#/patchset/cover.1782841284.git.ashish.kalra%40amd.com
> 
> ?
> 

The three Sashiko comments on v10 all resolved to no code change, with the
rationale posted inline:

  1. Suspend/resume vs. hotplug disable (patch 3): this is pre-existing
     base-SNP behavior. An SNP host doesn't support suspend/resume anyway
     (SnpEn is set once before SNP_INIT and can't be rebuilt on resume), so
     the case can't arise.
  2. snp_setup_rmpopt() re-init pass (patch 4): the effect is
     performance-only and self-correcting, and it doesn't occur on
     RMPOPT-capable platforms (the x86_snp_shutdown path frees the
     workqueue, so re-init re-runs the pass). The guarded case is
     legacy-firmware-only.
  3. mod_delayed_work() and VM churn (patch 6): this is deliberate.
     queue_delayed_work() would regress the common case (scanning a guest
     before its page conversion completes), and the deferral is best-effort
     and self-limiting.
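
The difference between the two workqueue calls in item 3 can be sketched in
userspace. This is only a model, not kernel code: struct delayed_model,
model_queue() and model_mod() are hypothetical names, and time is a plain
counter in seconds. It relies on the documented behavior that
queue_delayed_work() does nothing when the work is already pending, while
mod_delayed_work() re-arms the timer.

```c
#include <stdbool.h>

/* Hypothetical model of a single delayed work item. */
struct delayed_model {
	bool pending;          /* is the work currently queued? */
	unsigned long expires; /* absolute time (seconds) at which it runs */
};

/* Models queue_delayed_work(): a no-op if the work is already pending. */
static bool model_queue(struct delayed_model *w, unsigned long now,
			unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/*
 * Models mod_delayed_work(): always (re)arms the timer. Returns true if
 * the work was already pending and only its timer was modified.
 */
static bool model_mod(struct delayed_model *w, unsigned long now,
		      unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}
```

With a 10 second delay and guest terminations at t=0 and t=8, the
model_queue() variant keeps expires at 10, so the scan would start 2 seconds
after the second termination. The model_mod() variant moves expires to 18,
so the delay tracks the last termination.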

Thanks,
Ashish


* Re: [PATCH v10 0/6] Add RMPOPT support.
From: Borislav Petkov @ 2026-07-20 20:28 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <aad54bd5-8a5e-418d-8cb2-2cec6527993b@amd.com>

On Mon, Jul 20, 2026 at 03:17:23PM -0500, Kalra, Ashish wrote:
> Just a gentle ping on this series. I believe v10 addresses the feedback from the earlier rounds. 
> Please let me know if it looks good to merge, or if there's anything else you'd like me to change.

What about those Sashiko comments to it:

https://sashiko.dev/#/patchset/cover.1782841284.git.ashish.kalra%40amd.com

?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v10 0/6] Add RMPOPT support.
From: Kalra, Ashish @ 2026-07-20 20:17 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782841284.git.ashish.kalra@amd.com>

Hi Boris, Dave,

Just a gentle ping on this series. I believe v10 addresses the feedback from the earlier rounds. 
Please let me know if it looks good to merge, or if there's anything else you'd like me to change.

Thanks,
Ashish

On 6/30/2026 1:08 PM, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
> 
> In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
> to RMP checks on writes to provide integrity of SEV-SNP guest memory.
> 
> The RMPOPT architecture enables optimizations whereby the RMP checks
> can be skipped if 1GB regions of memory are known to not contain any
> SNP guest memory.
> 
> RMPOPT is a new instruction designed to minimize the performance
> overhead of RMP checks for the hypervisor and non-SNP guests.
> 
> The RMPOPT instruction currently supports two functions. For the
> verify-and-report-status function, the CPU reads the RMP contents and
> verifies that the entire 1GB region starting at the provided SPA is
> HV-owned, i.e., that no RMP entry in the region is in the assigned state.
> It then updates the RMPOPT table to indicate whether the optimization has
> been enabled and tells software whether the optimization was successful.
> 
> For the report-status function, the CPU returns the optimization
> status for the 1GB region.
> 
> The RMPOPT table is managed by a combination of software and hardware.
> Software uses the RMPOPT instruction to set bits in the table,
> indicating that regions of memory are entirely HV-owned.  Hardware
> automatically clears bits in the RMPOPT table when RMP contents are
> changed during RMPUPDATE instruction.
> 
> For more information on the RMPOPT instruction, see the AMD64 RMPOPT
> technical documentation.
> 
> As SNP is enabled by default the hypervisor and non-SNP guests are
> subject to RMP write checks to provide integrity of SNP guest memory.
> 
> This patch-series adds support to enable RMP optimizations for up to
> 2TB of system RAM across the system and allow RMPUPDATE to disable
> those optimizations as SNP guests are launched.
> 
> Support for RAM larger than 2 TB will be added in follow-on series.
> 
> This series also adds support to disable CPU hotplug while SNP is
> active, as the SEV firmware enumerates CPUs at SNP initialization and is
> not aware of the OS bringing CPUs online or offline afterwards.  This
> also keeps the set of CPUs stable for the asynchronous RMPOPT scan, so
> the per-core RMPOPT_BASE MSRs programmed during setup remain valid.
> 
> This series also introduces support to re-enable RMP optimizations
> during SNP guest termination, after guest pages have been converted
> back to shared.
> 
> RMP optimizations are performed asynchronously by queuing work on a
> dedicated workqueue after a 10 second delay.
> 
> Delaying work allows batching of multiple SNP guest terminations.
> 
> Once 1GB hugetlb guest_memfd support is merged, support for
> re-enabling RMPOPT optimizations during 1GB page cleanup will be added
> in follow-on series.
> 
> v10:
> - Rework the CPU-hotplug patch (3/6): disable CPU hotplug in
>   snp_prepare(), before SnpEn is set, instead of late in
>   __sev_snp_init_locked(), so no CPU can come online without SnpEn during
>   SNP initialization (per upstream review).  Tie hotplug to SnpEn: it
>   stays disabled while SnpEn is set -- including across a failed SNP_INIT
>   and across the legacy SNP_SHUTDOWN_EX path -- and is re-enabled only
>   once the firmware clears SnpEn on the x86_snp_shutdown path.  Drop the
>   separate idempotent flag: snp_prepare() re-enables hotplug on its own
>   early failure, and a kexec target that boots with SnpEn already set
>   disables hotplug once in snp_rmptable_init().  Reword the commit log and
>   comments accordingly.
> - Emit a pr_warn() in rmpopt_work_handler() (4/6) when the follower
>   cpumask allocation fails, instead of silently skipping the optimization
>   pass.
> 
>   Sashiko AI upstream review identified several of the above issues.
> 
> v9:
> - Rename rmpopt_configured to rmpopt_capable.
> - Make rmpopt_cpumask a cpumask_var_t (allocated/freed at setup/cleanup)
>   instead of a static cpumask_t.
> - Drop the v8 WARN_ON_ONCE() on the RMPOPT_BASE writes; use a plain
>   wrmsrq_on_cpu(), matching the SNP MSR-write convention in this file.
> - Disable CPU hotplug with cpu_hotplug_disable()/cpu_hotplug_enable()
>   (per tglx); re-enable only on the full x86_snp_shutdown path.
> - Simplify rmpopt_work_handler() to a single leader-then-followers path:
>   with CPU hotplug disabled while SNP is active and snp_prepare()
>   requiring all CPUs online when RMPOPT_BASE is programmed, every core is
>   always programmed, so the explicit-leader fallback is now unreachable.
>   Drop it along with the v8 work_on_cpu()/rmpopt_leader_fn() helper.
> - Drop the debugfs interface (was patch 7/7) and its report-only
>   plumbing; observability will be revisited after this series is merged.
> - Restrict snp_rmpopt_all_physmem()'s export to the kvm-amd module.
> - Use scoped_guard(cpus_read_lock) for the per-CPU MSR and follower
>   loops.
> 
>   Sashiko AI upstream review identified several of the above issues.
> 
> v8:
> - Add a new patch to disable CPU hotplug while SNP is active, keeping
>   the CPU set stable for the RMPOPT work handler.
> - Drop the setup_clear_cpu_cap(X86_FEATURE_RMPOPT) calls; the
>   rmpopt_configured bool is the runtime guard.
> - WARN_ON_ONCE() on the RMPOPT_BASE MSR writes that previously ignored
>   their return value.
> - Simplify rmpopt_work_handler() by removing the explicit-leader
>   fallback: with CPU hotplug disabled while SNP is active and
>   snp_prepare() requiring all CPUs online when RMPOPT_BASE is programmed,
>   every core is always programmed, so the running CPU can always be the
>   leader.  This drops the smp_call_function_single() fallback (and with
>   it the AB-BA deadlock and IRQ-latency concerns) and collapses the
>   leader selection into a single leader-then-followers path.
> - Use mod_delayed_work() in snp_rmpopt_all_physmem() so the batching
>   delay tracks the last SNP guest termination.
> 
>   Sashiko AI code review identified several of the above issues.
> 
> v7:
> - Sync tools/arch/x86/include/asm/cpufeatures.h to mirror the kernel
>   header for X86_FEATURE_RMPOPT.
> - Fix commit title to use X86_FEATURE_RMPOPT to match the code
>   (was X86_FEATURE_AMD_RMPOPT).
> - Add static bool rmpopt_configured, set only when segmented RMP setup
>   succeeds in setup_rmptable().  Check rmpopt_configured alongside
>   cpu_feature_enabled(X86_FEATURE_RMPOPT) in snp_setup_rmpopt() and
>   snp_rmpopt_all_physmem(), because setup_clear_cpu_cap() is unreliable
>   after alternatives are patched.  Add snp_clear_rmpopt_configured()
>   called from amd_cc_platform_clear() when CC_ATTR_HOST_SEV_SNP is
>   cleared.  Do not use __ro_after_init on rmpopt_configured since the
>   writer snp_clear_rmpopt_configured() is not __init.
> - Add cond_resched() to all three leader loops in rmpopt_work_handler()
>   to prevent soft lockups on systems with up to 2TB of RAM.
> - Add comment above __rmpopt() documenting the RMPOPT instruction
>   encoding (F2 0F 01 FC) and register interface (RAX = system physical
>   address input, RCX = operation type input, RFLAGS.CF = output).
>   Note: RMPOPT does not modify RAX unlike PVALIDATE/RMPUPDATE, so
>   the existing "a" (input-only) constraint is correct.
> 
>   Sashiko AI code review identified several of the above issues.
> 
> v6:
> - Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
>   instead, as RMPOPT_BASE MSR programming is not performance-critical.
> - Rewrite rmpopt_work_handler() leader selection to use a local
>   follower_mask copy instead of modifying the global rmpopt_cpumask.
>   This eliminates the current_cpu_cleared tracking and the restore at
>   the end, and removes the need for synchronization comments about
>   transient cpumask inconsistency.
> - Add three-way leader selection in rmpopt_work_handler():
>   1. Current CPU is a primary thread in cpumask: run leader locally.
>   2. Current CPU is a sibling thread whose primary is in cpumask:
>      run leader locally (RMPOPT_BASE MSR is per-core), remove the
>      primary from followers via cpumask_andnot(topology_sibling_cpumask).
>   3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
>      explicit leader via cpumask_first() + smp_call_function_single()
>      to avoid #UD, with cpus_read_lock() around the IPI loop.
> - Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
>   fallback path, with migrate_enable() before goto out.
> - Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
>   other seq_file-based debugfs files and to support tools like "less".
> - Change debugfs file permissions from 0444 to 0400 to restrict access
>   to root only.
> - Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
>   is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
>   CPUs are online when the MSR is programmed.
> 
>   Sashiko AI code review identified several of the above issues.
> 
> v5:
> - Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
>   and MSR state, called from snp_shutdown().
> - Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
>   snp_rmpopt_all_physmem(), and rmpopt_cleanup().
> - Introduce rmpopt_show_mutex to serialize debugfs reporting of
>   rmpopt_report_cpumask.
> - Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
>   guest shutdown.
> - Use migrate_disable()/migrate_enable() for CPU pinning in the
>   rmpopt_work_handler() leader loop to maintain CPU affinity without
>   disabling preemption for the entire RMPOPT scan.
> - Add cpus_read_lock()/cpus_read_unlock() around the follower
>   on_each_cpu_mask() loop in rmpopt_work_handler().
> - Guard snp_setup_rmpopt() against re-initialization when
>   SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
>   but clears snp_initialized, preventing workqueue and resource
>   leaks on repeated init/shutdown cycles.
> - Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
>   failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
>   used after alternatives are patched; callers check rmpopt_wq != NULL
>   as the runtime guard instead.
> - Add pr_info() when RMPOPT coverage is capped at 2TB.
> - Add comments noting CPU hotplug is not supported with SNP enabled
>   and only online primary threads are covered by rmpopt_cpumask.
> - Add comment in setup_rmptable() noting Segmented RMP must be
>   enabled to enable RMPOPT.
> - Simplify cpumask setup loop to set if primary thread rather than
>   skip if not primary.
> - Improve grammar and clarity in snp_setup_rmpopt() comments.
> - Added Reviewed-by's.
> 
>   Sashiko AI code review identified several of the above issues.
> 
> v4:
> - Add new wrmsrq_on_cpus() helper to write same u64 value to a
>   per-CPU MSR across a cpumask without per-cpu struct allocation
>   overhead.
> - Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
> - Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
>   programming RMPOPT_BASE MSRs.
> - Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
>   setup fails or workqueue allocation fails.
> - Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
>   for CC_ATTR_HOST_SEV_SNP.
> - All of the above allow checking for only X86_FEATURE_RMPOPT for both
>   RMPOPT setup/enable and RMP re-optimizations.
> - Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
> - Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
> - Introduce separate rmpopt_report_cpumask for debugfs reporting,
>   distinct from rmpopt_cpumask used for primary thread tracking.
> - Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
>   and instead setup and enable RMPOPT after SNP is enabled and
>   initialized.
> 
> v3:
> - Drop all RMPOPT kthread support and introduce adding custom and
>   dedicated workqueue to schedule delayed and asynchronous RMPOPT work.
> - Drop the guest_memfd inode cleanup interface and add support to
>   re-enable RMP optimizations during guest shutdown using the
>   asynchronous and delayed workqueue interface.
> - Introduce new __rmpopt() helper and rmpopt() and
>   rmpopt_report_status() wrappers on top which use rax and rcx
>   parameters to closely match RMPOPT specs.
> - Use new optimized RMPOPT loop to issue RMPOPT instructions on all
>   system RAM up to 2TB and all CPUs, by optimizing each range on one CPU
>   first, then let other CPUs execute RMPOPT in parallel so they can skip
>   most work as the range has already been optimized.
> - Also add support for running the optimized RMPOPT loop only on
>   one thread per core.
> - Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
>   as specified by RMPOPT specifications and not be dependent on PUD_SIZE
>   which makes the RMPOPT patch-set independent of x86 page table sizes.
> - Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
>   all CPUs that removes all ugly casting to use on_each_cpu_mask().
> - Fix inline commits and patch commit messages
> 
> 
> v2:
> - Drop all NUMA and Socket configuration and enablement support and
>   enable RMPOPT support for up to 2TB of system RAM.
> - Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
>   base MSRs and issue RMPOPT instruction on all CPUs.
> - Drop the configfs interface to manually re-enable RMP optimizations.
> - Add new guest_memfd cleanup interface to automatically re-enable
>   RMP optimizations during guest shutdown.
> - Include references to the public RMPOPT documentation.
> - Move debugfs directory for RMPOPT under architecture specific
>   parent directory.
> 
> Ashish Kalra (6):
>   x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
>   x86/sev: Initialize RMPOPT configuration MSRs
>   x86/sev: Disable CPU hotplug while SNP is active
>   x86/sev: Add support to perform RMP optimizations asynchronously
>   x86/sev: Add interface to re-enable RMP optimizations.
>   KVM: SEV: Perform RMP optimizations on SNP guest shutdown
> 
>  arch/x86/coco/core.c                     |   2 +
>  arch/x86/include/asm/cpufeatures.h       |   2 +-
>  arch/x86/include/asm/msr-index.h         |   3 +
>  arch/x86/include/asm/sev.h               |   6 +
>  arch/x86/kernel/cpu/scattered.c          |   1 +
>  arch/x86/kvm/svm/sev.c                   |  10 +
>  arch/x86/virt/svm/sev.c                  | 277 +++++++++++++++++++++++
>  drivers/crypto/ccp/sev-dev.c             |   3 +
>  tools/arch/x86/include/asm/cpufeatures.h |   2 +-
>  9 files changed, 304 insertions(+), 2 deletions(-)
> 


* Re: [PATCH v7 09/11] x86/virt/tdx: Enable Dynamic PAMT
From: Edgecombe, Rick P @ 2026-07-20 18:34 UTC (permalink / raw)
  To: sashiko-reviews@lists.linux.dev
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Huang, Kai,
	Hansen, Dave, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	Wu, Binbin, kas@kernel.org, seanjc@google.com, mingo@redhat.com,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-doc@vger.kernel.org, hpa@zytor.com,
	tglx@kernel.org, Annapurve, Vishal, bp@alien8.de, Gao, Chao,
	x86@kernel.org
In-Reply-To: <20260718015627.21D9F1F000E9@smtp.kernel.org>

On Sat, 2026-07-18 at 01:56 +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Low] The global variable `tdx_enable_dpamt` is defined without the `static` keyword, polluting the global namespace.
> --
> 
> --- Patch [9]: [PATCH v7 09/11] x86/virt/tdx: Enable Dynamic PAMT ---
> commit a5e057035573a191a887572dacd25f3faa79112d
> Author: Kiryl Shutsemau <kas@kernel.org>
> 
> x86/virt/tdx: Enable Dynamic PAMT
> 
> This commit enables Dynamic PAMT for TDX modules to reduce memory usage
> during module initialization. It introduces a kernel command line parameter
> "tdx_dpamt" to allow users to opt-in to this feature when supported by the
> hardware.
> 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index e31b62f36300d..69834b0d3e602 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -47,6 +47,8 @@
> >  #include "seamcall_internal.h"
> >  #include "tdx.h"
> >  
> > +bool tdx_enable_dpamt __ro_after_init;
> 
> [Severity: Low]
> Since tdx_enable_dpamt is only used within this file, should it be marked
> as static? Leaving it globally exposed can trigger compiler warnings like
> -Wmissing-declarations when building with W=1 and unnecessarily pollutes
> the global namespace.

Sigh. Yes this is valid. I forgot to make it static after deciding to move
tdx_supports_dynamic_pamt() out of the header instead. Then it didn't need to
reference out of this file. Doh.

Seems like maybe something that could be fixed up when applying, if the rest
of the series looks ok?


* Re: [PATCH v7 11/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Edgecombe, Rick P @ 2026-07-20 18:33 UTC (permalink / raw)
  To: tony.lindgren@linux.intel.com, linux-coco@lists.linux.dev,
	Huang, Kai, Hansen, Dave, Zhao, Yan Y, Wu, Binbin, kas@kernel.org,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, sashiko-reviews@lists.linux.dev,
	hpa@zytor.com, tglx@kernel.org, Annapurve, Vishal, bp@alien8.de,
	Gao, Chao, x86@kernel.org
  Cc: kvm@vger.kernel.org
In-Reply-To: <20260718020053.C2FC81F000E9@smtp.kernel.org>

On Sat, 2026-07-18 at 02:00 +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Using `atomic_dec_and_lock` in `tdx_pamt_put` lacks underflow protection, causing an unbalanced put (refcount 0) to silently underflow the refcount to -1, which breaks future `tdx_pamt_get` calls for that PFN.

The report is essentially that this patch doesn't handle unbalanced
tdx_pamt_put() calls (a tdx_pamt_put() without a previous paired
tdx_pamt_get()). So this isn't a bug per se, but it is true that the
optimization patch gives somewhat worse warnings in the case of buggy calling
code.

More analysis... The function is:

int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
{
	/* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */
	if (atomic_add_unless(atomic, -1, 1))
		return 0;

	/* Otherwise do it the slow way */
	spin_lock(lock);
	if (atomic_dec_and_test(atomic))
		return 1;
	spin_unlock(lock);
	return 0;
}

In the fast path atomic_add_unless() subtracts 1 unless the counter was exactly
1. So if it was 0 (no paired get has been called), the counter would underflow.

In the non-optimized patch, the put code was more robust to this scenario. In
that code if a put was done with the refcount at 0, the put would see it was
less than 1 and try to remove it. The result would be a warning from the
SEAMCALL failure.

So this patch trades some warnings for performance. But the warning loss is
pretty small. Future buggy code is unlikely to get away without a warning.
A missing PAMT is not like a double free or a bug like that. With the count
unbalanced, the next tdx_pamt_get() will fail to add the Dynamic PAMT page pair
and the TDX module will return an error when it tries to use the page.
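
The unbalanced-put sequence can be reproduced with a small userspace model
built on C11 atomics. This is a sketch and not the kernel implementation:
model_add_unless(), model_put_fast_path() and model_inc_not_zero() are
hypothetical names that only mimic the semantics of atomic_add_unless(), the
fast path of atomic_dec_and_lock(), and atomic_inc_not_zero().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models atomic_add_unless(v, a, u): add a unless *v == u; true if added. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Fast path of atomic_dec_and_lock(): decrement unless the count is 1. */
static bool model_put_fast_path(atomic_int *refcount)
{
	return model_add_unless(refcount, -1, 1);
}

/* Models atomic_inc_not_zero(): increment unless the count is 0. */
static bool model_inc_not_zero(atomic_int *refcount)
{
	return model_add_unless(refcount, 1, 0);
}
```

Starting from a count of 0, model_put_fast_path() succeeds and leaves the
count at -1. A following model_inc_not_zero() then also succeeds and brings
the count back to 0, which in this model corresponds to a get that returns
early without the PAMT pages ever being added.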

Getting a warning exactly at the point of the unpaired put would be better. But
this version has other clear benefits. On balance, I'd prefer this one. It might
be possible to utilize refcount_t to get some better underflow warnings, but
probably for the cost of more memory ordering things to consider.

Hmm, another cheap, but not complete option would be to just warn on get/put out
paths if any strange refcounts are seen. Since the point is just to catch the
error closer to where it happens. It's just developer help. Simpler code with
less well timed warnings seems like a good tradeoff for developers?

> --
> 
> --- Patch [11]: [PATCH v7 11/11] x86/virt/tdx: Optimize tdx_pamt_get/put() ---
> commit f3bde8ccf9659f8b1e1cedb9b81df0b3b209a9fa
> Author: Kiryl Shutsemau <kas@kernel.org>
> 
> x86/virt/tdx: Optimize tdx_pamt_get/put()
> 
> The Dynamic PAMT get/put helpers use a global spinlock to serialize all
> refcount updates and SEAMCALL invocations. This gives correct behavior for
> concurrent callers, but leads to contention.
> 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 69834b0d3e602..4394bba7c9a7c 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> 
> [ ... ]
> 
> > @@ -2214,17 +2227,13 @@ void tdx_pamt_put(kvm_pfn_t pfn)
> >  
> >  	pamt_refcount = tdx_find_pamt_refcount(pfn);
> >  
> > -	spin_lock(&pamt_lock);
> >  	/*
> >  	 * If there is more than 1 reference on the pamt page, don't
> >  	 * remove it yet. Just decrement the refcount.
> >  	 */
> > -	if (atomic_read(pamt_refcount) > 1) {
> > -		atomic_dec(pamt_refcount);
> > -		goto out_unlock;
> > -	}
> > +	if (!atomic_dec_and_lock(pamt_refcount, &pamt_lock))
> > +		return;
> 
> [Severity: Medium]
> If an unbalanced put occurs on an un-added PAMT page (where the refcount
> is 0), does atomic_dec_and_lock() lack underflow protection and silently
> drop the refcount to -1 instead of proceeding to the SEAMCALL failure
> warning as the previous logic did?
> 
> lib/dec_and_lock.c:atomic_dec_and_lock() {
>     /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */
>     if (atomic_add_unless(atomic, -1, 1))
>         return 0;
>     ...
> }
> 
> If the refcount silently becomes -1, would a subsequent call to
> tdx_pamt_get() for this PFN incorrectly succeed without adding the PAMT
> memory?
> 
> arch/x86/virt/vmx/tdx/tdx.c:tdx_pamt_get() {
>     ...
>     if (atomic_inc_not_zero(pamt_refcount))
>         return 0;
>     ...
> }
> 
> Since -1 is not zero, atomic_inc_not_zero() would increment the refcount
> to 0 and return true, bypassing the tdh_phymem_pamt_add() SEAMCALL.
> 


* Re: [PATCH v7 08/11] KVM: TDX: Get/put PAMT pages when (un)mapping private memory
From: Edgecombe, Rick P @ 2026-07-20 16:48 UTC (permalink / raw)
  To: sashiko-reviews@lists.linux.dev
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Huang, Kai,
	Hansen, Dave, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	Wu, Binbin, kas@kernel.org, seanjc@google.com, mingo@redhat.com,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-doc@vger.kernel.org, hpa@zytor.com,
	tglx@kernel.org, Annapurve, Vishal, bp@alien8.de, Gao, Chao,
	x86@kernel.org
In-Reply-To: <20260718061050.E17B01F000E9@smtp.kernel.org>

On Sat, 2026-07-18 at 06:10 +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Infinite kernel loop in `kvm_tdp_mmu_map_private_pfn` due to permanent PAMT cache depletion on transient TDX module contention.
> --

Our internal Sashiko found this too. It's a false positive, not a real bug.

Today kvm_tdp_mmu_map_private_pfn() is only called from tdx_gmem_post_populate()
during TD setup. It holds the heavyweight tdx_vm_state_guard, which grabs
kvm->lock, kvm->slots_lock, and all vcpu->mutex. So there should be no
contention possible.

Any potential confusion is not new either, because a similar thing could happen
with the external page tables.

But Yan and I were discussing that it would be a good cleanup to fix this anyway
because the reason it is not a functional issue is not clear from the code. For
improved readability (and quieter sashiko reports) the topup can happen inside
the retry loop, either by moving the retry loop or by moving the topup.
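
For illustration, the loop shape can be modeled in plain user-space C. This is
a toy model under stated assumptions, not the KVM code: the model_*() names and
MODEL_CACHE_MIN are invented, "busy" is simulated with a counter, and pages
taken from the cache on a busy attempt are simply treated as gone rather than
returned to the cache.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CACHE_MIN 2	/* hypothetical: pages one map attempt consumes */

struct model_cache {
	int nr_pages;
};

/* Refill the cache to its minimum, as a topup before mapping would. */
static void model_topup(struct model_cache *c)
{
	if (c->nr_pages < MODEL_CACHE_MIN)
		c->nr_pages = MODEL_CACHE_MIN;
}

/*
 * One map attempt. Pages are taken from the cache; on a "busy" result they
 * are released elsewhere and do not go back into the cache.
 * Returns 0 on success, 1 for "retry".
 */
static int model_map(struct model_cache *c, int *busy_left)
{
	if (c->nr_pages < MODEL_CACHE_MIN)
		return 1;		/* cache empty: retry */

	c->nr_pages -= MODEL_CACHE_MIN;
	if (*busy_left > 0) {
		(*busy_left)--;
		return 1;		/* transient contention: retry */
	}
	return 0;
}

/* Returns the number of attempts used, or -1 if max_tries ran out. */
static int model_map_with_retry(int busy, bool topup_in_loop, int max_tries)
{
	struct model_cache c = { 0 };
	int tries;

	model_topup(&c);
	for (tries = 1; tries <= max_tries; tries++) {
		if (topup_in_loop)
			model_topup(&c);
		if (!model_map(&c, &busy))
			return tries;
	}
	return -1;
}
```

With the topup only outside the loop, a single simulated busy result leaves the
model's cache empty and every later attempt retries. With the topup inside the
loop, the next attempt succeeds. In the real code the locking described above
is what rules out the busy case, which the model does not capture.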

> 
> --- Patch [8]: [PATCH v7 08/11] KVM: TDX: Get/put PAMT pages when (un)mapping private memory ---
> commit 3474807f1c1bc4dedc75e2a756b65254c986a41d
> Author: Kiryl Shutsemau <kas@kernel.org>
> 
> KVM: TDX: Get/put PAMT pages when (un)mapping private memory
>     
> Add Dynamic PAMT support to KVM's S-EPT MMU by "getting" a PAMT page when
> adding guest memory (PAGE.ADD or PAGE.AUG), and "putting" the page when
> removing guest memory (PAGE.REMOVE).
> 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 528ebb3cb0389..049da8559882b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1679,16 +1693,28 @@ static struct page *tdx_spte_to_sept_pt(struct kvm *kvm, gfn_t gfn,
> >  static int tdx_sept_map_nonleaf_spte(struct kvm *kvm, gfn_t gfn,
> >  				     enum pg_level level, u64 new_spte)
> >  {
> > +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 err, entry, level_state;
> >  	struct page *sept_pt;
> > +	int ret;
> > +
> > +	if (KVM_BUG_ON(!vcpu, kvm))
> > +		return -EIO;
> >  
> >  	sept_pt = tdx_spte_to_sept_pt(kvm, gfn, new_spte, level);
> >  	if (!sept_pt)
> >  		return -EIO;
> >  
> > +	ret = tdx_pamt_get(page_to_pfn(sept_pt), &to_tdx(vcpu)->pamt_cache);
> > +	if (ret)
> > +		return ret;
> > +
> >  	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, level, sept_pt,
> >  			       &entry, &level_state);
> > +	if (err)
> > +		tdx_pamt_put(page_to_pfn(sept_pt));
> > +
> >  	if (unlikely(tdx_operand_busy(err)))
> >  		return -EBUSY;
> 
> [Severity: High]
> Does this error path risk causing an infinite loop if the TDX module
> contention leads to TDX_OPERAND_BUSY?
> 
> When tdh_mem_sept_add() returns TDX_OPERAND_BUSY, tdx_pamt_put() frees the
> PAMT pages directly back to the buddy allocator, not back into the vCPU's
> PAMT cache.
> 
> If this happens during VM initialization in a retry loop like
> kvm_tdp_mmu_map_private_pfn(), the vCPU memory caches are topped up outside
> of the loop:
> 
>     r = mmu_topup_memory_caches(vcpu, false);
>     if (r)
>         return r;
> 
>     do {
>         ...
>         r = kvm_tdp_mmu_map(vcpu, &fault);
>     } while (r == RET_PF_RETRY);
> 
> Since the PAMT pages are discarded rather than returned to the cache, a few
> busy retries will completely empty the vCPU's PAMT cache.
> 
> On the next retry, tdx_pamt_get() will return -ENOMEM because the cache is
> empty and it does not fall back to alloc_page() when a cache is provided.
> 
> The TDP MMU handles -ENOMEM by returning RET_PF_RETRY, but because the cache
> is never replenished inside the loop, the loop will infinitely fail and retry.
> 
> Could the PAMT pages be returned to the vCPU cache instead of the buddy
> allocator during transient failures, or should the cache top-up be moved
> inside the retry loop?
> 



* Re: [PATCH v2 2/2] virt: tdx-guest: Allocate Quote buffer dynamically
From: Dave Hansen @ 2026-07-20 13:35 UTC (permalink / raw)
  To: Peter Fang, Dave Hansen, Kiryl Shutsemau, Rick Edgecombe,
	Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, linux-coco, kvm, Xiaoyao Li,
	Binbin Wu
In-Reply-To: <20260717214349.4075994-3-peter.fang@intel.com>

On 7/17/26 14:43, Peter Fang wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> The TDX attestation driver currently uses a fixed 128 KB Quote buffer
> shared with the host VMM. This may be too small for Quotes using schemes
> such as post-quantum cryptography (PQC), where larger certificate chains
> can increase the Quote size significantly.
> 
> Allocate the Quote buffer based on the size reported by the TDX module
> instead of always reserving a fixed-size buffer. This avoids wasting
> memory on platforms that do not require larger Quotes. Older platforms
> fall back to the default 128 KB buffer.
> 
> Because the Quote buffer must be physically contiguous, its size is
> bound by the buddy allocator's maximum page order (4 MB), which should
> be sufficient for current attestation needs.

This is all talking about post-quantum-crypto and all that fancy stuff.

Isn't the important part here that the old TDX module ABI had static
quote sizes and now they're dynamic? Now, the reason it changed is all
the fancy stuff.

But the ABI changed. Right?

> -static void *alloc_quote_buf(void)
> +static size_t get_quote_buf_size(void)
>  {
> -	size_t len = PAGE_ALIGN(GET_QUOTE_BUF_SIZE);
> -	unsigned int count = len >> PAGE_SHIFT;
> +	size_t buf_size = GET_QUOTE_DEFAULT_BUF_SIZE;
> +	u32 quote_size;
> +
> +	quote_size = tdx_get_max_quote_size();
> +
> +	if (quote_size)
> +		/* Reported size does not include GetQuote header */
> +		buf_size = TDX_QUOTE_BUF_LEN(quote_size);
> +
> +	return PAGE_ALIGN(buf_size);
> +}

This code is almost nonsensical on the surface.

It _really_ needs some commenting. Things like:

	/* Start with the default quote buffer size: */

...

	/* Override the default when ... */


You could even comment the function to say what it is trying to do overall.
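
As a rough idea of what that level of commenting could look like, here is a
user-space sketch of the same logic. The MODEL_* names are invented, the 24-byte
header length is an assumption made only so the example compiles, and the
comment wording is a guess at what might fit, not a proposal for the exact text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096u
#define MODEL_PAGE_ALIGN(x) \
	(((x) + MODEL_PAGE_SIZE - 1) & ~((size_t)MODEL_PAGE_SIZE - 1))
#define MODEL_DEFAULT_BUF_SIZE	(128u * 1024)	/* 128 KB default from the changelog */
#define MODEL_QUOTE_HDR_LEN	24u		/* assumed header size, illustration only */

/*
 * Return the page-aligned size of the shared Quote buffer.
 *
 * @reported_quote_size: maximum Quote size reported by the TDX module,
 *                       or 0 if the module does not report one.
 */
static size_t model_quote_buf_size(uint32_t reported_quote_size)
{
	/* Start with the default quote buffer size: */
	size_t buf_size = MODEL_DEFAULT_BUF_SIZE;

	/*
	 * Override the default when the module reports a maximum Quote size.
	 * The reported size excludes the GetQuote header, so add it back.
	 */
	if (reported_quote_size)
		buf_size = (size_t)reported_quote_size + MODEL_QUOTE_HDR_LEN;

	return MODEL_PAGE_ALIGN(buf_size);
}
```

In this model a reported size of 0 yields the 128 KB default, and any non-zero
reported size is rounded up to a whole number of pages after the header is
added.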

> +static void *alloc_quote_buf(size_t *buflen)
> +{
> +	unsigned int count;
> +	size_t len;
>  	void *addr;
>  
> -	addr = alloc_pages_exact(len, GFP_KERNEL | __GFP_ZERO);
> +	len = get_quote_buf_size();
> +
> +	/*
> +	 * This fails if the requested size exceeds the buddy allocator's
> +	 * maximum order. Use __GFP_NOWARN since the size comes from the host
> +	 * and should fail quietly rather than warn.
> +	 */
> +	addr = alloc_pages_exact(len, GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN);

Bad Sashiko. Bad.

The host may be untrusted, but it's also a critical part of the system.
Are we sure we want to be completely quiet?

I used to see little dmesg warnings about TCP window shenanigans from
random systems on the Internet. Maybe that's not how we do things today,
but if a random dude on the Internet can spew one line to dmesg, is it
that crazy that a bad VMM should be able to spew a warning?

>  	if (!addr)
>  		return NULL;
>  
> +	count = len >> PAGE_SHIFT;
> +
>  	if (set_memory_decrypted((unsigned long)addr, count))
>  		return NULL;
>  
> +	*buflen = len;
> +
>  	return addr;
>  }

This feels weird to me.

If the upper-layer function needs to know the size, why not have it just
call get_quote_buf_size()? Then there's no pass-by-address.
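
A sketch of that shape, in user-space C so it can be compiled on its own. All
model_*() names are hypothetical, calloc()/free() stand in for the page
allocator, and the size function returns a fixed value purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for get_quote_buf_size(); fixed value for illustration. */
static size_t model_get_quote_buf_size(void)
{
	return 128u * 1024;
}

/* The allocator takes the length as a plain input; nothing comes back by address. */
static void *model_alloc_quote_buf(size_t len)
{
	return calloc(1, len);	/* zeroed, similar in spirit to __GFP_ZERO */
}

static void model_free_quote_buf(void *buf, size_t len)
{
	(void)len;	/* kept to mirror free_quote_buf(buf, len); free() ignores it */
	free(buf);
}

struct model_quote_state {
	void *data;
	size_t len;
};

/* Init path: the caller computes the size once and owns it. Returns 0 or -1. */
static int model_guest_init(struct model_quote_state *st)
{
	st->len = model_get_quote_buf_size();
	st->data = model_alloc_quote_buf(st->len);
	if (!st->data) {
		st->len = 0;
		return -1;
	}
	return 0;
}
```

The upper layer ends up with both the pointer and the length without the
allocator needing an output parameter.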

> @@ -285,7 +310,7 @@ static int tdx_report_new_locked(struct tsm_report *report, void *data)
>  	if (desc->inblob_len != TDX_REPORTDATA_LEN)
>  		return -EINVAL;
>  
> -	memset(quote_data, 0, GET_QUOTE_BUF_SIZE);
> +	memset(quote_data, 0, quote_data_len);
>  
>  	/* Update Quote buffer header */
>  	quote_buf->version = GET_QUOTE_CMD_VER;
> @@ -296,7 +321,7 @@ static int tdx_report_new_locked(struct tsm_report *report, void *data)
>  	if (ret)
>  		return ret;
>  
> -	err = tdx_hcall_get_quote(quote_data, GET_QUOTE_BUF_SIZE);
> +	err = tdx_hcall_get_quote(quote_data, quote_data_len);
>  	if (err) {
>  		pr_err("GetQuote hypercall failed, status:%llx\n", err);
>  		return -EIO;
> @@ -315,7 +340,7 @@ static int tdx_report_new_locked(struct tsm_report *report, void *data)
>  
>  	out_len = READ_ONCE(quote_buf->out_len);
>  
> -	if (out_len > TDX_QUOTE_MAX_LEN)
> +	if (TDX_QUOTE_BUF_LEN(out_len) > quote_data_len)
>  		return -EFBIG;
>  
>  	buf = kvmemdup(quote_buf->data, out_len, GFP_KERNEL);
> @@ -417,7 +442,7 @@ static int __init tdx_guest_init(void)
>  	if (ret)
>  		goto deinit_mr;
>  
> -	quote_data = alloc_quote_buf();
> +	quote_data = alloc_quote_buf(&quote_data_len);
>  	if (!quote_data) {
>  		pr_err("Failed to allocate Quote buffer\n");
>  		ret = -ENOMEM;
> @@ -431,7 +456,7 @@ static int __init tdx_guest_init(void)
>  	return 0;
>  
>  free_quote:
> -	free_quote_buf(quote_data);
> +	free_quote_buf(quote_data, quote_data_len);
>  free_misc:
>  	misc_deregister(&tdx_misc_dev);
>  deinit_mr:
> @@ -444,7 +469,7 @@ module_init(tdx_guest_init);
>  static void __exit tdx_guest_exit(void)
>  {
>  	tsm_report_unregister(&tdx_tsm_ops);
> -	free_quote_buf(quote_data);
> +	free_quote_buf(quote_data, quote_data_len);
>  	misc_deregister(&tdx_misc_dev);
>  	tdx_mr_deinit(tdx_attr_groups[0]);
>  }

So, yeah, this patch isn't gigantic. But it's also fundamentally not
doing a _nice_ refactoring the way we expect them to be done.

1. Refactor old code to make it nice for adding features
2. Add the feature

If this was doing it the nice way, we would *actually* have something
that's really close to s/GET_QUOTE_BUF_SIZE/quote_data_len/. But,
instead, this chose to cram the mechanical changes and the new feature
together.

Can we do it the right way, please? If for nothing else, for practice.


* Re: [PATCH 3/6] configfs: Treat attribute structures as const internally
From: Breno Leitao @ 2026-07-20 10:28 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Andreas Hindborg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Daniel Almeida, Tamir Duberstein,
	Alexandre Courbot, Onur Özkan, Matthew Brost,
	Thomas Hellström, Rodrigo Vivi, David Airlie, Simona Vetter,
	Dan Williams, Rafael J. Wysocki, Len Brown, rust-for-linux,
	linux-kernel, intel-xe, dri-devel, linux-coco, linux-acpi
In-Reply-To: <20260716-configfs-const-base-v1-3-c545a4053cb5@weissschuh.net>

On Thu, Jul 16, 2026 at 07:09:28PM +0200, Thomas Weißschuh wrote:
> The configfs core never modifies the attribute structures defined in
> driver core.
> 
> Reflect this in the types used internally in the configfs core.
> 
> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>

Reviewed-by: Breno Leitao <leitao@debian.org>


* Re: [PATCH 5/6] configfs: Allow the registration of const struct configfs_attribute
From: Breno Leitao @ 2026-07-20 10:05 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Andreas Hindborg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Daniel Almeida, Tamir Duberstein,
	Alexandre Courbot, Onur Özkan, Matthew Brost,
	Thomas Hellström, Rodrigo Vivi, David Airlie, Simona Vetter,
	Dan Williams, Rafael J. Wysocki, Len Brown, rust-for-linux,
	linux-kernel, intel-xe, dri-devel, linux-coco, linux-acpi
In-Reply-To: <91f5e9dc-1ba0-4e20-9f1f-5100aaaf0a88@t-8ch.de>

On Fri, Jul 17, 2026 at 05:35:27PM +0200, Thomas Weißschuh wrote:
> On 2026-07-17 01:43:36-0700, Breno Leitao wrote:
> > On Thu, Jul 16, 2026 at 07:09:30PM +0200, Thomas Weißschuh wrote:
> > > The attribute structures defined in driver code never need to be
> > > modified. Allow them to be marked as const.
> > > 
> > > As there are many drivers which use these attributes, prepare for a
> > > phased transition by using a union of const and non-const attributes.
> > 
> > How many drivers need to be migrated? Isn't this a mechanical move?
> 
> The actual constification of the attribute will happen with a central
> change to the CONFIGFS_ATTR* macros. But all users of the macro
> will need to be prepared to handle the constness.
> 
> I have a branch with the full conversion here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/thomas.weissschuh/linux.git/log/?h=b4/configfs-const
> 
>  73 files changed, 241 insertions(+), 248 deletions(-)
> 
>  Most driver changes are really trivial. But a few do more interesting
>  things. My goal is to heavily reduce the amount of patches in this
>  branch by merging patches to the same subsystem.
> 
>  The last four patches will be the finalization going again through
>  the configfs tree.

Oh, that makes total sense now.

Thanks,
--breno


* Re: [PATCH linux-6.12.y v1 0/2] Backporting SEV-SNP CVE-2023-20585 to linux-stable
From: Sasha Levin @ 2026-07-19 15:00 UTC (permalink / raw)
  To: stable
  Cc: Sasha Levin, vasant.hegde, joerg.roedel, iommu,
	dheerajkumar.srivastava, bp, Michael.Roth, linux-coco,
	liam.merwick
In-Reply-To: <20260717104909.3850331-1-liam.merwick@oracle.com>

> The upstream commits apply cleanly to linux-7.0.y and linux-6.18.y
> and an AUTOSEL email[2] was sent (but not pulled so far).

Queued the series for 6.12, and picked up the upstream commits for 6.18
as well, thanks.

-- 
Thanks,
Sasha


* [PATCH v7 11/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Rick Edgecombe @ 2026-07-18  1:45 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The Dynamic PAMT get/put helpers use a global spinlock to serialize all
refcount updates and SEAMCALL invocations. This gives correct behavior for
concurrent callers, but leads to contention. It is especially bad from the
KVM side, which is designed to allow faulting in EPT under a shared lock.
With the global spinlock, not only is the lock an exclusive one, but it is
for all TDs instead of just a single one.

But taking the global lock each time is actually unnecessary. Only the 0->1
and 1->0 refcount transitions actually need the lock (to pair with
SEAMCALLs that actually add and remove with the Dynamic PAMT pages). The
common case of incrementing or decrementing a non-zero refcount can be
done locklessly.

So create a fast and slow path. Check the refcount outside the lock and
only take it for the slow path (0->1 and 1->0 transitions).

On the put side make the refcount adjustment and lock taking atomic so if
a 'get' happens between them, it doesn't cause the Dynamic PAMT to be
freed incorrectly. On the get side there is no technique for doing the
refcount adjustment and lock atomically, so check the refcount again
inside the lock.

AI was used under supervision to collect/apply feedback, review code and
workshop logs. It assisted in identifying/evaluating the stale
conditionals for the races resolved from the atomic_dec_and_lock() change.
Separate from atomic_dec_and_lock() fallout, it suggested changing
atomic_inc() to atomic_set(pamt_refcount, 1) in the put error path for the
sake of being more precise, which Kiryl had also suggested in the past.
The model also suggested updated comments following the
atomic_dec_and_lock() change based on some directed prompting. The
comments were subsequently edited or further prompted for fine tuning.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Drop Assisted-by tag and cover AI use in log (Dave)
 - Move to end of the series
 - Use atomic_inc_not_zero() in this patch inside the spin_lock(), as
   suggested on the non-optimized patch (Dave)

v6:
 - Fix "tdx_pamt_add()" typo to "tdx_pamt_get()" in lost-race comment
 - Fix error path bug: set ret = -EIO and use WARN_ON_ONCE() instead of
   pr_err() for unexpected PAMT.ADD failures (Sean)
 - Use "set the refcount 0->1" wording to match atomic_set() usage
 - Wrap comments to 80 columns
 - Switch to atomic_dec_and_lock() and remove handling of races that are
   no longer needed as a result. Adjust comments as appropriate. (Dave)
 - Adjustments from dropping error helper patches
---
 arch/x86/virt/vmx/tdx/tdx.c | 44 ++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f397ea17248e3..cb0f0ef0dd5f1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2166,28 +2166,41 @@ int tdx_pamt_get(kvm_pfn_t pfn, struct tdx_pamt_cache *cache)
 	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
 		return 0;

-	ret = alloc_pamt_array(pamt_pages, cache);
-	if (ret)
-		return ret;
-
 	pamt_refcount = tdx_find_pamt_refcount(pfn);

-	spin_lock(&pamt_lock);
-
 	/*
 	 * If the pamt page is already added (i.e. refcount >= 1),
 	 * then just increment the refcount.
 	 */
+	if (atomic_inc_not_zero(pamt_refcount))
+		return 0;
+
+	ret = alloc_pamt_array(pamt_pages, cache);
+	if (ret)
+		return ret;
+
+	spin_lock(&pamt_lock);
+
+	/*
+	 * Unlike tdx_pamt_put() which uses atomic_dec_and_lock() to
+	 * atomically handle the 1->0 transition, the get side has no
+	 * equivalent combined primitive for 0->1. Recheck under the
+	 * lock since another get may have already done the 0->1
+	 * transition after both saw atomic_inc_not_zero() fail.
+	 */
 	if (atomic_inc_not_zero(pamt_refcount))
 		goto out_free;

-	/* Try to add the pamt page and take the refcount 0->1. */
 	tdx_status = tdh_phymem_pamt_add(pfn, pamt_pages);
 	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS)) {
 		ret = -EIO;
 		goto out_free;
 	}

+	/*
+	 * The refcount is zero, and this locked path is the
+	 * only way to increase it from 0->1.
+	 */
 	atomic_set(pamt_refcount, 1);
 	spin_unlock(&pamt_lock);
 	return 0;
@@ -2212,17 +2225,13 @@ void tdx_pamt_put(kvm_pfn_t pfn)

 	pamt_refcount = tdx_find_pamt_refcount(pfn);

-	spin_lock(&pamt_lock);
 	/*
 	 * If there is more than 1 reference on the pamt page, don't
 	 * remove it yet. Just decrement the refcount.
 	 */
-	if (atomic_read(pamt_refcount) > 1) {
-		atomic_dec(pamt_refcount);
-		goto out_unlock;
-	}
+	if (!atomic_dec_and_lock(pamt_refcount, &pamt_lock))
+		return;

-	/* Try to remove the pamt page and take the refcount 1->0. */
 	tdx_status = tdh_phymem_pamt_remove(pfn, pamt_pages);

 	/*
@@ -2232,10 +2241,15 @@ void tdx_pamt_put(kvm_pfn_t pfn)
 	 * failure indicates a kernel bug, memory is being leaked, and
 	 * the dangling PAMT entry may cause future operations to fail.
 	 */
-	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS))
+	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS)) {
+		/*
+		 * atomic_dec_and_lock() already decremented it to 0,
+		 * but the PAMT entry still exists since REMOVE failed.
+		 */
+		atomic_set(pamt_refcount, 1);
 		goto out_unlock;
+	}

-	atomic_set(pamt_refcount, 0);
 	spin_unlock(&pamt_lock);
 	free_pamt_array(pamt_pages);
 	return;
-- 
2.54.0


* [PATCH v7 09/11] x86/virt/tdx: Enable Dynamic PAMT
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization. Dynamic PAMT is a TDX module feature that can reduce this
memory use by allocating part of the PAMT dynamically.

The TDX module exposes whether Dynamic PAMT is supported via a bit in the
'features0' metadata. Unfortunately, the TDX module exposes the feature as
supported even when it does not support using it with the number of keyids
currently configured in the BIOS. Since no TDX modules exist today with
that issue fixed, make the feature default off to prevent users from
upgrading their kernel and encountering TDX erroring out when trying to
enable Dynamic PAMT.

For the decision of whether to make it a boot time option and/or compile
time option, consider that Dynamic PAMT's memory savings are significant
enough to make it a good default configuration. That is, most TDX users
should want it unless they have strange keyid configurations.

The feature increases the kernel size by 2KB (when TDX is configured in the
build).

All pieces are in place to enable Dynamic PAMT if it is supported and the
user passes a kernel parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Add kernel parameter following some twists and turns, originally
   deriving from a comment (Chao)

v6:
 - After Nikolai pointed out that the TDX docs actually have the Dynamic
   PAMT pages-per-2MB region fixed at 2 instead of variable sized, I
   checked over the docs more closely looking for anything else that might
   have been missed. Spotted this 48 bit physical address bit check in the
   docs, so added it.
---
 .../admin-guide/kernel-parameters.txt         |  7 +++++++
 arch/x86/include/asm/tdx.h                    |  1 +
 arch/x86/virt/vmx/tdx/tdx.c                   | 21 +++++++++++++++++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f228..be4928489b02f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1321,6 +1321,13 @@ Kernel parameters
 			The filter can be disabled or changed to another
 			driver later using sysfs.
 
+	tdx_dpamt=
+			[X86] Controls whether TDX will use Dynamic PAMT
+			to save memory, when supported.
+
+			Valid parameters: "on", "off"
+			Default: "off"
+
 	reg_file_data_sampling=
 			[X86] Controls mitigation for Register File Data
 			Sampling (RFDS) vulnerability. RFDS is a CPU
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9cbd250bbd39b..7910901a7ba21 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -36,6 +36,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_TD_PRESERVING	BIT_ULL(1)
 #define TDX_FEATURES0_NO_RBP_MOD	BIT_ULL(18)
+#define TDX_FEATURES0_DYNAMIC_PAMT	BIT_ULL(36)
 
 #ifndef __ASSEMBLER__
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 61f48fff69df6..f397ea17248e3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -47,6 +47,8 @@
 #include "seamcall_internal.h"
 #include "tdx.h"
 
+bool tdx_enable_dpamt __ro_after_init;
+
 struct tdx_module_state {
 	bool initialized;
 	bool sysinit_done;
@@ -1028,6 +1030,8 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
 	return ret;
 }
 
+#define TDX_SYS_CONFIG_DYNAMIC_PAMT	BIT(16)
+
 static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
 				    u64 global_keyid)
 {
@@ -1056,6 +1060,12 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
 	args.rcx = __pa(tdmr_pa_array);
 	args.rdx = tdmr_list->nr_consumed_tdmrs;
 	args.r8 = global_keyid;
+
+	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
+		pr_info("Enable Dynamic PAMT\n");
+		args.r8 |= TDX_SYS_CONFIG_DYNAMIC_PAMT;
+	}
+
 	ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
 
 	/* Free the array as it is not required anymore. */
@@ -2041,8 +2051,8 @@ EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
 
 bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 {
-	/* To be enabled when kernel is ready. */
-	return false;
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT &&
+	       tdx_enable_dpamt;
 }
 EXPORT_SYMBOL_FOR_KVM(tdx_supports_dynamic_pamt);
 
@@ -2300,6 +2310,13 @@ void tdx_free_control_page(struct page *page)
 }
 EXPORT_SYMBOL_FOR_KVM(tdx_free_control_page);
 
+static int __init tdx_dpamt_setup(char *str)
+{
+	return kstrtobool(str, &tdx_enable_dpamt) == 0;
+}
+
+__setup("tdx_dpamt=", tdx_dpamt_setup);
+
 void tdx_sys_disable(void)
 {
 	struct tdx_module_args args = {};
-- 
2.54.0



* [PATCH v7 10/11] Documentation/x86: Add documentation for TDX's Dynamic PAMT
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Expand TDX documentation to include information on the Dynamic PAMT
feature.

The new section explains PAMT support in the TDX module and how Dynamic
PAMT affects the kernel memory use.

AI was used under supervision to review the docs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Spell out PAMT acronym (Binbin)
 - Drop Assisted-by tag and cover AI use in log (Dave)
 - Add info about kernel parameter

v6:
 - Add missing word (Binbin)
 - Use "::" instead of ":"
 - Make format of dmesg example accurate
---
 .../admin-guide/kernel-parameters.txt         |  3 ++
 Documentation/arch/x86/tdx.rst                | 28 +++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index be4928489b02f..7c8627ce07eb4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1328,6 +1328,9 @@ Kernel parameters
 			Valid parameters: "on", "off"
 			Default: "off"
 
+			For details see:
+			Documentation/arch/x86/tdx.rst
+
 	reg_file_data_sampling=
 			[X86] Controls mitigation for Register File Data
 			Sampling (RFDS) vulnerability. RFDS is a CPU
diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 3303499ad4c6f..a1c2309230580 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -200,6 +200,34 @@ reflects the TCB of the currently running TDX module and therefore
 changes after an update. By contrast, TEE_TCB_SVN reflects the TCB at TD
 launch time and is not affected.
 
+Dynamic PAMT
+------------
+
+Physical Address Metadata Table (PAMT) is memory that the TDX module needs
+to keep data about each page (think like struct page). It needs to be handed
+to the TDX module for its exclusive use. For normal PAMT, this is installed
+when the TDX module is first loaded and comes to about 0.4% of system memory.
+
+Dynamic PAMT is a TDX module feature that allows VMM to allocate part of the
+PAMT as needed (the parts for tracking 4KB size pages). The other page sizes
+(1GB and 2MB) are still allocated statically at the time of TDX module
+initialization. This reduces the amount of memory that TDX uses while TDs are
+not in use.
+
+When Dynamic PAMT is in use, dmesg shows it like::
+
+  [..] virt/tdx: Enable Dynamic PAMT
+  [..] virt/tdx: 10092 KB allocated for PAMT
+  [..] virt/tdx: TDX-Module initialized
+
+Dynamic PAMT is only enabled when supported and the ``tdx_dpamt=`` kernel
+parameter is set to "on". The feature is off by default because TDX module
+internal details prevent Dynamic PAMT from working on all keyid partitioning
+configurations. When the TDX module is fixed to include these constraints in
+its enumeration of Dynamic PAMT support, kernel support can be changed to
+default on. For more information, consult the Intel TDX documentation about
+Dynamic PAMT.
+
 TDX Interaction to Other Kernel Components
 ------------------------------------------
 
-- 
2.54.0



* [PATCH v7 08/11] KVM: TDX: Get/put PAMT pages when (un)mapping private memory
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add Dynamic PAMT support to KVM's S-EPT MMU by "getting" a PAMT page when
adding guest memory (PAGE.ADD or PAGE.AUG), and "putting" the page when
removing guest memory (PAGE.REMOVE).

To access the per-vCPU PAMT caches without plumbing @vcpu throughout the
TDP MMU, begrudgingly use kvm_get_running_vcpu() to get the vCPU, and bug
the VM if KVM attempts to set an S-EPT leaf without an active vCPU.  KVM
only supports creating _new_ mappings in page (pre)fault paths, all of
which require an active vCPU.

The PAMT memory holds metadata for TDX protected memory. With Dynamic
PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
with a few pages that cover 2MB of host physical memory.

Releases are balanced via tdx_pamt_put(): every control-page free goes
through tdx_free_control_page(), and guest data pages are put directly on
the successful tdh_mem_page_remove() path and in the
tdx_mem_page_add/aug() error path.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[rick: enhance log, reviewing, rebase, with help from AI tooling]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Don't do to_tdx() before NULL check for readability (Binbin, Sean)
 - Fixup tags (Sean)
 - Export tdx_supports_dynamic_pamt() since it's used in KVM here and is
   no longer a static inline.
v6:
 - Don't have topup op take a min param (Yan, Sean)
 - Make log match style of the rest of the series
 - Adjustments from dropping error helper patches
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 +
 arch/x86/kvm/mmu/mmu.c             |  4 ++
 arch/x86/kvm/vmx/tdx.c             | 63 ++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.h             |  2 +
 arch/x86/virt/vmx/tdx/tdx.c        |  1 +
 6 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 83dc5086138b3..588563dfe88d5 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL_RET0(tdp_has_smep)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
 KVM_X86_OP_OPTIONAL(free_external_spt)
+KVM_X86_OP_OPTIONAL_RET0(topup_external_cache)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5f6c1ce9673b7..1c706e2d773b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1922,6 +1922,8 @@ struct kvm_x86_ops {
 	/* Update external page tables for page table about to be freed. */
 	void (*free_external_spt)(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+	int (*topup_external_cache)(struct kvm_vcpu *vcpu, int min_nr_spts);
+
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 234d0a95abf53..6dab99654f170 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -614,6 +614,10 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
+
+		r = kvm_x86_call(topup_external_cache)(vcpu, PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
 	}
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f6c01eab8113b..c3b1d1f056cea 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -681,6 +681,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	if (!irqchip_split(vcpu->kvm))
 		return -EINVAL;
 
+	tdx_init_pamt_cache(&tdx->pamt_cache);
+
 	fpstate_set_confidential(&vcpu->arch.guest_fpu);
 	vcpu->arch.apic->guest_apic_protected = true;
 	INIT_LIST_HEAD(&tdx->vt.pi_wakeup_list);
@@ -866,6 +868,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int i;
 
+	tdx_free_pamt_cache(&tdx->pamt_cache);
+
 	if (vcpu->cpu != -1) {
 		KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
 		tdx_flush_vp_on_cpu(vcpu);
@@ -1621,6 +1625,16 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+static int tdx_topup_external_pamt_cache(struct kvm_vcpu *vcpu, int min_nr_spts)
+{
+	/*
+	 * Don't cover the root SPT, but cover a possible 4KB private
+	 * page in addition to the SPTs. So -1 to exclude the root
+	 * SPT, and +1 for the guest page cancel out.
+	 */
+	return tdx_topup_pamt_cache(&to_tdx(vcpu)->pamt_cache, min_nr_spts);
+}
+
 static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 			    kvm_pfn_t pfn)
 {
@@ -1679,16 +1693,28 @@ static struct page *tdx_spte_to_sept_pt(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_map_nonleaf_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, u64 new_spte)
 {
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 	struct page *sept_pt;
+	int ret;
+
+	if (KVM_BUG_ON(!vcpu, kvm))
+		return -EIO;
 
 	sept_pt = tdx_spte_to_sept_pt(kvm, gfn, new_spte, level);
 	if (!sept_pt)
 		return -EIO;
 
+	ret = tdx_pamt_get(page_to_pfn(sept_pt), &to_tdx(vcpu)->pamt_cache);
+	if (ret)
+		return ret;
+
 	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, level, sept_pt,
 			       &entry, &level_state);
+	if (err)
+		tdx_pamt_put(page_to_pfn(sept_pt));
+
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
@@ -1701,8 +1727,13 @@ static int tdx_sept_map_nonleaf_spte(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				  u64 new_spte)
 {
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	kvm_pfn_t pfn = spte_to_pfn(new_spte);
+	int ret;
+
+	if (KVM_BUG_ON(!vcpu, kvm))
+		return -EIO;
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
@@ -1710,6 +1741,10 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 
 	WARN_ON_ONCE((new_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
 
+	ret = tdx_pamt_get(pfn, &to_tdx(vcpu)->pamt_cache);
+	if (ret)
+		return ret;
+
 	/*
 	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
 	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
@@ -1722,10 +1757,15 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 	 * If the TD isn't finalized/runnable, then userspace is initializing
 	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
 	 */
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
-		return tdx_mem_page_add(kvm, gfn, level, pfn);
+	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
+		ret = tdx_mem_page_aug(kvm, gfn, level, pfn);
+	else
+		ret = tdx_mem_page_add(kvm, gfn, level, pfn);
 
-	return tdx_mem_page_aug(kvm, gfn, level, pfn);
+	if (ret)
+		tdx_pamt_put(pfn);
+
+	return ret;
 }
 
 /*
@@ -1822,6 +1862,7 @@ static int tdx_sept_remove_leaf_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 
 	tdx_quirk_reset_paddr(PFN_PHYS(pfn), PAGE_SIZE);
+	tdx_pamt_put(pfn);
 	return 0;
 }
 
@@ -1865,6 +1906,8 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
  */
 static void tdx_sept_free_private_spt(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
+	struct page *sept_pt = virt_to_page(sp->external_spt);
+
 	/*
 	 * KVM doesn't (yet) zap page table pages in mirror page table while
 	 * TD is active, though guest pages mapped in mirror page table could be
@@ -1878,15 +1921,15 @@ static void tdx_sept_free_private_spt(struct kvm *kvm, struct kvm_mmu_page *sp)
 	 * the page to prevent the kernel from accessing the encrypted page.
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
-	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
+	    tdx_reclaim_page(sept_pt))
 		goto out;
 
 	/*
-	 * Immediately free the S-EPT page because RCU-time free is unnecessary
-	 * after TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
-	 * readers.
+	 * Immediately free the S-EPT page as the TDX subsystem doesn't support
+	 * freeing pages from RCU callbacks, and more importantly because
+	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
 	 */
-	free_page((unsigned long)sp->external_spt);
+	tdx_free_control_page(sept_pt);
 out:
 	sp->external_spt = NULL;
 }
@@ -3482,6 +3525,10 @@ int __init tdx_hardware_setup(void)
 
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+
+	if (tdx_supports_dynamic_pamt(tdx_sysinfo))
+		vt_x86_ops.topup_external_cache = tdx_topup_external_pamt_cache;
+
 	vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
 	return 0;
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ac8323a68b163..fd368e3ee060b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -72,6 +72,8 @@ struct vcpu_tdx {
 
 	u64 map_gpa_next;
 	u64 map_gpa_end;
+
+	struct tdx_pamt_cache pamt_cache;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e2f11b0ba46ce..61f48fff69df6 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2044,6 +2044,7 @@ bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	/* To be enabled when kernel is ready. */
 	return false;
 }
+EXPORT_SYMBOL_FOR_KVM(tdx_supports_dynamic_pamt);
 
 static struct page *tdx_alloc_page_pamt_cache(struct tdx_pamt_cache *cache)
 {
-- 
2.54.0



* [PATCH v7 06/11] KVM: TDX: Allocate PAMT memory for TD and vCPU control structures
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Use control page helpers for allocating and freeing TD control structures,
such that these operations can work for Dynamic PAMT.

The TDX module tracks some state for each page of physical memory that it
might use. It calls this state the PAMT. It includes separate state for
each page size a physical page could be utilized at within the TDX module
(1GB, 2MB, 4KB). In Dynamic PAMT, only the 4KB page size state is
allocated dynamically. So the kernel must ensure PAMT backing is installed
for any 4KB page being gifted to the TDX module, and must tear down the
backing when all associated gifted pages are reclaimed.

TD scoped control pages (TDR, TDCS) and vCPU scoped control pages (TDVPR,
TDCX) are all handed to the TDX module at 4KB page size and are therefore
subject to this requirement. Replace the raw alloc_page()/__free_page()
calls for these pages with tdx_alloc/free_control_page().

Switching between special Dynamic PAMT operations or normal page
alloc/free operations is handled internally in
tdx_alloc/free_control_page(). So don't check for Dynamic PAMT around these
calls. Just call them unconditionally. Similarly, drop the NULL checks
before freeing, as tdx_free_control_page() handles NULL internally.
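
As a rough user-space illustration of why a NULL-tolerant free helper
simplifies the unwind paths: sketch_free_ctrl() and sketch_alloc_all()
are hypothetical names, malloc()/free() stand in for the page allocator,
and fail_at is an artificial knob for forcing an allocation failure.

```c
#include <stdlib.h>

#define SKETCH_NR 4

/* Free helper that tolerates NULL, so callers need no check of their own. */
void sketch_free_ctrl(void *page)
{
	if (!page)
		return;
	free(page);
}

/*
 * Allocate SKETCH_NR buffers. The allocation at index 'fail_at' is forced
 * to fail (pass -1 for no failure). On error, the unwind loop walks every
 * slot unconditionally, including slots that were never filled.
 */
int sketch_alloc_all(void *pages[SKETCH_NR], int fail_at)
{
	int i;

	for (i = 0; i < SKETCH_NR; i++)
		pages[i] = NULL;

	for (i = 0; i < SKETCH_NR; i++) {
		pages[i] = (i == fail_at) ? NULL : malloc(64);
		if (!pages[i])
			goto err;
	}
	return 0;

err:
	for (i = 0; i < SKETCH_NR; i++) {
		sketch_free_ctrl(pages[i]);
		pages[i] = NULL;
	}
	return -1;
}
```

The error loop has no per-slot NULL test, which mirrors the shape of the
simplified free_tdcs/free_tdcx loops in this patch.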

No functional change intended when Dynamic PAMT is not in use.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[sean: handle alloc+free+reclaim in one patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[rick: enhance log, reviewing, rebase, with help from AI tooling]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Fixup tags (Sean)
 - Missing word in log (Binbin)
 - Log smoothness (Yan)
---
 arch/x86/kvm/vmx/tdx.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 989ab29b8c6fb..f6c01eab8113b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -362,7 +362,7 @@ static void tdx_reclaim_control_page(struct page *ctrl_page)
 	if (tdx_reclaim_page(ctrl_page))
 		return;
 
-	__free_page(ctrl_page);
+	tdx_free_control_page(ctrl_page);
 }
 
 struct tdx_flush_vp_arg {
@@ -589,7 +589,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 
 	tdx_quirk_reset_paddr(page_to_phys(kvm_tdx->td.tdr_page), PAGE_SIZE);
 
-	__free_page(kvm_tdx->td.tdr_page);
+	tdx_free_control_page(kvm_tdx->td.tdr_page);
 	kvm_tdx->td.tdr_page = NULL;
 }
 
@@ -2459,7 +2459,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 
 	ret = -ENOMEM;
 
-	tdr_page = alloc_page(GFP_KERNEL_ACCOUNT);
+	tdr_page = tdx_alloc_control_page();
 	if (!tdr_page)
 		goto free_hkid;
 
@@ -2472,7 +2472,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		goto free_tdr;
 
 	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
-		tdcs_pages[i] = alloc_page(GFP_KERNEL_ACCOUNT);
+		tdcs_pages[i] = tdx_alloc_control_page();
 		if (!tdcs_pages[i])
 			goto free_tdcs;
 	}
@@ -2590,10 +2590,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 teardown:
 	/* Only free pages not yet added, so start at 'i' */
 	for (; i < kvm_tdx->td.tdcs_nr_pages; i++) {
-		if (tdcs_pages[i]) {
-			__free_page(tdcs_pages[i]);
-			tdcs_pages[i] = NULL;
-		}
+		tdx_free_control_page(tdcs_pages[i]);
+		tdcs_pages[i] = NULL;
 	}
 	if (!kvm_tdx->td.tdcs_pages)
 		kfree(tdcs_pages);
@@ -2608,16 +2606,13 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 	free_cpumask_var(packages);
 
 free_tdcs:
-	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
-		if (tdcs_pages[i])
-			__free_page(tdcs_pages[i]);
-	}
+	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++)
+		tdx_free_control_page(tdcs_pages[i]);
 	kfree(tdcs_pages);
 	kvm_tdx->td.tdcs_pages = NULL;
 
 free_tdr:
-	if (tdr_page)
-		__free_page(tdr_page);
+	tdx_free_control_page(tdr_page);
 	kvm_tdx->td.tdr_page = NULL;
 
 free_hkid:
@@ -2947,7 +2942,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	int ret, i;
 	u64 err;
 
-	page = alloc_page(GFP_KERNEL_ACCOUNT);
+	page = tdx_alloc_control_page();
 	if (!page)
 		return -ENOMEM;
 	tdx->vp.tdvpr_page = page;
@@ -2967,7 +2962,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	}
 
 	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
-		page = alloc_page(GFP_KERNEL_ACCOUNT);
+		page = tdx_alloc_control_page();
 		if (!page) {
 			ret = -ENOMEM;
 			goto free_tdcx;
@@ -2989,7 +2984,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 			 * method, but the rest are freed here.
 			 */
 			for (; i < kvm_tdx->td.tdcx_nr_pages; i++) {
-				__free_page(tdx->vp.tdcx_pages[i]);
+				tdx_free_control_page(tdx->vp.tdcx_pages[i]);
 				tdx->vp.tdcx_pages[i] = NULL;
 			}
 			return -EIO;
@@ -3017,16 +3012,14 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 
 free_tdcx:
 	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
-		if (tdx->vp.tdcx_pages[i])
-			__free_page(tdx->vp.tdcx_pages[i]);
+		tdx_free_control_page(tdx->vp.tdcx_pages[i]);
 		tdx->vp.tdcx_pages[i] = NULL;
 	}
 	kfree(tdx->vp.tdcx_pages);
 	tdx->vp.tdcx_pages = NULL;
 
 free_tdvpr:
-	if (tdx->vp.tdvpr_page)
-		__free_page(tdx->vp.tdvpr_page);
+	tdx_free_control_page(tdx->vp.tdvpr_page);
 	tdx->vp.tdvpr_page = NULL;
 	tdx->vp.tdvpr_pa = 0;
 
-- 
2.54.0



* [PATCH v7 05/11] x86/virt/tdx: Handle multiple callers in tdx_pamt_get/put()
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

tdx_pamt_get()/tdx_pamt_put() unconditionally add or remove Dynamic PAMT
backing for the 2MB region covering the passed page. However, multiple
callers can add or remove 4KB pages that fall within the same 2MB region
and in that scenario only a single PAMT entry is required.

Make the helpers add or remove Dynamic PAMT backing only when required,
by refcounting each 2MB range. Gate the actual Dynamic PAMT add
and remove on refcount transitions (0->1 and 1->0). Serialize the refcount
check and SEAMCALL with a global spinlock so the read-decide-act sequence
is atomic. This also avoids TDX module BUSY errors, as the Dynamic PAMT add
and remove SEAMCALLs take internal TDX module locks for the 2MB ranges of
the specified PFN and the PAMT page pair PFNs. So simultaneous attempts on
the same 2MB ranges of the PFNs would otherwise encounter an error, which
would not be handleable in the put case.

The lock is global and heavyweight. Use simple conditional logic to keep
correctness obvious. This will be optimized in a later change.

The pamt_refcounts[] are atomic_t's. They do not strictly need to be
because all access is protected by pamt_lock. The overhead of an atomic_t
in this situation is minuscule compared to the global lock. Leave the
atomic_t in place to enable future optimization with minimal churn.
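
The refcount gating described above can be modeled in user space roughly
as follows. This is a sketch under assumptions, not the kernel code: all
sketch_* names are hypothetical, a C11 atomic_flag spinlock stands in for
pamt_lock, and the add/remove SEAMCALLs are replaced by functions that
only flip a flag. Unlike the patch, the put side here also ignores a put
on a zero refcount.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: a handful of 2MB ranges, one counter and one flag each. */
#define SKETCH_NR_RANGES 8
static int sketch_refcount[SKETCH_NR_RANGES];
static bool sketch_backed[SKETCH_NR_RANGES];
static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;

static void sketch_spin_lock(void)
{
	while (atomic_flag_test_and_set(&sketch_lock))
		;	/* spin until the flag is released */
}

static void sketch_spin_unlock(void)
{
	atomic_flag_clear(&sketch_lock);
}

/* Stand-ins for the add/remove SEAMCALLs; they cannot fail here. */
static int sketch_backing_add(unsigned int r)
{
	sketch_backed[r] = true;
	return 0;
}

static int sketch_backing_remove(unsigned int r)
{
	sketch_backed[r] = false;
	return 0;
}

/* 512 4KB pfns per 2MB range; the modulo only keeps the toy array bounded. */
static unsigned int sketch_range_of(unsigned long pfn)
{
	return (pfn >> 9) % SKETCH_NR_RANGES;
}

/* Install backing only on the 0->1 transition; otherwise just count. */
int sketch_get(unsigned long pfn)
{
	unsigned int r = sketch_range_of(pfn);
	int ret = 0;

	sketch_spin_lock();
	if (sketch_refcount[r] > 0)
		sketch_refcount[r]++;
	else if (sketch_backing_add(r) == 0)
		sketch_refcount[r] = 1;
	else
		ret = -1;
	sketch_spin_unlock();

	return ret;
}

/* Remove backing only on the 1->0 transition; otherwise just count. */
void sketch_put(unsigned long pfn)
{
	unsigned int r = sketch_range_of(pfn);

	sketch_spin_lock();
	if (sketch_refcount[r] > 1)
		sketch_refcount[r]--;
	else if (sketch_refcount[r] == 1 && sketch_backing_remove(r) == 0)
		sketch_refcount[r] = 0;
	sketch_spin_unlock();
}
```

Because the check and the add/remove happen under one lock, two callers
touching pfns in the same 2MB range cannot both observe a zero refcount
and both attempt the add.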

AI was used under supervision to collect/apply feedback, split patches,
review code and workshop logs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Convert scoped_guard() blocks to use normal spin_un/lock() for the
   sake of making next patches diff cleaner
 - Drop __maybe_unused from tdx_find_pamt_refcount() (Binbin)
 - Switch to atomic_inc_not_zero() (Dave)
 - Justify use of atomic_t in log (Sohil)
 - Log/comments (Yan)
 - Drop Assisted-by tag and cover AI use in log (Dave)

v6:
 - Split from "x86/virt/tdx: Add tdx_alloc/free_control_page() helpers"
 - Return 0 instead of ret to be clearer (Binbin)
 - Clarify log (Nikolay)
 - Justify why the patch is not optimized, in response to comments
   (Nikolay)
 - Move tdx_find_pamt_refcount() to facilitate patch re-order
 - Adjustments from dropping error helper patches
 - Log tweaks
---
 arch/x86/virt/vmx/tdx/tdx.c | 48 +++++++++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b2ddd3c192645..343d0cccc9874 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -289,7 +289,7 @@ static __init void free_pamt_refcounts(void)
 	pamt_refcounts = NULL;
 }
 
-static atomic_t * __maybe_unused tdx_find_pamt_refcount(unsigned long pfn)
+static atomic_t *tdx_find_pamt_refcount(unsigned long pfn)
 {
 	/* Find which PMD a PFN is in. */
 	unsigned long index = pfn >> (PMD_SHIFT - PAGE_SHIFT);
@@ -2120,10 +2120,14 @@ static u64 tdh_phymem_pamt_remove(kvm_pfn_t pfn, struct page **pamt_pages)
 	return 0;
 }
 
-/* Allocate PAMT memory for the given page */
+/* Serializes adding/removing PAMT memory */
+static DEFINE_SPINLOCK(pamt_lock);
+
+/* Bump PAMT refcount for the given pfn and allocate PAMT backing if needed. */
 static int tdx_pamt_get(kvm_pfn_t pfn)
 {
 	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];
+	atomic_t *pamt_refcount;
 	u64 tdx_status;
 	int ret;
 
@@ -2134,29 +2138,58 @@ static int tdx_pamt_get(kvm_pfn_t pfn)
 	if (ret)
 		return ret;
 
+	pamt_refcount = tdx_find_pamt_refcount(pfn);
+
+	spin_lock(&pamt_lock);
+
+	/*
+	 * If the pamt page is already added (i.e. refcount >= 1),
+	 * then just increment the refcount.
+	 */
+	if (atomic_inc_not_zero(pamt_refcount))
+		goto out_free;
+
+	/* Try to add the pamt page and take the refcount 0->1. */
 	tdx_status = tdh_phymem_pamt_add(pfn, pamt_pages);
-	if (tdx_status != TDX_SUCCESS) {
+	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS)) {
 		ret = -EIO;
 		goto out_free;
 	}
 
+	atomic_set(pamt_refcount, 1);
+	spin_unlock(&pamt_lock);
 	return 0;
 
 out_free:
+	spin_unlock(&pamt_lock);
 	free_pamt_array(pamt_pages);
 
 	return ret;
 }
 
-/* Free PAMT memory for the given page */
+/* Drop PAMT refcount for the given pfn and free PAMT backing if needed. */
 static void tdx_pamt_put(kvm_pfn_t pfn)
 {
 	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT] = {};
+	atomic_t *pamt_refcount;
 	u64 tdx_status;
 
 	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
 		return;
 
+	pamt_refcount = tdx_find_pamt_refcount(pfn);
+
+	spin_lock(&pamt_lock);
+	/*
+	 * If there is more than 1 reference on the pamt page, don't
+	 * remove it yet. Just decrement the refcount.
+	 */
+	if (atomic_read(pamt_refcount) > 1) {
+		atomic_dec(pamt_refcount);
+		goto out_unlock;
+	}
+
+	/* Try to remove the pamt page and take the refcount 1->0. */
 	tdx_status = tdh_phymem_pamt_remove(pfn, pamt_pages);
 
 	/*
@@ -2167,9 +2200,14 @@ static void tdx_pamt_put(kvm_pfn_t pfn)
 	 * the dangling PAMT entry may cause future operations to fail.
 	 */
 	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS))
-		return;
+		goto out_unlock;
 
+	atomic_set(pamt_refcount, 0);
+	spin_unlock(&pamt_lock);
 	free_pamt_array(pamt_pages);
+	return;
+out_unlock:
+	spin_unlock(&pamt_lock);
 }
 
 /*
-- 
2.54.0



* [PATCH v7 04/11] x86/virt/tdx: Allocate refcounts for Dynamic PAMT memory
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The PAMT memory holds metadata for all possible TDX protected memory. Each
physical address range is covered by PAMT entries at three levels (1GB,
2MB, 4KB). With Dynamic PAMT, the 4KB level of PAMT is allocated on
demand. The kernel supplies the TDX module with page pairs to store the
4KB level entries, which cover 2MB of host physical memory. The kernel must
provide this page pair before using pages from the range for TDX. If this
is not done, SEAMCALLs that give the pages to be protected by the TDX
module will fail.

Allocate reference counters for every 2MB range to track TDX memory usage.
This can be used to handle concurrent get/put callers, in order to
accurately determine when the 4KB level of Dynamic PAMT needs to be
allocated and when it can be freed.

This allocation will currently consume 2MB for every 1TB of address
space from 0 to max_pfn. The allocation size will depend on how the RAM is
physically laid out. In a worst case scenario where the entire 52 bit
address space is covered this would be 8GB. Then the Dynamic PAMT refcount
allocations could hypothetically cause the savings from Dynamic PAMT to go
negative on exotic platforms with sparse, small amounts of memory.
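
The sizing figures above can be checked with a small stand-alone
calculation. The sketch assumes 4KB base pages, 2MB ranges and a 4-byte
counter (atomic_t wraps an int); the SKETCH_* constants and sketch_*
functions are hypothetical names used only for this illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* 4KB base page (assumption) */
#define SKETCH_PMD_SHIFT	21	/* 2MB range (assumption) */
#define SKETCH_PFNS_PER_2MB	(1ULL << (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT))
#define SKETCH_REFCOUNT_BYTES	4ULL	/* one 4-byte counter per 2MB range */

/* Bytes of refcount storage needed to cover pfns 0..max_pfn-1. */
uint64_t sketch_refcount_bytes(uint64_t max_pfn)
{
	/* Round up so a partial trailing 2MB range still gets a counter. */
	uint64_t nr = (max_pfn + SKETCH_PFNS_PER_2MB - 1) / SKETCH_PFNS_PER_2MB;

	return nr * SKETCH_REFCOUNT_BYTES;
}

/* Index of the counter covering a given 4KB pfn. */
uint64_t sketch_refcount_index(uint64_t pfn)
{
	return pfn >> (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT);
}
```

With these assumptions, a max_pfn of 1ULL << 28 (1TB of address space)
yields 2MB of counters, and a max_pfn of 1ULL << 40 (the full 52 bit
address space) yields 8GB, matching the numbers in the log.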

Future changes could reduce this refcount overhead by only allocating
refcounts for physical ranges that contain memory that TDX can use.
However, this is left for future work.

AI was used under supervision to collect/apply feedback, review code and
workshop logs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Annotate functions __init (Chao)
 - Log tweaks (Yan)
 - Standardize on memory units in the text (Sohil)
 - Delete unneeded comment (Sohil)
 - Use vzalloc() (Sohil)
 - Drop Assisted-by tag and cover AI use in log (Dave)

v6:
 - Remove confusing reference to allocating PAMT memory in
   pamt_refcounts comment. (Yan)
 - Rename "metadata" function names that really deal with refcounts, as
   metadata already has a different meaning in TDX.
 - Move tdx_find_pamt_refcount() to this patch to aid in reviewability
---
 arch/x86/virt/vmx/tdx/tdx.c | 53 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bfd9928c10249..b2ddd3c192645 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -30,6 +30,7 @@
 #include <linux/suspend.h>
 #include <linux/syscore_ops.h>
 #include <linux/idr.h>
+#include <linux/vmalloc.h>
 #include <asm/page.h>
 #include <asm/special_insns.h>
 #include <asm/msr-index.h>
@@ -63,6 +64,14 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
 
 static struct tdmr_info_list tdx_tdmr_list;
 
+/*
+ * On a machine with Dynamic PAMT, the kernel maintains a reference counter
+ * for every 2MB range. The counter indicates how many users there are for
+ * the PAMT memory of the 2MB range. The kernel allocates PAMT refcounts at
+ * initialization.
+ */
+static atomic_t *pamt_refcounts;
+
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);
 
@@ -252,6 +261,42 @@ static struct syscore tdx_syscore = {
 	.ops = &tdx_syscore_ops,
 };
 
+/*
+ * Allocate PAMT reference counters for all physical memory.
+ *
+ * It consumes 2MB for every 1TB of physical memory.
+ */
+static __init int init_pamt_refcounts(void)
+{
+	size_t size = DIV_ROUND_UP(max_pfn, PTRS_PER_PTE) * sizeof(*pamt_refcounts);
+
+	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+		return 0;
+
+	pamt_refcounts = vzalloc(size);
+	if (!pamt_refcounts)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static __init void free_pamt_refcounts(void)
+{
+	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+		return;
+
+	vfree(pamt_refcounts);
+	pamt_refcounts = NULL;
+}
+
+static atomic_t * __maybe_unused tdx_find_pamt_refcount(unsigned long pfn)
+{
+	/* Find which PMD a PFN is in. */
+	unsigned long index = pfn >> (PMD_SHIFT - PAGE_SHIFT);
+
+	return &pamt_refcounts[index];
+}
+
 /*
  * Add a memory region as a TDX memory block.  The caller must make sure
  * all memory regions are added in address ascending order and don't
@@ -1152,10 +1197,14 @@ static __init int init_tdx_module(void)
 	 */
 	get_online_mems();
 
-	ret = build_tdx_memlist(&tdx_memlist);
+	ret = init_pamt_refcounts();
 	if (ret)
 		goto out_put_tdxmem;
 
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto err_free_pamt_refcounts;
+
 	/* Allocate enough space for constructing TDMRs */
 	ret = alloc_tdmr_list(&tdx_tdmr_list, &tdx_sysinfo.tdmr);
 	if (ret)
@@ -1205,6 +1254,8 @@ static __init int init_tdx_module(void)
 	free_tdmr_list(&tdx_tdmr_list);
 err_free_tdxmem:
 	free_tdx_memlist(&tdx_memlist);
+err_free_pamt_refcounts:
+	free_pamt_refcounts();
 	goto out_put_tdxmem;
 }
 
-- 
2.54.0



* [PATCH v7 07/11] x86/tdx: Add APIs to support Dynamic PAMT ops from KVM's fault path
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

When handling an EPT violation, KVM holds a spinlock while manipulating
the EPT. Before entering the spinlock it doesn't know how many EPT page
tables will need to be installed or whether a huge page will be used. For
this reason it allocates a worst case number of page tables that it might
need as part of servicing the EPT violation.

Under Dynamic PAMT these pre-allocated pages will potentially need to have
Dynamic PAMT backing pages installed for them. KVM already has helpers to
manage topping up page caches before taking the MMU lock, but they cannot
be passed from KVM to arch/x86 code.

The problem of how and when to install the Dynamic PAMT backing pages for
the pages given to the TDX module during the fault path has gone through
several design attempts:
 - Extracting KVM's MMU caches requires too much inlined code added to
   headers.
 - A few varieties of installing Dynamic PAMT backing when allocating the
   S-EPT page tables. (see links)
 - Using mempool_t to transfer the pages between KVM and arch/x86 doesn't
   work because the component is designed more around maintaining a pool
   of pages, rather than topping up a continually drained cache.

All of these approaches had problems, so don't use them. Instead, create a
small, simple data structure for handing a pre-allocated list of pages
between KVM and arch/x86 code. Model this on KVM's existing MMU memory
caches.

Add a tdx_pamt_cache arg to tdx_pamt_get() so it can draw pages from a
cache when needed. Not all Dynamic PAMT page installations will happen
under spinlock, for example TD and vCPU scoped control pages. So have
tdx_pamt_get() maintain the existing behavior of allocating from the page
allocator when NULL is passed for the struct tdx_pamt_cache arg. This
prevents excess allocations in cases where they can be avoided.
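
A user-space approximation of the topup-then-draw pattern, including the
NULL fallback, might look like the following. It is a sketch only: the
sketch_* names are hypothetical, malloc()/free() stand in for the page
allocator, and a singly linked list replaces the kernel's list_head.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page sitting in the cache. */
struct sketch_page {
	struct sketch_page *next;
};

/* Pre-allocated pages plus a count, filled before taking any lock. */
struct sketch_cache {
	struct sketch_page *head;
	int cnt;
};

/* Fill the cache up to 'min' entries. May allocate, so call it unlocked. */
int sketch_topup(struct sketch_cache *cache, int min)
{
	while (cache->cnt < min) {
		struct sketch_page *page = malloc(sizeof(*page));

		if (!page)
			return -1;

		page->next = cache->head;
		cache->head = page;
		cache->cnt++;
	}
	return 0;
}

/* Pop one page from the cache, or return NULL if it is empty. */
static struct sketch_page *sketch_pop(struct sketch_cache *cache)
{
	struct sketch_page *page = cache->head;

	if (page) {
		cache->head = page->next;
		cache->cnt--;
	}
	return page;
}

/* Draw from the cache when one is passed, else from the allocator. */
struct sketch_page *sketch_alloc(struct sketch_cache *cache)
{
	if (cache)
		return sketch_pop(cache);

	return malloc(sizeof(struct sketch_page));
}

/* Release whatever is still left in the cache. */
void sketch_free_cache(struct sketch_cache *cache)
{
	struct sketch_page *page;

	while ((page = sketch_pop(cache)))
		free(page);
}
```

A caller that may not allocate under its lock would call sketch_topup()
first and pass the cache to sketch_alloc(); a caller that is free to
allocate passes NULL and skips the cache entirely.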

Export the new helpers for KVM.

AI was used under supervision to review code and workshop logs.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Link: https://lore.kernel.org/kvm/aXENNKjAKTM9UJNH@google.com/
Link: https://lore.kernel.org/kvm/20260129011517.3545883-20-seanjc@google.com/
Link: https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
---
v7:
 - Log/comment tweaks (Yan)
 - Drop Assisted-by tag and cover AI use in log (Dave)

v6:
 - Filled out log from Sean's series
---
 arch/x86/include/asm/tdx.h  | 17 ++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 65 +++++++++++++++++++++++++++++++++----
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 86bf37b15c705..9cbd250bbd39b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -120,6 +120,23 @@ static inline bool tdx_supports_runtime_update(const struct tdx_sys_info *sysinf
 
 bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo);
 
+/* Simple structure for pre-allocating Dynamic PAMT pages outside of spinlocks. */
+struct tdx_pamt_cache {
+	struct list_head page_list;
+	int cnt;
+};
+
+static inline void tdx_init_pamt_cache(struct tdx_pamt_cache *cache)
+{
+	INIT_LIST_HEAD(&cache->page_list);
+	cache->cnt = 0;
+}
+
+void tdx_free_pamt_cache(struct tdx_pamt_cache *cache);
+int tdx_topup_pamt_cache(struct tdx_pamt_cache *cache, unsigned long npages);
+int tdx_pamt_get(kvm_pfn_t pfn, struct tdx_pamt_cache *cache);
+void tdx_pamt_put(kvm_pfn_t pfn);
+
 int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 343d0cccc9874..e2f11b0ba46ce 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2045,12 +2045,33 @@ bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return false;
 }
 
-static int alloc_pamt_array(struct page **pamt_pages)
+static struct page *tdx_alloc_page_pamt_cache(struct tdx_pamt_cache *cache)
+{
+	struct page *page;
+
+	page = list_first_entry_or_null(&cache->page_list, struct page, lru);
+	if (page) {
+		list_del(&page->lru);
+		cache->cnt--;
+	}
+
+	return page;
+}
+
+static struct page *alloc_dpamt_page(struct tdx_pamt_cache *cache)
+{
+	if (cache)
+		return tdx_alloc_page_pamt_cache(cache);
+
+	return alloc_page(GFP_KERNEL_ACCOUNT);
+}
+
+static int alloc_pamt_array(struct page **pamt_pages, struct tdx_pamt_cache *cache)
 {
 	int i, j;
 
 	for (i = 0; i < TDX_DPAMT_ENTRY_PAGE_CNT; i++) {
-		pamt_pages[i] = alloc_page(GFP_KERNEL_ACCOUNT);
+		pamt_pages[i] = alloc_dpamt_page(cache);
 		if (!pamt_pages[i])
 			goto err;
 	}
@@ -2124,7 +2145,7 @@ static u64 tdh_phymem_pamt_remove(kvm_pfn_t pfn, struct page **pamt_pages)
 static DEFINE_SPINLOCK(pamt_lock);
 
 /* Bump PAMT refcount for the given pfn and allocate PAMT backing if needed. */
-static int tdx_pamt_get(kvm_pfn_t pfn)
+int tdx_pamt_get(kvm_pfn_t pfn, struct tdx_pamt_cache *cache)
 {
 	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];
 	atomic_t *pamt_refcount;
@@ -2134,7 +2155,7 @@ static int tdx_pamt_get(kvm_pfn_t pfn)
 	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
 		return 0;
 
-	ret = alloc_pamt_array(pamt_pages);
+	ret = alloc_pamt_array(pamt_pages, cache);
 	if (ret)
 		return ret;
 
@@ -2166,9 +2187,10 @@ static int tdx_pamt_get(kvm_pfn_t pfn)
 
 	return ret;
 }
+EXPORT_SYMBOL_FOR_KVM(tdx_pamt_get);
 
 /* Drop PAMT refcount for the given pfn and free PAMT backing if needed. */
-static void tdx_pamt_put(kvm_pfn_t pfn)
+void tdx_pamt_put(kvm_pfn_t pfn)
 {
 	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT] = {};
 	atomic_t *pamt_refcount;
@@ -2209,6 +2231,37 @@ static void tdx_pamt_put(kvm_pfn_t pfn)
 out_unlock:
 	spin_unlock(&pamt_lock);
 }
+EXPORT_SYMBOL_FOR_KVM(tdx_pamt_put);
+
+void tdx_free_pamt_cache(struct tdx_pamt_cache *cache)
+{
+	struct page *page;
+
+	while ((page = tdx_alloc_page_pamt_cache(cache)))
+		__free_page(page);
+}
+EXPORT_SYMBOL_FOR_KVM(tdx_free_pamt_cache);
+
+int tdx_topup_pamt_cache(struct tdx_pamt_cache *cache, unsigned long npages)
+{
+	if (WARN_ON_ONCE(!tdx_supports_dynamic_pamt(&tdx_sysinfo)))
+		return 0;
+
+	npages *= TDX_DPAMT_ENTRY_PAGE_CNT;
+
+	while (cache->cnt < npages) {
+		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT);
+
+		if (!page)
+			return -ENOMEM;
+
+		list_add(&page->lru, &cache->page_list);
+		cache->cnt++;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_FOR_KVM(tdx_topup_pamt_cache);
 
 /*
  * Return a page that can be gifted to the TDX-Module for use as a "control"
@@ -2223,7 +2276,7 @@ struct page *tdx_alloc_control_page(void)
 	if (!page)
 		return NULL;
 
-	if (tdx_pamt_get(page_to_pfn(page))) {
+	if (tdx_pamt_get(page_to_pfn(page), NULL)) {
 		__free_page(page);
 		return NULL;
 	}
-- 
2.54.0



* [PATCH v7 03/11] x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add helpers to use when allocating or preparing pages that are handed to
the TDX module for use as control/S-EPT pages, and thus need Dynamic PAMT
adjustments.

The TDX module tracks some state for each page of physical memory that it
might use. It calls this state the PAMT. It includes separate state for
each page size a physical page could be utilized at within the TDX module
(1GB, 2MB, 4KB). In Dynamic PAMT, only the 4KB page size state is
allocated dynamically.

KVM will need to hand pages to the TDX module that it will use at 4KB
granularity. So these pages will need Dynamic PAMT backing added before
they are used by the TDX module, and removed afterwards.

Add tdx_alloc_control_page() and tdx_free_control_page() to handle both
page allocation and Dynamic PAMT installation. Make them behave like
normal alloc/free functions where allocation can fail in the case of no
memory, but free (with any necessary Dynamic PAMT release) always
succeeds. Do this so they can support the existing TDX flows that require
teardowns to succeed.

Also create tdx_pamt_get/put() to handle installing Dynamic PAMT 4KB
backing for pages that are already allocated (such as KVM's use of S-EPT
page tables or guest private memory). Have them take a pfn instead of a
struct page, as future changes will want to use these helpers for guest
pages which are tracked by PFN.
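
As an illustration of the pfn-based interface, the following is a minimal
user-space sketch of how a pfn could be turned into the argument that names
the surrounding 2MB region. The sketch_* names and SKETCH_* constants are
hypothetical stand-ins, not the kernel helpers, and the value 1 for the 2MB
page-size level is an assumption about the TDX encoding.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12              /* 4KB base pages */
#define SKETCH_PMD_SIZE   (1ULL << 21)    /* 2MB region size */
#define SKETCH_TDX_PS_2M  1ULL            /* assumed encoding of the 2MB level */

/* Build the argument naming the 2MB region that contains @pfn. */
static uint64_t sketch_pamt_2mb_arg(uint64_t pfn)
{
	uint64_t hpa, hpa_2mb;

	/* The shift below must not overflow 64 bits. */
	assert(pfn <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));

	hpa = pfn << SKETCH_PAGE_SHIFT;
	hpa_2mb = hpa & ~(SKETCH_PMD_SIZE - 1);   /* align down to 2MB */

	/* The low bits are free after alignment and carry the level. */
	return hpa_2mb | SKETCH_TDX_PS_2M;
}

/* Two pfns map to the same argument when they sit in one 2MB region. */
static int sketch_same_dpamt_region(uint64_t pfn_a, uint64_t pfn_b)
{
	return sketch_pamt_2mb_arg(pfn_a) == sketch_pamt_2mb_arg(pfn_b);
}
```

In this sketch, pfns 0x200 and 0x3ff resolve to the same region, while 0x1ff
falls in the preceding one.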

Don't CLFLUSH the Dynamic PAMT pages handed to the TDX module, as is done
for some other SEAMCALLs, as the TDX docs specify that this is only
needed on "TD private memory or TD control structure page".

Since these allocations will be easily user triggerable, account the
memory.

Leave logic to handle concurrency issues for future changes.

AI was used under supervision to collect/apply feedback, split patches,
review code and workshop logs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Comment improvements (Chao)
 - Drop unneeded addition of mm.h header include (Binbin)
 - Log clarity, code style nits (Sohil)
 - Drop Assisted-by tag and cover AI use in log (Dave)

v6:
The major change was to split out the concurrency handling into a future
patch, which makes it easier to explain in the log. This patch is the basic
functionality; the simple version of the concurrency handling, and the
reasoning behind it, is in the next patch. The other change was to get rid
of the dynamically sized DPAMT backing support, which was not based on a
formal spec.

Details:
 - Split out concurrency stuff into next patch because the log was too long
 - Switch to fixed size pamt page arrays (Nikolay)
 - Rename tdx_alloc_page()/tdx_free_page() to tdx_alloc_control_page()/
   tdx_free_control_page() to reflect control/S-EPT purpose (Sean)
 - Take gfp from the caller in tdx_alloc_control_page() (Sean)
 - Narrow external API: make tdx_pamt_get()/tdx_pamt_put() static and
   export only tdx_alloc_control_page()/tdx_free_control_page() (note:
   dropped inline helpers since the discussion on Sean's series resulted
   in them not being needed)
 - Switch EXPORT_SYMBOL_GPL to EXPORT_SYMBOL_FOR_KVM (Sean)
 - Use WARN_ON_ONCE() instead of pr_err() for TDX module failures (Sean)
 - Fold alloc_pamt_array()/free_pamt_array() helpers back in and fix the
   error-unwind index bug (dpamt_pages[i] -> [j])
 - Adjustments after struct page->pfn
 - Adjustments from dropping error helper patches
 - Make the free error paths more normal
 - Drop gfp_t arg in tdx_alloc_control_page(). In the Sean mega v5, it
   was needed because the kvm_mmu_memory_cache carried a gfp_t that had
   to be passed somewhere. But this was still weird because that
   version didn't apply the gfp_t when allocating the DPAMT pages. And in
   the end all the callers pass GFP_KERNEL_ACCOUNT. So just drop the arg.
 - Log tweaks
---
 arch/x86/include/asm/tdx.h  |   6 ++
 arch/x86/virt/vmx/tdx/tdx.c | 163 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |   2 +
 3 files changed, 171 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d414064436221..86bf37b15c705 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -126,6 +126,12 @@ void tdx_guest_keyid_free(unsigned int keyid);
 
 void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
 
+/* Number of PAMT pages to be provided to TDX module per 2MB region of PA */
+#define TDX_DPAMT_ENTRY_PAGE_CNT 2
+
+struct page *tdx_alloc_control_page(void);
+void tdx_free_control_page(struct page *page);
+
 struct tdx_td {
 	/* TD root structure: */
 	struct page *tdr_page;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f395f1fe95093..bfd9928c10249 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1994,6 +1994,169 @@ bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return false;
 }
 
+static int alloc_pamt_array(struct page **pamt_pages)
+{
+	int i, j;
+
+	for (i = 0; i < TDX_DPAMT_ENTRY_PAGE_CNT; i++) {
+		pamt_pages[i] = alloc_page(GFP_KERNEL_ACCOUNT);
+		if (!pamt_pages[i])
+			goto err;
+	}
+
+	return 0;
+
+err:
+	for (j = 0; j < i; j++)
+		__free_page(pamt_pages[j]);
+
+	return -ENOMEM;
+}
+
+static void free_pamt_array(struct page **pamt_pages)
+{
+	for (int i = 0; i < TDX_DPAMT_ENTRY_PAGE_CNT; i++) {
+		/*
+		 * Reset pages unconditionally to cover cases
+		 * where they were passed to the TDX module.
+		 */
+		tdx_quirk_reset_paddr(page_to_phys(pamt_pages[i]), PAGE_SIZE);
+
+		__free_page(pamt_pages[i]);
+	}
+}
+
+/*
+ * Calculate the arg needed for operating on the DPAMT backing for
+ * a given 4KB page.
+ */
+static u64 pamt_2mb_arg(kvm_pfn_t pfn)
+{
+	/* Arg value will specify a 2MB region of physical address space. */
+	unsigned long hpa_2mb = ALIGN_DOWN(pfn << PAGE_SHIFT, PMD_SIZE);
+
+	return hpa_2mb | TDX_PS_2M;
+}
+
+/* Add PAMT backing for the 2MB region surrounding the given pfn. */
+static u64 tdh_phymem_pamt_add(kvm_pfn_t pfn, struct page **pamt_pages)
+{
+	struct tdx_module_args args = {
+		.rcx = pamt_2mb_arg(pfn),
+		.rdx = page_to_phys(pamt_pages[0]),
+		.r8 = page_to_phys(pamt_pages[1]),
+	};
+
+	return seamcall(TDH_PHYMEM_PAMT_ADD, &args);
+}
+
+/* Remove PAMT backing for the 2MB region surrounding the given pfn. */
+static u64 tdh_phymem_pamt_remove(kvm_pfn_t pfn, struct page **pamt_pages)
+{
+	struct tdx_module_args args = {
+		.rcx = pamt_2mb_arg(pfn),
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_PHYMEM_PAMT_REMOVE, &args);
+	if (ret)
+		return ret;
+
+	/* Copy PAMT pages out of the struct per the TDX ABI */
+	pamt_pages[0] = phys_to_page(args.rdx);
+	pamt_pages[1] = phys_to_page(args.r8);
+
+	return 0;
+}
+
+/* Allocate PAMT memory for the given page */
+static int tdx_pamt_get(kvm_pfn_t pfn)
+{
+	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];
+	u64 tdx_status;
+	int ret;
+
+	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+		return 0;
+
+	ret = alloc_pamt_array(pamt_pages);
+	if (ret)
+		return ret;
+
+	tdx_status = tdh_phymem_pamt_add(pfn, pamt_pages);
+	if (tdx_status != TDX_SUCCESS) {
+		ret = -EIO;
+		goto out_free;
+	}
+
+	return 0;
+
+out_free:
+	free_pamt_array(pamt_pages);
+
+	return ret;
+}
+
+/* Free PAMT memory for the given page */
+static void tdx_pamt_put(kvm_pfn_t pfn)
+{
+	struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT] = {};
+	u64 tdx_status;
+
+	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+		return;
+
+	tdx_status = tdh_phymem_pamt_remove(pfn, pamt_pages);
+
+	/*
+	 * Don't free pamt_pages as it could hold garbage when
+	 * tdh_phymem_pamt_remove() fails.  Don't panic/BUG_ON(), as
+	 * there is no risk of data corruption, but do yell loudly as
+	 * failure indicates a kernel bug, memory is being leaked, and
+	 * the dangling PAMT entry may cause future operations to fail.
+	 */
+	if (WARN_ON_ONCE(tdx_status != TDX_SUCCESS))
+		return;
+
+	free_pamt_array(pamt_pages);
+}
+
+/*
+ * Return a page that can be gifted to the TDX-Module for use as a "control"
+ * page, i.e. pages that are used for control structures for a given TDX
+ * guest, and thus obtain TDX protections, including PAMT tracking.
+ */
+struct page *tdx_alloc_control_page(void)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_KERNEL_ACCOUNT);
+	if (!page)
+		return NULL;
+
+	if (tdx_pamt_get(page_to_pfn(page))) {
+		__free_page(page);
+		return NULL;
+	}
+
+	return page;
+}
+EXPORT_SYMBOL_FOR_KVM(tdx_alloc_control_page);
+
+/*
+ * Free a page that was gifted to the TDX-Module for use as a control
+ * page. After this, the page is no longer protected by TDX.
+ */
+void tdx_free_control_page(struct page *page)
+{
+	if (!page)
+		return;
+
+	tdx_pamt_put(page_to_pfn(page));
+	__free_page(page);
+}
+EXPORT_SYMBOL_FOR_KVM(tdx_free_control_page);
+
 void tdx_sys_disable(void)
 {
 	struct tdx_module_args args = {};
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index bdfd0e1e337ac..a886c54decaad 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -48,6 +48,8 @@
 #define TDH_SYS_CONFIG			45
 #define TDH_SYS_SHUTDOWN		52
 #define TDH_SYS_UPDATE			53
+#define TDH_PHYMEM_PAMT_ADD		58
+#define TDH_PHYMEM_PAMT_REMOVE		59
 #define TDH_SYS_DISABLE			69
 
 /*
-- 
2.54.0



* [PATCH v7 01/11] x86/virt/tdx: Simplify PAMT layout calculation
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

For each memory region that the TDX module might use (called TDMR), three
separate traditional PAMT allocations are needed. There is one for each
supported page size (1GB, 2MB, 4KB). These store information on each page
in the TDMR. In Linux, they are allocated out of one physically contiguous
block, in order to more efficiently use some internal TDX module
bookkeeping resources. So some simple math is needed to break the single
large allocation into three smaller allocations for each page size.

There are some commonalities in the math needed to calculate the base and
size for each smaller allocation, and so an effort was made to share logic
across the three. Unfortunately, doing this turned out to be unnaturally
tortured, with a loop iterating over the three page sizes, only to call
into a function with case statements for each page size. In the future,
Dynamic PAMT will add more logic that is special to the 4KB page size,
making the benefit of the math sharing even more questionable.

Three is not a very high number, so get rid of the loop and just duplicate
the small calculation three times. In doing so, set up for future Dynamic
PAMT changes.

Since the loop that iterates over it is gone, further simplify the code by
dropping the array of intermediate size and base storage. Just store the
values to their final locations. Accept the small complication of having
to clear tdmr->pamt_4k_base in the error path, so that tdmr_do_pamt_func()
will not try to operate on the TDMR struct when attempting to free it.
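
For illustration, the unrolled layout math can be sketched in user space as
below. This is a simplified model, not the kernel code: the sketch_* names
are hypothetical, and a single entry size is passed for all three levels,
whereas the real entry sizes are enumerated per level by the TDX module.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

struct sketch_tdmr {
	uint64_t size;                        /* bytes covered by the TDMR */
	uint64_t pamt_4k_base, pamt_4k_size;
	uint64_t pamt_2m_base, pamt_2m_size;
	uint64_t pamt_1g_base, pamt_1g_size;
};

/* One entry per page of size (1 << shift), rounded up to a 4KB multiple. */
static uint64_t sketch_pamt_sz(uint64_t tdmr_size, unsigned int shift,
			       uint64_t entry_size)
{
	uint64_t sz = (tdmr_size >> shift) * entry_size;

	return (sz + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Store the three sizes in their final fields and return the total. */
static uint64_t sketch_calc_sizes(struct sketch_tdmr *t, uint64_t entry_size)
{
	assert(entry_size > 0);

	t->pamt_4k_size = sketch_pamt_sz(t->size, 12, entry_size);
	t->pamt_2m_size = sketch_pamt_sz(t->size, 21, entry_size);
	t->pamt_1g_size = sketch_pamt_sz(t->size, 30, entry_size);

	return t->pamt_4k_size + t->pamt_2m_size + t->pamt_1g_size;
}

/* Carve one contiguous block into the 4KB, 2MB and 1GB PAMTs, in order. */
static void sketch_carve(struct sketch_tdmr *t, uint64_t block_base)
{
	t->pamt_4k_base = block_base;
	t->pamt_2m_base = t->pamt_4k_base + t->pamt_4k_size;
	t->pamt_1g_base = t->pamt_2m_base + t->pamt_2m_size;
}
```

Because the bases are only written after a successful allocation, a zero
pamt_4k_base in this model marks a TDMR with nothing to free, mirroring the
error-path behavior described above.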

AI was used under supervision to collect/apply feedback, review code and
workshop logs.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
v7:
 - Remove accidental whitespace changes (Kiryl)
 - Drop stale sentence in log (Chao)
 - Better patch subject (Yan)
 - Drop Assisted-by tag and cover AI use in log (Dave)

v6:
 - Drop {} by moving a comment (Binbin)
 - Log tweaks
---
 arch/x86/virt/vmx/tdx/tdx.c | 90 ++++++++++++-------------------------
 1 file changed, 28 insertions(+), 62 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 42df8ea464c47..e77a5265c2c84 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -514,31 +514,21 @@ static __init int fill_out_tdmrs(struct list_head *tmb_list,
  * Calculate PAMT size given a TDMR and a page size.  The returned
  * PAMT size is always aligned up to 4K page boundary.
  */
-static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
-					     u16 pamt_entry_size)
+static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
 {
 	unsigned long pamt_sz, nr_pamt_entries;
+	const int tdx_pg_size_shift[TDX_PS_NR] = { PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT };
+	const u16 pamt_entry_size[TDX_PS_NR] = {
+		tdx_sysinfo.tdmr.pamt_4k_entry_size,
+		tdx_sysinfo.tdmr.pamt_2m_entry_size,
+		tdx_sysinfo.tdmr.pamt_1g_entry_size,
+	};
 
-	switch (pgsz) {
-	case TDX_PS_4K:
-		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
-		break;
-	case TDX_PS_2M:
-		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
-		break;
-	case TDX_PS_1G:
-		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		return 0;
-	}
+	nr_pamt_entries = tdmr->size >> tdx_pg_size_shift[pgsz];
+	pamt_sz = nr_pamt_entries * pamt_entry_size[pgsz];
 
-	pamt_sz = nr_pamt_entries * pamt_entry_size;
 	/* TDX requires PAMT size must be 4K aligned */
-	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
-
-	return pamt_sz;
+	return PAGE_ALIGN(pamt_sz);
 }
 
 /*
@@ -576,15 +566,11 @@ static __init int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_lis
  * within @tdmr, and set up PAMTs for @tdmr.
  */
 static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
-				   struct list_head *tmb_list,
-				   u16 pamt_entry_size[])
+				   struct list_head *tmb_list)
 {
-	unsigned long pamt_base[TDX_PS_NR];
-	unsigned long pamt_size[TDX_PS_NR];
-	unsigned long tdmr_pamt_base;
 	unsigned long tdmr_pamt_size;
 	struct page *pamt;
-	int pgsz, nid;
+	int nid;
 
 	nid = tdmr_get_nid(tdmr, tmb_list);
 
@@ -592,12 +578,10 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
 	 * Calculate the PAMT size for each TDX supported page size
 	 * and the total PAMT size.
 	 */
-	tdmr_pamt_size = 0;
-	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
-		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
-					pamt_entry_size[pgsz]);
-		tdmr_pamt_size += pamt_size[pgsz];
-	}
+	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
+	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
+	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
+	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
 
 	/*
 	 * Allocate one chunk of physically contiguous memory for all
@@ -606,25 +590,17 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
 	 */
 	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
 			nid, &node_online_map);
+
+	/*
+	 * tdmr->pamt_4k_base is still zero so the error
+	 * path of the caller will skip freeing the pamt.
+	 */
 	if (!pamt)
 		return -ENOMEM;
 
-	/*
-	 * Break the contiguous allocation back up into the
-	 * individual PAMTs for each page size.
-	 */
-	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
-	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
-		pamt_base[pgsz] = tdmr_pamt_base;
-		tdmr_pamt_base += pamt_size[pgsz];
-	}
-
-	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
-	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
-	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
-	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
-	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
-	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+	tdmr->pamt_4k_base = page_to_phys(pamt);
+	tdmr->pamt_2m_base = tdmr->pamt_4k_base + tdmr->pamt_4k_size;
+	tdmr->pamt_1g_base = tdmr->pamt_2m_base + tdmr->pamt_2m_size;
 
 	return 0;
 }
@@ -655,10 +631,7 @@ static __init void tdmr_do_pamt_func(struct tdmr_info *tdmr,
 	tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
 
 	/* Do nothing if PAMT hasn't been allocated for this TDMR */
-	if (!pamt_size)
-		return;
-
-	if (WARN_ON_ONCE(!pamt_base))
+	if (!pamt_base)
 		return;
 
 	pamt_func(pamt_base, pamt_size);
@@ -684,14 +657,12 @@ static __init void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
 
 /* Allocate and set up PAMTs for all TDMRs */
 static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
-					struct list_head *tmb_list,
-					u16 pamt_entry_size[])
+				 struct list_head *tmb_list)
 {
 	int i, ret = 0;
 
 	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
-		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
-				pamt_entry_size);
+		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list);
 		if (ret)
 			goto err;
 	}
@@ -968,18 +939,13 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
 				  struct tdmr_info_list *tdmr_list,
 				  struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
-	u16 pamt_entry_size[TDX_PS_NR] = {
-		sysinfo_tdmr->pamt_4k_entry_size,
-		sysinfo_tdmr->pamt_2m_entry_size,
-		sysinfo_tdmr->pamt_1g_entry_size,
-	};
 	int ret;
 
 	ret = fill_out_tdmrs(tmb_list, tdmr_list);
 	if (ret)
 		return ret;
 
-	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, pamt_entry_size);
+	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list);
 	if (ret)
 		return ret;
 
-- 
2.54.0



* [PATCH v7 02/11] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe, Binbin Wu
In-Reply-To: <20260718014500.2231262-1-rick.p.edgecombe@intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The TDX Physical Address Metadata Table (PAMT) holds data about the
physical memory used by TDX, and must be allocated by the kernel during
TDX module initialization.

The exact size of the required PAMT memory is determined by the TDX module
and may vary between TDX module versions. Currently it is approximately
0.4% of the system memory. This is a significant commitment, especially if
it is not known upfront whether the machine will run any TDX guests.

Each memory region that the TDX module might use needs three separate PAMT
allocations. One for each supported page size (1GB, 2MB, 4KB). The
TDX module supports a new feature designed to reduce PAMT overhead called
Dynamic PAMT. Under Dynamic PAMT the 4KB level is allocated dynamically
during runtime, while the 1GB and 2MB levels remain allocated on TDX
module initialization.

However, in the details, Dynamic PAMT still needs some smaller data for
each 4KB page (currently 1 bit per page). The TDX module exposes the
number of bits as a piece of metadata separate from the entry size used for
the 4KB static allocation in normal PAMT. Although the size is enumerated
differently, it is handed to the TDX module in the same way the 4KB page
size PAMT allocation is for normal PAMT.

Begin to implement Dynamic PAMT in the kernel by reading the bits-per-page
needed for Dynamic PAMT. Calculate the size needed for the bitmap,
and use it instead of the 4KB size determined for normal PAMT, in the case
of Dynamic PAMT.
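
As a rough user-space sketch of the bitmap sizing described above (the
sketch_* name is hypothetical and the bits-per-page value is supplied by the
caller rather than read from TDX module metadata):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT    12
#define SKETCH_PAGE_SIZE     (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_BITS_PER_BYTE 8ULL

/*
 * Size in bytes of the per-4KB-page bitmap for a TDMR of @tdmr_size bytes,
 * rounded up to a whole number of 4KB pages.
 */
static uint64_t sketch_pamt_bitmap_sz(uint64_t tdmr_size,
				      unsigned int bits_per_entry)
{
	uint64_t nr_entries, nr_bits, nr_bytes;

	assert(bits_per_entry > 0);

	nr_entries = tdmr_size >> SKETCH_PAGE_SHIFT;
	nr_bits = nr_entries * bits_per_entry;

	/* Round bits up to bytes, then bytes up to 4KB pages. */
	nr_bytes = (nr_bits + SKETCH_BITS_PER_BYTE - 1) / SKETCH_BITS_PER_BYTE;

	return (nr_bytes + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```

With 1 bit per page, a 1GB TDMR needs 32KB of bitmap in this model. If the
static 4KB PAMT used 16 bytes per page (an assumed figure, not taken from
this series), the same TDMR would need 4MB, which gives a sense of the scale
of the reduction at the 4KB level.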

The existing metadata reading code was generated by a script, but the
current plan is to stop generating this code, as the script has continued
to need adjustments. So add manually written code and adjust the comment
about it being autogenerated to be more generic. Start to adopt a more
normal kernel code style without the ternary statements and assignments
inside if conditionals that the autogenerated code has.

AI was used under supervision to collect/apply feedback, review code and
workshop logs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Vishal Annapurve <vannapurve@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
---
v7:
 - Re-order pamt size calculations for greater readability (Kiryl)
 - Move comment to its own line in tdx_supports_dynamic_pamt() (Yan,
   Sohil)
 - Log tweak (Sohil)
 - Make comment in metadata reading more appropriate (Sohil)
 - Drop Assisted-by tag and cover AI use in log (Dave)
 - Move tdx_supports_dynamic_pamt() to not static inline to reduce
   churn in later changes

v6:
 - Improve comment (Binbin)
 - Log tweaks
 - Mark tdmr_get_pamt_bitmap_sz() __init in response to upstream
   changes
 - Switch to more normal kernel code style, even though it differs from
   the existing auto generated code.
---
 arch/x86/include/asm/tdx.h                  |  2 ++
 arch/x86/include/asm/tdx_global_metadata.h  |  3 +++
 arch/x86/virt/vmx/tdx/tdx.c                 | 29 +++++++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 23 +++++++++++++++-
 4 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 89e97d5761d89..d414064436221 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -118,6 +118,8 @@ static inline bool tdx_supports_runtime_update(const struct tdx_sys_info *sysinf
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_TD_PRESERVING;
 }
 
+bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo);
+
 int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
index 41150d546589c..2a42551fc33cd 100644
--- a/arch/x86/include/asm/tdx_global_metadata.h
+++ b/arch/x86/include/asm/tdx_global_metadata.h
@@ -21,6 +21,9 @@ struct tdx_sys_info_tdmr {
 	u16 pamt_4k_entry_size;
 	u16 pamt_2m_entry_size;
 	u16 pamt_1g_entry_size;
+
+	/* Optional metadata, if Dynamic PAMT is supported */
+	u8  pamt_page_bitmap_entry_bits;
 };
 
 struct tdx_sys_info_td_ctrl {
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e77a5265c2c84..f395f1fe95093 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -510,6 +510,18 @@ static __init int fill_out_tdmrs(struct list_head *tmb_list,
 	return 0;
 }
 
+static __init unsigned long tdmr_get_pamt_bitmap_sz(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_sz, nr_pamt_entries;
+	int bits_per_entry;
+
+	bits_per_entry = tdx_sysinfo.tdmr.pamt_page_bitmap_entry_bits;
+	nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+	pamt_sz = DIV_ROUND_UP(nr_pamt_entries * bits_per_entry, BITS_PER_BYTE);
+
+	return PAGE_ALIGN(pamt_sz);
+}
+
 /*
  * Calculate PAMT size given a TDMR and a page size.  The returned
  * PAMT size is always aligned up to 4K page boundary.
@@ -578,9 +590,16 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
 	 * Calculate the PAMT size for each TDX supported page size
 	 * and the total PAMT size.
 	 */
-	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
-	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
 	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
+	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
+
+	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
+		/* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
+		tdmr->pamt_4k_size = tdmr_get_pamt_bitmap_sz(tdmr);
+	} else {
+		tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
+	}
+
 	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
 
 	/*
@@ -1969,6 +1988,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, kvm_pfn_t pfn)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
 
+bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
+{
+	/* To be enabled when kernel is ready. */
+	return false;
+}
+
 void tdx_sys_disable(void)
 {
 	struct tdx_module_args args = {};
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index e49c300f23d43..8393d2aa59dbe 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Automatically generated functions to read TDX global metadata.
+ * Functions to read TDX global metadata.
  *
  * This file doesn't compile on its own as it lacks of inclusion
  * of SEAMCALL wrapper primitive which reads global metadata.
@@ -33,6 +33,18 @@ static __init int get_tdx_sys_info_features(struct tdx_sys_info_features *sysinf
 	return ret;
 }
 
+static __init int get_tdx_sys_info_tdmr_dpamt(struct tdx_sys_info_tdmr *sysinfo_tdmr)
+{
+	int ret;
+	u64 val;
+
+	ret = read_sys_metadata_field(0x9100000100000013, &val);
+	if (!ret)
+		sysinfo_tdmr->pamt_page_bitmap_entry_bits = val;
+
+	return ret;
+}
+
 static __init int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
 	int ret = 0;
@@ -129,5 +141,14 @@ static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
 	ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
 	ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
 
+	/*
+	 * The kernel supports using TDX without Dynamic PAMT, so
+	 * avoid reporting failure if it's not supported. Don't try
+	 * to support buggy TDX modules that advertise Dynamic PAMT
+	 * but don't expose the metadata.
+	 */
+	if (!ret && tdx_supports_dynamic_pamt(sysinfo))
+		ret = get_tdx_sys_info_tdmr_dpamt(&sysinfo->tdmr);
+
 	return ret;
 }
-- 
2.54.0



* [PATCH v7 00/11] Dynamic PAMT
From: Rick Edgecombe @ 2026-07-18  1:44 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, chao.gao, yan.y.zhao, kai.huang, tony.lindgren,
	binbin.wu
  Cc: rick.p.edgecombe

Hi,

This is hopefully the last revision of the Dynamic PAMT TDX series. Thank
you to all the reviewers who helped polish off the last rough spots in
v6[0]. Sean, please consider acking the two KVM patches. Dave, please
consider taking this through tip.

Kiryl and Vishal, I left your RBs because the other changes were trivial.
Please shout if you prefer to drop them.

Background
==========
Dynamic PAMT is a TDX feature that allows saving memory by allocating some 
of its page tracking metadata dynamically, instead of statically at boot. 
These static allocations take roughly 0.4% of system memory. The savings 
are variable depending on system and TDX usage, but could be up to 100x. 
For more Dynamic PAMT background, please refer to [1]. For more analysis 
of the savings in different scenarios, see the v6 cover letter[0].

It occurred to me that since the Dynamic PAMT effort began, RAM has become
much more expensive. Consequently, this feature is even more valuable now.
It would be good to enable it for TDX users.

Changes
=======
Besides the polishing-type comments, there were two substantial ones. These
ended up getting addressed with the same small change.

Chao asked why the TDX module doesn't itself do the keyid range checks
that it requires, and then only expose the Dynamic PAMT feature0 bit when
it actually can support Dynamic PAMT. It seems the TDX module is open to
this change, but in any case, no modules exist today that have it. Since
Dynamic PAMT enablement failure will cause TDX enablement to fail, Dynamic
PAMT is made an opt-in for now by adding a kernel parameter for it. Then
the kernel side keyid checks are dropped.

The other significant comment was Dave asking whether the "x86/virt/tdx:
Optimize tdx_pamt_get/put()" patch was really needed. Later we agreed
offline to keep the patch for the sake of maintaining performance and being
kind to KVM's efforts to fault under a shared lock. However, now that the
feature requires an opt-in, it could be for limited use and not disturb
anyone upgrading kernels. Then the optimization patch actually does become
more optional. So here it is moved to the end. I think the patch is in good
shape, but if there are any doubts we can drop it from the initial
support.

Base
====
This is based on v7.2-rc3. A full branch can be found here: [2].

Testing
=======
This series was tested in the usual suite, and also with the optimization
patch removed.

[0] https://lore.kernel.org/lkml/20260526023515.288829-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/20250918232224.2202592-1-rick.p.edgecombe@intel.com/
[2] https://github.com/intel-staging/tdx/tree/dpamt_v7

Kiryl Shutsemau (9):
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
  x86/virt/tdx: Allocate refcounts for Dynamic PAMT memory
  x86/virt/tdx: Handle multiple callers in tdx_pamt_get/put()
  KVM: TDX: Allocate PAMT memory for TD and vCPU control structures
  KVM: TDX: Get/put PAMT pages when (un)mapping private memory
  x86/virt/tdx: Enable Dynamic PAMT
  Documentation/x86: Add documentation for TDX's Dynamic PAMT
  x86/virt/tdx: Optimize tdx_pamt_get/put()

Rick Edgecombe (2):
  x86/virt/tdx: Simplify PAMT layout calculation
  x86/tdx: Add APIs to support Dynamic PAMT ops from KVM's fault path

 .../admin-guide/kernel-parameters.txt         |  10 +
 Documentation/arch/x86/tdx.rst                |  28 ++
 arch/x86/include/asm/kvm-x86-ops.h            |   1 +
 arch/x86/include/asm/kvm_host.h               |   2 +
 arch/x86/include/asm/tdx.h                    |  26 +
 arch/x86/include/asm/tdx_global_metadata.h    |   3 +
 arch/x86/kvm/mmu/mmu.c                        |   4 +
 arch/x86/kvm/vmx/tdx.c                        |  98 ++--
 arch/x86/kvm/vmx/tdx.h                        |   2 +
 arch/x86/virt/vmx/tdx/tdx.c                   | 452 +++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h                   |   2 +
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c   |  23 +-
 12 files changed, 559 insertions(+), 92 deletions(-)

-- 
2.54.0
