* [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
@ 2025-05-02 13:08 Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for " Kirill A. Shutemov
` (13 more replies)
0 siblings, 14 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
applied, but rather to receive early feedback on the feature design and
enabling.
From our perspective, this feature has a lower priority than huge page
support. I will rebase this patchset on top of Yan's huge page enabling
series at a later time, as it requires additional work.
Any feedback is welcome. We are open to ideas.
=========================================================================
The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.
The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.
The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
PAMT memory is dynamically allocated as pages gain TDX protections.
It is reclaimed when TDX protections have been removed from all
pages in a contiguous 2M area.
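For a rough sense of scale (numbers derived from the percentages above,
not from the TDX module): on a 1TiB host, ~0.4% is about 4GiB reserved
at module initialization, while ~0.004% is about 40MiB, with the
PAMT_4K portion allocated on demand only for 2M ranges that actually
back TDX guests.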
TODO:
- Rebase on top of Yan's huge page support series. Demotion requires
additional handling with Dynamic PAMT;
- Get better vmalloc API from core-mm and simplify patch 02/12.
Kirill A. Shutemov (12):
x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
x86/virt/tdx: Allocate reference counters for PAMT memory
x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo
KVM: TDX: Add tdx_pamt_get()/put() helpers
KVM: TDX: Allocate PAMT memory in __tdx_td_init()
KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
KVM: TDX: Preallocate PAMT pages to be used in page fault path
KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
KVM: TDX: Reclaim PAMT memory
x86/virt/tdx: Enable Dynamic PAMT
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/include/asm/set_memory.h | 2 +
arch/x86/include/asm/tdx.h | 22 ++
arch/x86/include/asm/tdx_global_metadata.h | 1 +
arch/x86/kvm/mmu/mmu.c | 10 +
arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++-
arch/x86/kvm/vmx/main.c | 2 +
arch/x86/kvm/vmx/tdx.c | 215 ++++++++++++++++++--
arch/x86/kvm/vmx/tdx_errno.h | 1 +
arch/x86/kvm/vmx/x86_ops.h | 9 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/meminfo.c | 11 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/virt/vmx/tdx/tdx.c | 211 ++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 5 +-
arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +
virt/kvm/kvm_main.c | 1 +
18 files changed, 522 insertions(+), 29 deletions(-)
create mode 100644 arch/x86/mm/meminfo.c
--
2.47.2
^ permalink raw reply [flat|nested] 63+ messages in thread
* [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-05 10:08 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
` (12 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.
The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.
The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
With Dynamic PAMT, the kernel no longer needs to allocate PAMT_4K on
boot, but instead must allocate a page bitmap. The TDX module determines
how many bits per page need to be allocated (currently it is 1).
Allocate the bitmap if the kernel boots on a machine with Dynamic PAMT.
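As a rough example (assuming the current 1 bit per page): a hypothetical
1TiB TDMR covers 268435456 4K pages, so its bitmap takes 268435456 bits
= 32MiB (rounded up to page size), versus the ~4GiB a static PAMT_4K
with 16-byte entries would need for the same range.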
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/tdx.h | 5 +++++
arch/x86/include/asm/tdx_global_metadata.h | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 23 ++++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +++
4 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 26ffc792e673..9701876d4e16 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -125,6 +125,11 @@ int tdx_enable(void);
const char *tdx_dump_mce_info(struct mce *m);
const struct tdx_sys_info *tdx_get_sysinfo(void);
+static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
+{
+ return false; /* To be enabled when kernel is ready */
+}
+
int tdx_guest_keyid_alloc(void);
u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
index 060a2ad744bf..5eb808b23997 100644
--- a/arch/x86/include/asm/tdx_global_metadata.h
+++ b/arch/x86/include/asm/tdx_global_metadata.h
@@ -15,6 +15,7 @@ struct tdx_sys_info_tdmr {
u16 pamt_4k_entry_size;
u16 pamt_2m_entry_size;
u16 pamt_1g_entry_size;
+ u8 pamt_page_bitmap_entry_bits;
};
struct tdx_sys_info_td_ctrl {
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f5e2a937c1e7..c8bfd765e451 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -470,6 +470,18 @@ static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
return pamt_sz;
}
+static unsigned long tdmr_get_pamt_bitmap_sz(struct tdmr_info *tdmr)
+{
+ unsigned long pamt_sz, nr_pamt_entries;
+ int bits_per_entry;
+
+ bits_per_entry = tdx_sysinfo.tdmr.pamt_page_bitmap_entry_bits;
+ nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+ pamt_sz = DIV_ROUND_UP(nr_pamt_entries * bits_per_entry, BITS_PER_BYTE);
+
+ return ALIGN(pamt_sz, PAGE_SIZE);
+}
+
/*
* Locate a NUMA node which should hold the allocation of the @tdmr
* PAMT. This node will have some memory covered by the TDMR. The
@@ -522,7 +534,16 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
* and the total PAMT size.
*/
tdmr_pamt_size = 0;
- for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
+ pgsz = TDX_PS_4K;
+
+ /* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
+ if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
+ pamt_size[pgsz] = tdmr_get_pamt_bitmap_sz(tdmr);
+ tdmr_pamt_size += pamt_size[pgsz];
+ pgsz++;
+ }
+
+ for (; pgsz < TDX_PS_NR; pgsz++) {
pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
pamt_entry_size[pgsz]);
tdmr_pamt_size += pamt_size[pgsz];
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index 13ad2663488b..683925bcc9eb 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -33,6 +33,9 @@ static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
sysinfo_tdmr->pamt_2m_entry_size = val;
if (!ret && !(ret = read_sys_metadata_field(0x9100000100000012, &val)))
sysinfo_tdmr->pamt_1g_entry_size = val;
+ if (!ret && tdx_supports_dynamic_pamt(&tdx_sysinfo) &&
+ !(ret = read_sys_metadata_field(0x9100000100000013, &val)))
+ sysinfo_tdmr->pamt_page_bitmap_entry_bits = val;
return ret;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for " Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-05 11:05 ` Huang, Kai
2025-05-09 9:52 ` Chao Gao
2025-05-02 13:08 ` [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE Kirill A. Shutemov
` (11 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
The PAMT memory holds metadata for TDX-protected memory. With Dynamic
PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
with a page pair that covers 2M of host physical memory.
The kernel must provide this page pair before using pages from the range
for TDX. If this is not done, any SEAMCALL that attempts to use the
memory will fail.
Allocate reference counters for every 2M range to track PAMT memory
usage. This is necessary to accurately determine when PAMT memory needs
to be allocated and when it can be freed.
This allocation will consume 2MiB for every 1TiB of physical memory.
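The arithmetic behind that number (assuming a 4-byte atomic_t): 1TiB of
physical memory is 524288 2M ranges, and 524288 * 4 bytes = 2MiB of
reference counters.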
Tracking PAMT memory usage on the kernel side duplicates what the TDX
module does. It is possible to avoid this by lazily allocating PAMT
memory on SEAMCALL failure and freeing it based on hints provided by
the TDX module when the last user of the PAMT memory is gone.
However, this approach complicates serialization.
The TDX module takes locks when dealing with PAMT: a shared lock on any
SEAMCALL that uses explicit HPA and an exclusive lock on PAMT.ADD and
PAMT.REMOVE. Any SEAMCALL that uses explicit HPA as an operand may fail
if it races with PAMT.ADD/REMOVE.
Since PAMT is a global resource, preventing such failures would require
global locking on the kernel side (per-TD locking is not sufficient).
Alternatively, the kernel would have to retry on TDX_OPERAND_BUSY.
Both options are not ideal, and tracking PAMT usage on the kernel side
seems like a reasonable alternative.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/virt/vmx/tdx/tdx.c | 113 +++++++++++++++++++++++++++++++++++-
1 file changed, 111 insertions(+), 2 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c8bfd765e451..00e07a0c908a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -29,6 +29,7 @@
#include <linux/acpi.h>
#include <linux/suspend.h>
#include <linux/idr.h>
+#include <linux/vmalloc.h>
#include <asm/page.h>
#include <asm/special_insns.h>
#include <asm/msr-index.h>
@@ -50,6 +51,8 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
static struct tdmr_info_list tdx_tdmr_list;
+static atomic_t *pamt_refcounts;
+
static enum tdx_module_status_t tdx_module_status;
static DEFINE_MUTEX(tdx_module_lock);
@@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
return ret;
}
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
+{
+ return &pamt_refcounts[hpa / PMD_SIZE];
+}
+EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
+
+static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
+{
+ unsigned long vaddr;
+ pte_t entry;
+
+ if (!pte_none(ptep_get(pte)))
+ return 0;
+
+ vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!vaddr)
+ return -ENOMEM;
+
+ entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
+
+ spin_lock(&init_mm.page_table_lock);
+ if (pte_none(ptep_get(pte)))
+ set_pte_at(&init_mm, addr, pte, entry);
+ else
+ free_page(vaddr);
+ spin_unlock(&init_mm.page_table_lock);
+
+ return 0;
+}
+
+static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
+ void *data)
+{
+ unsigned long vaddr;
+
+ vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
+
+ spin_lock(&init_mm.page_table_lock);
+ if (!pte_none(ptep_get(pte))) {
+ pte_clear(&init_mm, addr, pte);
+ free_page(vaddr);
+ }
+ spin_unlock(&init_mm.page_table_lock);
+
+ return 0;
+}
+
+static int alloc_tdmr_pamt_refcount(struct tdmr_info *tdmr)
+{
+ unsigned long start, end;
+
+ start = (unsigned long)tdx_get_pamt_refcount(tdmr->base);
+ end = (unsigned long)tdx_get_pamt_refcount(tdmr->base + tdmr->size);
+ start = round_down(start, PAGE_SIZE);
+ end = round_up(end, PAGE_SIZE);
+
+ return apply_to_page_range(&init_mm, start, end - start,
+ pamt_refcount_populate, NULL);
+}
+
+static int init_pamt_metadata(void)
+{
+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
+ struct vm_struct *area;
+
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return 0;
+
+ /*
+ * Reserve vmalloc range for PAMT reference counters. It covers all
+ * physical address space up to max_pfn. It is going to be populated
+ * from init_tdmr() only for present memory that is available for TDX use.
+ */
+ area = get_vm_area(size, VM_IOREMAP);
+ if (!area)
+ return -ENOMEM;
+
+ pamt_refcounts = area->addr;
+ return 0;
+}
+
+static void free_pamt_metadata(void)
+{
+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
+
+ size = round_up(size, PAGE_SIZE);
+ apply_to_existing_page_range(&init_mm,
+ (unsigned long)pamt_refcounts,
+ size, pamt_refcount_depopulate,
+ NULL);
+ vfree(pamt_refcounts);
+ pamt_refcounts = NULL;
+}
+
static int init_tdmr(struct tdmr_info *tdmr)
{
u64 next;
+ int ret;
+
+ ret = alloc_tdmr_pamt_refcount(tdmr);
+ if (ret)
+ return ret;
/*
* Initializing a TDMR can be time consuming. To avoid long
@@ -1048,7 +1150,6 @@ static int init_tdmr(struct tdmr_info *tdmr)
struct tdx_module_args args = {
.rcx = tdmr->base,
};
- int ret;
ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args);
if (ret)
@@ -1134,10 +1235,15 @@ static int init_tdx_module(void)
if (ret)
goto err_reset_pamts;
+ /* Reserve vmalloc range for PAMT reference counters */
+ ret = init_pamt_metadata();
+ if (ret)
+ goto err_reset_pamts;
+
/* Initialize TDMRs to complete the TDX module initialization */
ret = init_tdmrs(&tdx_tdmr_list);
if (ret)
- goto err_reset_pamts;
+ goto err_free_pamt_metadata;
pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
@@ -1149,6 +1255,9 @@ static int init_tdx_module(void)
put_online_mems();
return ret;
+err_free_pamt_metadata:
+ free_pamt_metadata();
+
err_reset_pamts:
/*
* Part of PAMTs may already have been initialized by the
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for " Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-09 10:18 ` Chao Gao
2025-05-02 13:08 ` [RFC, PATCH 04/12] x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo Kirill A. Shutemov
` (10 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
On a system with Dynamic PAMT enabled, the kernel must allocate memory
for PAMT_4K as needed and reclaim it when it is no longer in use.
The TDX module requires space to store 16 bytes of metadata per 4K page,
or 8K for every 2M range of physical memory. The TDX module takes this
8K of memory as a pair of 4K pages. These pages do not need to be
contiguous.
The number of pages needed to cover a 2M range can grow if the size of
a PAMT entry increases. tdx_nr_pamt_pages() reports the needed number
of pages.
TDH.PHYMEM.PAMT.ADD populates PAMT_4K for a given HPA. The kernel must
provide addresses for two pages, covering a 2M range starting from HPA.
TDH.PHYMEM.PAMT.REMOVE withdraws PAMT_4K memory for a given HPA,
returning the addresses of the pages used for PAMT_4K before the call.
Add wrappers for these SEAMCALLs.
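With the current 16-byte PAMT entry, tdx_nr_pamt_pages() works out to
16 * PTRS_PER_PTE(512) / PAGE_SIZE(4096) = 2 pages per 2M range; a
larger PAMT entry size would increase this number.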
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/tdx.h | 9 ++++++++
arch/x86/virt/vmx/tdx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
3 files changed, 56 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9701876d4e16..a134cf3ecd17 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -130,6 +130,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
return false; /* To be enabled when kernel is ready */
}
+static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
+{
+ return sysinfo->tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
+}
+
int tdx_guest_keyid_alloc(void);
u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
@@ -197,6 +202,9 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages);
+u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages);
+
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -204,6 +212,7 @@ static inline int tdx_enable(void) { return -ENODEV; }
static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo) { return 0; }
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 00e07a0c908a..29defdb7f6bc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1999,3 +1999,48 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages)
+{
+ struct tdx_module_args args = {
+ .rcx = hpa,
+ };
+ struct page *page;
+ u64 *p;
+
+ WARN_ON_ONCE(!IS_ALIGNED(hpa & PAGE_MASK, PMD_SIZE));
+
+ p = &args.rdx;
+ list_for_each_entry(page, pamt_pages, lru) {
+ *p = page_to_phys(page);
+ p++;
+ }
+
+ return seamcall(TDH_PHYMEM_PAMT_ADD, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_pamt_add);
+
+u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages)
+{
+ struct tdx_module_args args = {
+ .rcx = hpa,
+ };
+ struct page *page;
+ u64 *p, ret;
+
+ WARN_ON_ONCE(!IS_ALIGNED(hpa & PAGE_MASK, PMD_SIZE));
+
+ ret = seamcall_ret(TDH_PHYMEM_PAMT_REMOVE, &args);
+ if (ret)
+ return ret;
+
+ p = &args.rdx;
+ for (int i = 0; i < tdx_nr_pamt_pages(&tdx_sysinfo); i++) {
+ page = phys_to_page(*p);
+ list_add(&page->lru, pamt_pages);
+ p++;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_pamt_remove);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 82bb82be8567..46c4214b79fb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -46,6 +46,8 @@
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_VP_WR 43
#define TDH_SYS_CONFIG 45
+#define TDH_PHYMEM_PAMT_ADD 58
+#define TDH_PHYMEM_PAMT_REMOVE 59
/*
* SEAMCALL leaf:
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 04/12] x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (2 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
` (9 subsequent siblings)
13 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
PAMT memory can add up to a substantial portion of system memory.
Account these pages and report them in /proc/meminfo as "TDX".
When no TDs are running, PAMT memory consumption is expected to be zero.
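For illustration, the resulting /proc/meminfo entry would look like the
following (value hypothetical: 1024 mapped 2M ranges at 8kB of PAMT each):
TDX:     8192 kB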
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/set_memory.h | 2 ++
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/meminfo.c | 11 +++++++++++
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/virt/vmx/tdx/tdx.c | 26 ++++++++++++++++++++++++--
6 files changed, 42 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/mm/meminfo.c
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 8d9f1c9aaa4c..e729e9f86e67 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -90,6 +90,8 @@ int set_direct_map_default_noflush(struct page *page);
int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
bool kernel_page_present(struct page *page);
+void direct_pages_meminfo(struct seq_file *m);
+
extern int kernel_set_to_readonly;
#endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a134cf3ecd17..8091bf5b43cc 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -205,6 +205,7 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages);
u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages);
+void tdx_meminfo(struct seq_file *m);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -213,6 +214,7 @@ static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo) { return 0; }
+static inline void tdx_meminfo(struct seq_file *m) {}
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 32035d5be5a0..311d60801871 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -38,6 +38,8 @@ CFLAGS_fault.o := -I $(src)/../include/asm/trace
obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o
+obj-$(CONFIG_PROC_FS) += meminfo.o
+
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP) += dump_pagetables.o
obj-$(CONFIG_PTDUMP_DEBUGFS) += debug_pagetables.o
diff --git a/arch/x86/mm/meminfo.c b/arch/x86/mm/meminfo.c
new file mode 100644
index 000000000000..7bdb5df014de
--- /dev/null
+++ b/arch/x86/mm/meminfo.c
@@ -0,0 +1,11 @@
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+
+#include <asm/set_memory.h>
+#include <asm/tdx.h>
+
+void arch_report_meminfo(struct seq_file *m)
+{
+ direct_pages_meminfo(m);
+ tdx_meminfo(m);
+}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index def3d9284254..59432b92e80e 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -118,7 +118,7 @@ static void collapse_page_count(int level)
direct_pages_count[level - 1] -= PTRS_PER_PTE;
}
-void arch_report_meminfo(struct seq_file *m)
+void direct_pages_meminfo(struct seq_file *m)
{
seq_printf(m, "DirectMap4k: %8lu kB\n",
direct_pages_count[PG_LEVEL_4K] << 2);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 29defdb7f6bc..74bd81acef7b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2000,13 +2000,28 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+static atomic_long_t tdx_pamt_count = ATOMIC_LONG_INIT(0);
+
+void tdx_meminfo(struct seq_file *m)
+{
+ unsigned long usage;
+
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST_PLATFORM))
+ return;
+
+ usage = atomic_long_read(&tdx_pamt_count) *
+ tdx_nr_pamt_pages(&tdx_sysinfo) * PAGE_SIZE / SZ_1K;
+
+ seq_printf(m, "TDX: %8lu kB\n", usage);
+}
+
u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages)
{
struct tdx_module_args args = {
.rcx = hpa,
};
struct page *page;
- u64 *p;
+ u64 *p, ret;
WARN_ON_ONCE(!IS_ALIGNED(hpa & PAGE_MASK, PMD_SIZE));
@@ -2016,7 +2031,12 @@ u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages)
p++;
}
- return seamcall(TDH_PHYMEM_PAMT_ADD, &args);
+ ret = seamcall(TDH_PHYMEM_PAMT_ADD, &args);
+
+ if (!ret)
+ atomic_long_inc(&tdx_pamt_count);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(tdh_phymem_pamt_add);
@@ -2034,6 +2054,8 @@ u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages)
if (ret)
return ret;
+ atomic_long_dec(&tdx_pamt_count);
+
p = &args.rdx;
for (int i = 0; i < tdx_nr_pamt_pages(&tdx_sysinfo); i++) {
page = phys_to_page(*p);
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (3 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 04/12] x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-05 12:44 ` Huang, Kai
` (2 more replies)
2025-05-02 13:08 ` [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init() Kirill A. Shutemov
` (8 subsequent siblings)
13 siblings, 3 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
Introduce a pair of helpers to allocate and free PAMT memory for a given
2M range. The range is identified by a struct page for any memory within
it; the PAMT memory itself is represented by a list of pages.
Use per-2M refcounts to detect when PAMT memory has to be allocated and
when it can be freed.
The pamt_lock spinlock serializes races between multiple tdx_pamt_add()
calls as well as tdx_pamt_add() against tdx_pamt_put().
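A minimal usage sketch of the pair (this is essentially what
tdx_alloc_page()/tdx_free_page() in patch 6 do; error handling trimmed):
	struct page *page = alloc_page(GFP_KERNEL);
	if (page && tdx_pamt_get(page)) {
		/* PAMT_4K could not be populated for the 2M range */
		__free_page(page);
		page = NULL;
	}
	/* ... page is used as TDX private memory ... */
	if (page) {
		tdx_pamt_put(page);	/* frees the PAMT pages if this was the last user */
		__free_page(page);
	}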
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kvm/vmx/tdx.c | 123 +++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx_errno.h | 1 +
3 files changed, 126 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8091bf5b43cc..42449c054938 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -135,6 +135,8 @@ static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
return sysinfo->tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
}
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa);
+
int tdx_guest_keyid_alloc(void);
u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..ea7e2d93fb44 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -207,6 +207,10 @@ static bool tdx_operand_busy(u64 err)
return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
}
+static bool tdx_hpa_range_not_free(u64 err)
+{
+ return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_HPA_RANGE_NOT_FREE;
+}
/*
* A per-CPU list of TD vCPUs associated with a given CPU.
@@ -276,6 +280,125 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
}
+static DEFINE_SPINLOCK(pamt_lock);
+
+static void tdx_free_pamt_pages(struct list_head *pamt_pages)
+{
+ struct page *page;
+
+ while ((page = list_first_entry_or_null(pamt_pages, struct page, lru))) {
+ list_del(&page->lru);
+ __free_page(page);
+ }
+}
+
+static int tdx_alloc_pamt_pages(struct list_head *pamt_pages)
+{
+ for (int i = 0; i < tdx_nr_pamt_pages(tdx_sysinfo); i++) {
+ struct page *page = alloc_page(GFP_KERNEL);
+ if (!page)
+ goto fail;
+ list_add(&page->lru, pamt_pages);
+ }
+ return 0;
+fail:
+ tdx_free_pamt_pages(pamt_pages);
+ return -ENOMEM;
+}
+
+static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
+ struct list_head *pamt_pages)
+{
+ u64 err;
+
+ hpa = ALIGN_DOWN(hpa, SZ_2M);
+
+ spin_lock(&pamt_lock);
+
+ /* Lost race to other tdx_pamt_add() */
+ if (atomic_read(pamt_refcount) != 0) {
+ atomic_inc(pamt_refcount);
+ spin_unlock(&pamt_lock);
+ tdx_free_pamt_pages(pamt_pages);
+ return 0;
+ }
+
+ err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pages);
+
+ if (err)
+ tdx_free_pamt_pages(pamt_pages);
+
+ /*
+ * tdx_hpa_range_not_free() is true if current task won race
+ * against tdx_pamt_put().
+ */
+ if (err && !tdx_hpa_range_not_free(err)) {
+ spin_unlock(&pamt_lock);
+ pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
+ return -EIO;
+ }
+
+ atomic_set(pamt_refcount, 1);
+ spin_unlock(&pamt_lock);
+ return 0;
+}
+
+static int tdx_pamt_get(struct page *page)
+{
+ unsigned long hpa = page_to_phys(page);
+ atomic_t *pamt_refcount;
+ LIST_HEAD(pamt_pages);
+
+ if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
+ return 0;
+
+ pamt_refcount = tdx_get_pamt_refcount(hpa);
+ WARN_ON_ONCE(atomic_read(pamt_refcount) < 0);
+
+ if (atomic_inc_not_zero(pamt_refcount))
+ return 0;
+
+ if (tdx_alloc_pamt_pages(&pamt_pages))
+ return -ENOMEM;
+
+ return tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);
+}
+
+static void tdx_pamt_put(struct page *page)
+{
+ unsigned long hpa = page_to_phys(page);
+ atomic_t *pamt_refcount;
+ LIST_HEAD(pamt_pages);
+ u64 err;
+
+ if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
+ return;
+
+ hpa = ALIGN_DOWN(hpa, SZ_2M);
+
+ pamt_refcount = tdx_get_pamt_refcount(hpa);
+ if (!atomic_dec_and_test(pamt_refcount))
+ return;
+
+ spin_lock(&pamt_lock);
+
+ /* Lost race against tdx_pamt_add()? */
+ if (atomic_read(pamt_refcount) != 0) {
+ spin_unlock(&pamt_lock);
+ return;
+ }
+
+ err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
+ spin_unlock(&pamt_lock);
+
+ if (err) {
+ pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
+ return;
+ }
+
+ tdx_free_pamt_pages(&pamt_pages);
+}
+
static void tdx_clear_page(struct page *page)
{
const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 6ff4672c4181..c8a471d6b991 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -18,6 +18,7 @@
#define TDX_OPERAND_BUSY 0x8000020000000000ULL
#define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
#define TDX_PAGE_METADATA_INCORRECT 0xC000030000000000ULL
+#define TDX_HPA_RANGE_NOT_FREE 0xC000030400000000ULL
#define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
#define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
#define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init()
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (4 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-05 12:46 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 07/12] KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init() Kirill A. Shutemov
` (7 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
Allocate PAMT memory for TDH.MNG.CREATE and TDH.MNG.ADDCX.
PAMT memory that is associated with pages successfully added to the TD
with TDH.MNG.ADDCX will be removed in tdx_reclaim_page(), which is
called from tdx_reclaim_control_page().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/kvm/vmx/tdx.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ea7e2d93fb44..59bbae2df485 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -399,6 +399,31 @@ static void tdx_pamt_put(struct page *page)
tdx_free_pamt_pages(&pamt_pages);
}
+static struct page *tdx_alloc_page(void)
+{
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return NULL;
+
+ if (tdx_pamt_get(page)) {
+ __free_page(page);
+ return NULL;
+ }
+
+ return page;
+}
+
+static void tdx_free_page(struct page *page)
+{
+ if (!page)
+ return;
+
+ tdx_pamt_put(page);
+ __free_page(page);
+}
+
static void tdx_clear_page(struct page *page)
{
const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
@@ -2499,7 +2524,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
atomic_inc(&nr_configured_hkid);
- tdr_page = alloc_page(GFP_KERNEL);
+ tdr_page = tdx_alloc_page();
if (!tdr_page)
goto free_hkid;
@@ -2512,7 +2537,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
goto free_tdr;
for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
- tdcs_pages[i] = alloc_page(GFP_KERNEL);
+ tdcs_pages[i] = tdx_alloc_page();
if (!tdcs_pages[i])
goto free_tdcs;
}
@@ -2633,10 +2658,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
teardown:
/* Only free pages not yet added, so start at 'i' */
for (; i < kvm_tdx->td.tdcs_nr_pages; i++) {
- if (tdcs_pages[i]) {
- __free_page(tdcs_pages[i]);
- tdcs_pages[i] = NULL;
- }
+ tdx_free_page(tdcs_pages[i]);
+ tdcs_pages[i] = NULL;
}
if (!kvm_tdx->td.tdcs_pages)
kfree(tdcs_pages);
@@ -2652,15 +2675,13 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
free_tdcs:
for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
- if (tdcs_pages[i])
- __free_page(tdcs_pages[i]);
+ tdx_free_page(tdcs_pages[i]);
}
kfree(tdcs_pages);
kvm_tdx->td.tdcs_pages = NULL;
free_tdr:
- if (tdr_page)
- __free_page(tdr_page);
+ tdx_free_page(tdr_page);
kvm_tdx->td.tdr_page = 0;
free_hkid:
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 07/12] KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (5 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init() Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops Kirill A. Shutemov
` (6 subsequent siblings)
13 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
Allocate PAMT memory for TDH.VP.CREATE and TDH.VP.ADDCX.
PAMT memory that is associated with pages successfully added to the TD
with TDH.VP.ADDCX will be removed in tdx_reclaim_page(), which is
called from tdx_reclaim_control_page().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/kvm/vmx/tdx.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 59bbae2df485..18c4ae00cd8d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2983,7 +2983,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
int ret, i;
u64 err;
- page = alloc_page(GFP_KERNEL);
+ page = tdx_alloc_page();
if (!page)
return -ENOMEM;
tdx->vp.tdvpr_page = page;
@@ -2996,7 +2996,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
}
for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
- page = alloc_page(GFP_KERNEL);
+ page = tdx_alloc_page();
if (!page) {
ret = -ENOMEM;
goto free_tdcx;
@@ -3020,7 +3020,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
* method, but the rest are freed here.
*/
for (; i < kvm_tdx->td.tdcx_nr_pages; i++) {
- __free_page(tdx->vp.tdcx_pages[i]);
+ tdx_free_page(tdx->vp.tdcx_pages[i]);
tdx->vp.tdcx_pages[i] = NULL;
}
return -EIO;
@@ -3039,16 +3039,15 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
free_tdcx:
for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
- if (tdx->vp.tdcx_pages[i])
- __free_page(tdx->vp.tdcx_pages[i]);
+ tdx_free_page(tdx->vp.tdcx_pages[i]);
tdx->vp.tdcx_pages[i] = NULL;
}
kfree(tdx->vp.tdcx_pages);
tdx->vp.tdcx_pages = NULL;
free_tdvpr:
- if (tdx->vp.tdvpr_page)
- __free_page(tdx->vp.tdvpr_page);
+ tdx_free_page(tdx->vp.tdvpr_page);
+
tdx->vp.tdvpr_page = 0;
return ret;
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (6 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 07/12] KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init() Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-06 11:55 ` Yan Zhao
2025-05-14 6:15 ` Chao Gao
2025-05-02 13:08 ` [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
` (5 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
The functions kvm_x86_ops::link_external_spt() and
kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
When using TDX with Dynamic PAMT enabled, the assigned memory must be
covered by PAMT.
The new function kvm_x86_ops::phys_prepare() is called before
link_external_spt() and set_external_spte() to ensure that the memory is
ready to be assigned to the virtual machine. In the case of TDX, it
makes sure that the memory is covered by PAMT.
kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
is available, allowing the implementation to allocate memory from a
per-VCPU pool.
The function kvm_x86_ops::phys_cleanup() frees PAMT memory in case of
failure.
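The resulting order on the mirror SPTE path is roughly (sketch, matching
the tdp_mmu_install_spte() hunk below):
	ret = phys_prepare(vcpu, pfn);		/* ensure PAMT covers pfn */
	if (!ret) {
		ret = tdp_mmu_set_spte_atomic(...);	/* propagates to set_external_spte() */
		if (ret)
			phys_cleanup(pfn);	/* drop PAMT allocation on failure */
	}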
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++++++++---
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 79406bf07a1c..37081d04e82f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -99,6 +99,8 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(phys_prepare)
+KVM_X86_OP_OPTIONAL(phys_cleanup)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6c06f3d6e081..91958c55f918 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1813,6 +1813,9 @@ struct kvm_x86_ops {
int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn_for_gfn);
+ int (*phys_prepare)(struct kvm_vcpu *vcpu, kvm_pfn_t pfn);
+ void (*phys_cleanup)(kvm_pfn_t pfn);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 405874f4d088..f6c836b2e6fc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1137,6 +1137,26 @@ void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
}
}
+static int tdp_mmu_install_spte(struct kvm_vcpu *vcpu,
+ struct tdp_iter *iter,
+ u64 spte)
+{
+ kvm_pfn_t pfn = 0;
+ int ret = 0;
+
+ if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(spte)) {
+ pfn = spte_to_pfn(spte);
+ ret = static_call(kvm_x86_phys_prepare)(vcpu, pfn);
+ }
+ if (ret)
+ return ret;
+ ret = tdp_mmu_set_spte_atomic(vcpu->kvm, iter, spte);
+ if (pfn && ret)
+ static_call(kvm_x86_phys_cleanup)(pfn);
+
+ return ret;
+}
+
/*
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)
@@ -1170,7 +1190,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
- else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+ else if (tdp_mmu_install_spte(vcpu, iter, new_spte))
return RET_PF_RETRY;
else if (is_shadow_present_pte(iter->old_spte) &&
(!is_last_spte(iter->old_spte, iter->level) ||
@@ -1211,7 +1231,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
* Returns: 0 if the new page table was installed. Non-0 if the page table
* could not be installed (e.g. the atomic compare-exchange failed).
*/
-static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
+static int __tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool shared)
{
u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled);
@@ -1230,6 +1250,25 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
return 0;
}
+static int tdp_mmu_link_sp(struct kvm_vcpu *vcpu, struct tdp_iter *iter,
+ struct kvm_mmu_page *sp, bool shared)
+{
+ kvm_pfn_t pfn = 0;
+ int ret = 0;
+
+ if (sp->external_spt) {
+ pfn = __pa(sp->external_spt) >> PAGE_SHIFT;
+ ret = static_call(kvm_x86_phys_prepare)(vcpu, pfn);
+ if (ret)
+ return ret;
+ }
+ ret = __tdp_mmu_link_sp(vcpu->kvm, iter, sp, shared);
+ if (pfn && ret)
+ static_call(kvm_x86_phys_cleanup)(pfn);
+
+ return ret;
+}
+
static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool shared);
@@ -1288,7 +1327,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
} else {
- r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+ r = tdp_mmu_link_sp(vcpu, &iter, sp, true);
}
/*
@@ -1514,7 +1553,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* correctness standpoint since the translation will be the same either
* way.
*/
- ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
+ ret = __tdp_mmu_link_sp(kvm, iter, sp, shared);
if (ret)
goto out;
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (7 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-14 0:07 ` Huang, Kai
2025-05-14 6:30 ` Chao Gao
2025-05-02 13:08 ` [RFC, PATCH 10/12] KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops Kirill A. Shutemov
` (4 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
Preallocate pages to be used in the link_external_spt() and
set_external_spte() paths.
In the worst-case scenario, handling a page fault might require
tdx_nr_pamt_pages() pages for each page table level.
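With the current 2 PAMT pages per 2M range (see patch 3) and
PT64_ROOT_MAX_LEVEL of 5, that means topping up the cache with up to 10
pages per fault; the pages are only consumed for 2M ranges that gain
their first TDX user.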
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
2 files changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91958c55f918..a5661499a176 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -849,6 +849,8 @@ struct kvm_vcpu_arch {
*/
struct kvm_mmu_memory_cache mmu_external_spt_cache;
+ struct kvm_mmu_memory_cache pamt_page_cache;
+
/*
* QEMU userspace and the guest each have their own FPU state.
* In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..7bfa0dc50440 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -616,6 +616,15 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;
}
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM) {
+ int nr = tdx_nr_pamt_pages(tdx_get_sysinfo());
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.pamt_page_cache,
+ nr * PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+
return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
PT64_ROOT_MAX_LEVEL);
}
@@ -626,6 +635,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.pamt_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 10/12] KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (8 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory Kirill A. Shutemov
` (3 subsequent siblings)
13 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
Allocate PAMT memory from a per-VCPU pool in kvm_x86_ops::phys_prepare()
and release memory in kvm_x86_ops::phys_cleanup().
The TDP code invokes these callbacks to handle PAMT memory management.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/kvm/vmx/main.c | 2 ++
arch/x86/kvm/vmx/tdx.c | 30 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 9 +++++++++
virt/kvm/kvm_main.c | 1 +
4 files changed, 42 insertions(+)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 94d5d907d37b..665a3dbd4ba5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -63,6 +63,8 @@ static __init int vt_hardware_setup(void)
vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
+ vt_x86_ops.phys_prepare = tdx_phys_prepare;
+ vt_x86_ops.phys_cleanup = tdx_phys_cleanup;
}
return 0;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 18c4ae00cd8d..0f06ae7ff6b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1958,6 +1958,36 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
return tdx_sept_drop_private_spte(kvm, gfn, level, page);
}
+int tdx_phys_prepare(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
+{
+ unsigned long hpa = pfn << PAGE_SHIFT;
+ atomic_t *pamt_refcount;
+ LIST_HEAD(pamt_pages);
+
+ if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
+ return 0;
+
+ pamt_refcount = tdx_get_pamt_refcount(hpa);
+ if (atomic_inc_not_zero(pamt_refcount))
+ return 0;
+
+ for (int i = 0; i < tdx_nr_pamt_pages(tdx_sysinfo); i++) {
+ struct page *page;
+ void *p;
+
+ p = kvm_mmu_memory_cache_alloc(&vcpu->arch.pamt_page_cache);
+ page = virt_to_page(p);
+ list_add(&page->lru, &pamt_pages);
+ }
+
+ return tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);
+}
+
+void tdx_phys_cleanup(kvm_pfn_t pfn)
+{
+ tdx_pamt_put(pfn_to_page(pfn));
+}
+
void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector)
{
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 6bf8be570b2e..111f16c3039f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -158,6 +158,8 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn);
int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn);
+int tdx_phys_prepare(struct kvm_vcpu *vcpu, kvm_pfn_t pfn);
+void tdx_phys_cleanup(kvm_pfn_t pfn);
void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
@@ -224,6 +226,13 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
return -EOPNOTSUPP;
}
+static inline int tdx_phys_prepare(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void tdx_phys_cleanup(kvm_pfn_t pfn) {}
+
static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 69782df3617f..c3ba3ca37940 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -436,6 +436,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
BUG_ON(!p);
return p;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_memory_cache_alloc);
#endif
static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (9 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 10/12] KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-14 1:11 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 12/12] x86/virt/tdx: Enable Dynamic PAMT Kirill A. Shutemov
` (2 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
The PAMT memory holds metadata for TDX-protected memory. With Dynamic
PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
with a few pages that cover 2M of host physical memory.
PAMT memory can be reclaimed when the last user is gone. It can happen
in a few code paths:
- On TDH.PHYMEM.PAGE.RECLAIM in tdx_reclaim_td_control_pages() and
tdx_reclaim_page().
- On TDH.MEM.PAGE.REMOVE in tdx_sept_drop_private_spte().
- In tdx_sept_zap_private_spte() for pages that were in the queue to be
added with TDH.MEM.PAGE.ADD, but it never happened due to an error.
Add tdx_pamt_put() in these code paths.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/kvm/vmx/tdx.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0f06ae7ff6b9..352f7b41f611 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -487,8 +487,11 @@ static int tdx_reclaim_page(struct page *page)
int r;
r = __tdx_reclaim_page(page);
- if (!r)
+ if (!r) {
tdx_clear_page(page);
+ tdx_pamt_put(page);
+ }
+
return r;
}
@@ -737,6 +740,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
return;
}
tdx_clear_page(kvm_tdx->td.tdr_page);
+ tdx_pamt_put(kvm_tdx->td.tdr_page);
__free_page(kvm_tdx->td.tdr_page);
kvm_tdx->td.tdr_page = NULL;
@@ -1768,6 +1772,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
}
tdx_clear_page(page);
+ tdx_pamt_put(page);
tdx_unpin(kvm, page);
return 0;
}
@@ -1848,6 +1853,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
atomic64_dec(&kvm_tdx->nr_premapped);
+ tdx_pamt_put(page);
tdx_unpin(kvm, page);
return 0;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC, PATCH 12/12] x86/virt/tdx: Enable Dynamic PAMT
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (10 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory Kirill A. Shutemov
@ 2025-05-02 13:08 ` Kirill A. Shutemov
2025-05-14 13:41 ` [RFC, PATCH 00/12] TDX: " Sean Christopherson
2025-05-14 20:33 ` Zhi Wang
13 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-02 13:08 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao, tglx,
mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel,
Kirill A. Shutemov
The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.
The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.
The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
All pieces are in place. Enable Dynamic PAMT if it is supported.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/include/asm/tdx.h | 6 +++++-
arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
arch/x86/virt/vmx/tdx/tdx.h | 3 ---
3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 42449c054938..5744f98d193e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -32,6 +32,10 @@
#define TDX_SUCCESS 0ULL
#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
+/* Bit definitions of TDX_FEATURES0 metadata field */
+#define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
+#define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
+
#ifndef __ASSEMBLER__
#include <uapi/asm/mce.h>
@@ -127,7 +131,7 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
{
- return false; /* To be enabled when kernel is ready */
+ return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
}
static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 74bd81acef7b..f35566c0588d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -945,6 +945,8 @@ static int construct_tdmrs(struct list_head *tmb_list,
return ret;
}
+#define TDX_SYS_CONFIG_DYNAMIC_PAMT BIT(16)
+
static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
{
struct tdx_module_args args = {};
@@ -972,6 +974,12 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
args.rcx = __pa(tdmr_pa_array);
args.rdx = tdmr_list->nr_consumed_tdmrs;
args.r8 = global_keyid;
+
+ if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
+ pr_info("Enable Dynamic PAMT\n");
+ args.r8 |= TDX_SYS_CONFIG_DYNAMIC_PAMT;
+ }
+
ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
/* Free the array as it is not required anymore. */
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 46c4214b79fb..096c78a1d438 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -86,9 +86,6 @@ struct tdmr_info {
DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
} __packed __aligned(TDMR_INFO_ALIGNMENT);
-/* Bit definitions of TDX_FEATURES0 metadata field */
-#define TDX_FEATURES0_NO_RBP_MOD BIT(18)
-
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
--
2.47.2
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
2025-05-02 13:08 ` [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for " Kirill A. Shutemov
@ 2025-05-05 10:08 ` Huang, Kai
0 siblings, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-05 10:08 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> --- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> +++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> @@ -33,6 +33,9 @@ static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
> sysinfo_tdmr->pamt_2m_entry_size = val;
> if (!ret && !(ret = read_sys_metadata_field(0x9100000100000012, &val)))
> sysinfo_tdmr->pamt_1g_entry_size = val;
> + if (!ret && tdx_supports_dynamic_pamt(&tdx_sysinfo) &&
> + !(ret = read_sys_metadata_field(0x9100000100000013, &val)))
> + sysinfo_tdmr->pamt_page_bitmap_entry_bits = val;
>
Currently the global metadata reading code is auto-generated by a script, which
is not upstream yet. The script also doesn't support generating code that reads
a field only when it is "enumerated by" some TDX feature.
I'll try to upstream the script and add the "enumerated by" support to it.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-02 13:08 ` [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
@ 2025-05-05 11:05 ` Huang, Kai
2025-05-08 13:03 ` kirill.shutemov
2025-05-09 9:52 ` Chao Gao
1 sibling, 1 reply; 63+ messages in thread
From: Huang, Kai @ 2025-05-05 11:05 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
> +static atomic_t *pamt_refcounts;
> +
> static enum tdx_module_status_t tdx_module_status;
> static DEFINE_MUTEX(tdx_module_lock);
>
> @@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
> return ret;
> }
>
> +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> +{
> + return &pamt_refcounts[hpa / PMD_SIZE];
> +}
> +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
It's not quite clear why this function needs to be exported in this patch. IMO
it's better to move the export to the patch which actually needs it.
Looking at patch 5, tdx_pamt_get()/put() use it, and they are in KVM code. But
I think we should just put them here in this file. tdx_alloc_page() and
tdx_free_page() should be in this file too.
And instead of exporting tdx_get_pamt_refcount(), the TDX core code here can
export tdx_alloc_page() and tdx_free_page(), providing two high level helpers to
allow the TDX users (e.g., KVM) to allocate/free TDX private pages. How PAMT
pages are allocated is then hidden in the core TDX code.
> +
> +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> +{
> + unsigned long vaddr;
> + pte_t entry;
> +
> + if (!pte_none(ptep_get(pte)))
> + return 0;
> +
> + vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> + if (!vaddr)
> + return -ENOMEM;
> +
> + entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> +
> + spin_lock(&init_mm.page_table_lock);
> + if (pte_none(ptep_get(pte)))
> + set_pte_at(&init_mm, addr, pte, entry);
> + else
> + free_page(vaddr);
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}
> +
> +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> + void *data)
> +{
> + unsigned long vaddr;
> +
> + vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
> +
> + spin_lock(&init_mm.page_table_lock);
> + if (!pte_none(ptep_get(pte))) {
> + pte_clear(&init_mm, addr, pte);
> + free_page(vaddr);
> + }
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}
> +
> +static int alloc_tdmr_pamt_refcount(struct tdmr_info *tdmr)
> +{
> + unsigned long start, end;
> +
> + start = (unsigned long)tdx_get_pamt_refcount(tdmr->base);
> + end = (unsigned long)tdx_get_pamt_refcount(tdmr->base + tdmr->size);
> + start = round_down(start, PAGE_SIZE);
> + end = round_up(end, PAGE_SIZE);
> +
> + return apply_to_page_range(&init_mm, start, end - start,
> + pamt_refcount_populate, NULL);
> +}
IIUC, populating the refcounts based on TDMRs will slightly waste memory. The
reason is that we don't need to populate the refcount for a 2M range that is
completely marked as reserved in the TDMR, because it's not possible for the
kernel to use such a range for TDX.
Populating based on the list of TDX memory blocks should be better. In
practice, the difference should be unnoticeable, but conceptually, using TDX
memory blocks is better.
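Roughly something like the below, as a sketch only (reusing the tmb_list /
struct tdx_memblock that tdx.c already maintains; the function name is made
up):

	/* Populate refcounts per TDX memory block instead of per TDMR. */
	static int alloc_pamt_refcounts(struct list_head *tmb_list)
	{
		struct tdx_memblock *tmb;
		int ret;

		list_for_each_entry(tmb, tmb_list, list) {
			unsigned long start, end;

			start = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(tmb->start_pfn));
			end = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(tmb->end_pfn));
			start = round_down(start, PAGE_SIZE);
			end = round_up(end, PAGE_SIZE);

			ret = apply_to_page_range(&init_mm, start, end - start,
						  pamt_refcount_populate, NULL);
			if (ret)
				return ret;
		}

		return 0;
	}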
> +
> +static int init_pamt_metadata(void)
> +{
> + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> + struct vm_struct *area;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + /*
> + * Reserve vmalloc range for PAMT reference counters. It covers all
> + * physical address space up to max_pfn. It is going to be populated
> + * from init_tdmr() only for present memory that available for TDX use.
> + */
> + area = get_vm_area(size, VM_IOREMAP);
> + if (!area)
> + return -ENOMEM;
> +
> + pamt_refcounts = area->addr;
> + return 0;
> +}
> +
> +static void free_pamt_metadata(void)
> +{
> + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> +
> + size = round_up(size, PAGE_SIZE);
> + apply_to_existing_page_range(&init_mm,
> + (unsigned long)pamt_refcounts,
> + size, pamt_refcount_depopulate,
> + NULL);
> + vfree(pamt_refcounts);
> + pamt_refcounts = NULL;
> +}
> +
> static int init_tdmr(struct tdmr_info *tdmr)
> {
> u64 next;
> + int ret;
> +
> + ret = alloc_tdmr_pamt_refcount(tdmr);
> + if (ret)
> + return ret;
>
> /*
> * Initializing a TDMR can be time consuming. To avoid long
> @@ -1048,7 +1150,6 @@ static int init_tdmr(struct tdmr_info *tdmr)
> struct tdx_module_args args = {
> .rcx = tdmr->base,
> };
> - int ret;
>
> ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args);
> if (ret)
> @@ -1134,10 +1235,15 @@ static int init_tdx_module(void)
> if (ret)
> goto err_reset_pamts;
>
> + /* Reserve vmalloc range for PAMT reference counters */
> + ret = init_pamt_metadata();
> + if (ret)
> + goto err_reset_pamts;
> +
> /* Initialize TDMRs to complete the TDX module initialization */
> ret = init_tdmrs(&tdx_tdmr_list);
> if (ret)
> - goto err_reset_pamts;
> + goto err_free_pamt_metadata;
>
> pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
>
> @@ -1149,6 +1255,9 @@ static int init_tdx_module(void)
> put_online_mems();
> return ret;
>
> +err_free_pamt_metadata:
> + free_pamt_metadata();
> +
> err_reset_pamts:
> /*
> * Part of PAMTs may already have been initialized by the
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
@ 2025-05-05 12:44 ` Huang, Kai
2025-05-07 1:01 ` Yan Zhao
` (2 more replies)
2025-05-14 5:25 ` Chao Gao
2025-05-14 5:33 ` Chao Gao
2 siblings, 3 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-05 12:44 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> Introduce a pair of helpers to allocate and free memory for a given 2M
> range. The range is represented by struct page for any memory in the
> range and the PAMT memory by a list of pages.
>
> Use per-2M refcounts to detect when PAMT memory has to be allocated and
> when it can be freed.
>
> pamt_lock spinlock serializes against races between multiple
> tdx_pamt_add() as well as tdx_pamt_add() vs tdx_pamt_put().
Maybe elaborate a little bit on _why_ a spinlock is used?
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> arch/x86/include/asm/tdx.h | 2 +
> arch/x86/kvm/vmx/tdx.c | 123 +++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx_errno.h | 1 +
> 3 files changed, 126 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8091bf5b43cc..42449c054938 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -135,6 +135,8 @@ static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
> return sysinfo->tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
> }
>
> +atomic_t *tdx_get_pamt_refcount(unsigned long hpa);
> +
This at least needs to be in the same patch which exports it. But as replied to
patch 2, I think we should just move the code in this patch to TDX core code.
> int tdx_guest_keyid_alloc(void);
> u32 tdx_get_nr_guest_keyids(void);
> void tdx_guest_keyid_free(unsigned int keyid);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..ea7e2d93fb44 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -207,6 +207,10 @@ static bool tdx_operand_busy(u64 err)
> return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
> }
>
> +static bool tdx_hpa_range_not_free(u64 err)
> +{
> + return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_HPA_RANGE_NOT_FREE;
> +}
>
> /*
> * A per-CPU list of TD vCPUs associated with a given CPU.
> @@ -276,6 +280,125 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> vcpu->cpu = -1;
> }
>
> +static DEFINE_SPINLOCK(pamt_lock);
> +
> +static void tdx_free_pamt_pages(struct list_head *pamt_pages)
> +{
> + struct page *page;
> +
> + while ((page = list_first_entry_or_null(pamt_pages, struct page, lru))) {
> + list_del(&page->lru);
> + __free_page(page);
> + }
> +}
> +
> +static int tdx_alloc_pamt_pages(struct list_head *pamt_pages)
> +{
> + for (int i = 0; i < tdx_nr_pamt_pages(tdx_sysinfo); i++) {
> + struct page *page = alloc_page(GFP_KERNEL);
> + if (!page)
> + goto fail;
> + list_add(&page->lru, pamt_pages);
> + }
> + return 0;
> +fail:
> + tdx_free_pamt_pages(pamt_pages);
> + return -ENOMEM;
> +}
> +
> +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> + struct list_head *pamt_pages)
> +{
> + u64 err;
> +
> + hpa = ALIGN_DOWN(hpa, SZ_2M);
> +
> + spin_lock(&pamt_lock);
Just curious, Can the lock be per-2M-range?
> +
> + /* Lost race to other tdx_pamt_add() */
> + if (atomic_read(pamt_refcount) != 0) {
> + atomic_inc(pamt_refcount);
> + spin_unlock(&pamt_lock);
> + tdx_free_pamt_pages(pamt_pages);
It's unfortunate that multiple callers of tdx_pamt_add() each have to allocate
PAMT pages outside the spinlock first, only to free them here.
I am thinking that if we make tdx_pamt_add() return:
* > 0: PAMT pages already added (another tdx_pamt_add() won)
* = 0: PAMT pages added successfully
* < 0: error code
... then we could at least move tdx_free_pamt_pages() to the caller too.
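E.g., roughly, on the caller side (sketch only, keeping the locking inside
tdx_pamt_add() as it is now):

	ret = tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);

	/*
	 * Whether another tdx_pamt_add() won the race (ret > 0) or the
	 * SEAMCALL failed (ret < 0), the pages were not consumed and can
	 * be freed by the caller, outside the spinlock.
	 */
	if (ret)
		tdx_free_pamt_pages(&pamt_pages);

	return ret < 0 ? ret : 0;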
> + return 0;
> + }
> +
> + err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pages);
> +
> + if (err)
> + tdx_free_pamt_pages(pamt_pages);
Seems we are calling tdx_free_pamt_pages() while holding the spinlock here,
which is not consistent with the case above where another tdx_pamt_add() has
won the race.
> +
> + /*
> + * tdx_hpa_range_not_free() is true if current task won race
> + * against tdx_pamt_put().
> + */
> + if (err && !tdx_hpa_range_not_free(err)) {
> + spin_unlock(&pamt_lock);
> + pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
> + return -EIO;
> + }
I had a hard time figuring out why we need to handle tdx_hpa_range_not_free()
explicitly. IIUC, it is because atomic_dec_and_test() is used in
tdx_pamt_put(), in which case the atomic_t can reach 0 outside of the
spinlock, so tdh_phymem_pamt_add() can be called while there are still PAMT
pages populated.
But ...
> +
> + atomic_set(pamt_refcount, 1);
> + spin_unlock(&pamt_lock);
> + return 0;
> +}
> +
> +static int tdx_pamt_get(struct page *page)
> +{
> + unsigned long hpa = page_to_phys(page);
> + atomic_t *pamt_refcount;
> + LIST_HEAD(pamt_pages);
> +
> + if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
> + return 0;
> +
> + pamt_refcount = tdx_get_pamt_refcount(hpa);
> + WARN_ON_ONCE(atomic_read(pamt_refcount) < 0);
> +
> + if (atomic_inc_not_zero(pamt_refcount))
> + return 0;
... if we set the initial value of pamt_refcount to -1, and use
atomic_inc_unless_negative() here:

	if (atomic_inc_unless_negative(pamt_refcount))
		return 0;

	if (tdx_alloc_pamt_pages(&pamt_pages))
		return -ENOMEM;

	spin_lock(&pamt_lock);
	ret = tdx_pamt_add(hpa, &pamt_pages);
	if (ret >= 0)
		atomic_inc(pamt_refcount);
	spin_unlock(&pamt_lock);

	/*
	 * If another tdx_pamt_get() won the race, or in case of
	 * error, PAMT pages are not used and can be freed.
	 */
	if (ret)
		tdx_free_pamt_pages(&pamt_pages);

	return ret >= 0 ? 0 : ret;
and ...
> +
> + if (tdx_alloc_pamt_pages(&pamt_pages))
> + return -ENOMEM;
> +
> + return tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);
> +}
> +
> +static void tdx_pamt_put(struct page *page)
> +{
> + unsigned long hpa = page_to_phys(page);
> + atomic_t *pamt_refcount;
> + LIST_HEAD(pamt_pages);
> + u64 err;
> +
> + if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
> + return;
> +
> + hpa = ALIGN_DOWN(hpa, SZ_2M);
> +
> + pamt_refcount = tdx_get_pamt_refcount(hpa);
> + if (!atomic_dec_and_test(pamt_refcount))
> + return;
... use atomic_dec_if_positive() here, we should be able to avoid the special
handling of tdx_hpa_range_not_free() in tdx_pamt_get(). Something like:

	if (atomic_dec_if_positive(pamt_refcount) >= 0)
		return;

	spin_lock(&pamt_lock);

	/* tdx_pamt_get() called more than once */
	if (atomic_read(pamt_refcount) > 0) {
		spin_unlock(&pamt_lock);
		return;
	}

	err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
	atomic_set(pamt_refcount, -1);
	spin_unlock(&pamt_lock);

	tdx_free_pamt_pages(&pamt_pages);
Hmm.. am I missing anything?
> +
> + spin_lock(&pamt_lock);
> +
> + /* Lost race against tdx_pamt_add()? */
> + if (atomic_read(pamt_refcount) != 0) {
> + spin_unlock(&pamt_lock);
> + return;
> + }
> +
> + err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
> + spin_unlock(&pamt_lock);
> +
> + if (err) {
> + pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
> + return;
> + }
> +
> + tdx_free_pamt_pages(&pamt_pages);
> +}
> +
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init()
2025-05-02 13:08 ` [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init() Kirill A. Shutemov
@ 2025-05-05 12:46 ` Huang, Kai
0 siblings, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-05 12:46 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
>
> +static struct page *tdx_alloc_page(void)
> +{
> + struct page *page;
> +
> + page = alloc_page(GFP_KERNEL);
> + if (!page)
> + return NULL;
> +
> + if (tdx_pamt_get(page)) {
> + __free_page(page);
> + return NULL;
> + }
> +
> + return page;
> +}
> +
> +static void tdx_free_page(struct page *page)
> +{
> + if (!page)
> + return;
> +
> + tdx_pamt_put(page);
> + __free_page(page);
> +}
> +
IMO the two should be moved to the TDX core code and exported for KVM to use.
They can also be used by other kernel components, e.g., for TDX Connect.
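I.e., something along these lines, as a sketch only (same bodies as quoted
above, just living in arch/x86/virt/vmx/tdx/tdx.c together with
tdx_pamt_get()/put(), and exported):

	struct page *tdx_alloc_page(void)
	{
		struct page *page;

		page = alloc_page(GFP_KERNEL);
		if (!page)
			return NULL;

		if (tdx_pamt_get(page)) {
			__free_page(page);
			return NULL;
		}

		return page;
	}
	EXPORT_SYMBOL_GPL(tdx_alloc_page);

	void tdx_free_page(struct page *page)
	{
		if (!page)
			return;

		tdx_pamt_put(page);
		__free_page(page);
	}
	EXPORT_SYMBOL_GPL(tdx_free_page);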
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-02 13:08 ` [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops Kirill A. Shutemov
@ 2025-05-06 11:55 ` Yan Zhao
2025-05-08 13:23 ` Kirill A. Shutemov
2025-05-14 6:15 ` Chao Gao
1 sibling, 1 reply; 63+ messages in thread
From: Yan Zhao @ 2025-05-06 11:55 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> The functions kvm_x86_ops::link_external_spt() and
> kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> When using TDX with Dynamic PAMT enabled, the assigned memory must be
> covered by PAMT.
>
> The new function kvm_x86_ops::phys_prepare() is called before
> link_external_spt() and set_external_spte() to ensure that the memory is
> ready to be assigned to the virtual machine. In the case of TDX, it
> makes sure that the memory is covered by PAMT.
>
> kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> is available, allowing the implementation to allocate memory from a
> per-VCPU pool.
>
Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> The function kvm_x86_ops::phys_cleanup() frees PAMT memory in case of
> failure.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 ++
> arch/x86/include/asm/kvm_host.h | 3 ++
> arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++++++++---
> 3 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 79406bf07a1c..37081d04e82f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -99,6 +99,8 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
> KVM_X86_OP_OPTIONAL(set_external_spte)
> KVM_X86_OP_OPTIONAL(free_external_spt)
> KVM_X86_OP_OPTIONAL(remove_external_spte)
> +KVM_X86_OP_OPTIONAL(phys_prepare)
> +KVM_X86_OP_OPTIONAL(phys_cleanup)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6c06f3d6e081..91958c55f918 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1813,6 +1813,9 @@ struct kvm_x86_ops {
> int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> kvm_pfn_t pfn_for_gfn);
>
> + int (*phys_prepare)(struct kvm_vcpu *vcpu, kvm_pfn_t pfn);
> + void (*phys_cleanup)(kvm_pfn_t pfn);
> +
> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 405874f4d088..f6c836b2e6fc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1137,6 +1137,26 @@ void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> }
> }
>
> +static int tdp_mmu_install_spte(struct kvm_vcpu *vcpu,
> + struct tdp_iter *iter,
> + u64 spte)
> +{
> + kvm_pfn_t pfn = 0;
> + int ret = 0;
> +
> + if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(spte)) {
> + pfn = spte_to_pfn(spte);
> + ret = static_call(kvm_x86_phys_prepare)(vcpu, pfn);
> + }
> + if (ret)
> + return ret;
> + ret = tdp_mmu_set_spte_atomic(vcpu->kvm, iter, spte);
> + if (pfn && ret)
> + static_call(kvm_x86_phys_cleanup)(pfn);
> +
> + return ret;
> +}
> +
> /*
> * Installs a last-level SPTE to handle a TDP page fault.
> * (NPT/EPT violation/misconfiguration)
> @@ -1170,7 +1190,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
> - else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> + else if (tdp_mmu_install_spte(vcpu, iter, new_spte))
> return RET_PF_RETRY;
> else if (is_shadow_present_pte(iter->old_spte) &&
> (!is_last_spte(iter->old_spte, iter->level) ||
> @@ -1211,7 +1231,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> * Returns: 0 if the new page table was installed. Non-0 if the page table
> * could not be installed (e.g. the atomic compare-exchange failed).
> */
> -static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> +static int __tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> struct kvm_mmu_page *sp, bool shared)
> {
> u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled);
> @@ -1230,6 +1250,25 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> return 0;
> }
>
> +static int tdp_mmu_link_sp(struct kvm_vcpu *vcpu, struct tdp_iter *iter,
> + struct kvm_mmu_page *sp, bool shared)
> +{
> + kvm_pfn_t pfn = 0;
> + int ret = 0;
> +
> + if (sp->external_spt) {
> + pfn = __pa(sp->external_spt) >> PAGE_SHIFT;
> + ret = static_call(kvm_x86_phys_prepare)(vcpu, pfn);
> + if (ret)
> + return ret;
> + }
> + ret = __tdp_mmu_link_sp(vcpu->kvm, iter, sp, shared);
> + if (pfn && ret)
> + static_call(kvm_x86_phys_cleanup)(pfn);
> +
> + return ret;
> +}
> +
> static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> struct kvm_mmu_page *sp, bool shared);
>
> @@ -1288,7 +1327,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
> r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> } else {
> - r = tdp_mmu_link_sp(kvm, &iter, sp, true);
> + r = tdp_mmu_link_sp(vcpu, &iter, sp, true);
> }
>
> /*
> @@ -1514,7 +1553,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> * correctness standpoint since the translation will be the same either
> * way.
> */
> - ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
> + ret = __tdp_mmu_link_sp(kvm, iter, sp, shared);
> if (ret)
> goto out;
>
> --
> 2.47.2
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-05 12:44 ` Huang, Kai
@ 2025-05-07 1:01 ` Yan Zhao
2025-05-07 1:15 ` Vishal Annapurve
2025-05-07 16:31 ` Dave Hansen
2025-05-23 9:42 ` kirill.shutemov
2 siblings, 1 reply; 63+ messages in thread
From: Yan Zhao @ 2025-05-07 1:01 UTC (permalink / raw)
To: Huang, Kai
Cc: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com, Edgecombe, Rick P, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Mon, May 05, 2025 at 08:44:26PM +0800, Huang, Kai wrote:
> On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> > +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > + struct list_head *pamt_pages)
> > +{
> > + u64 err;
> > +
> > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > +
> > + spin_lock(&pamt_lock);
>
> Just curious, Can the lock be per-2M-range?
Me too.
Could we introduce smaller locks each covering a 2M range?
And could we deposit 2 pamt pages per-2M hpa range no matter if it's finally
mapped as a huge page or not?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-07 1:01 ` Yan Zhao
@ 2025-05-07 1:15 ` Vishal Annapurve
2025-05-07 2:42 ` Yan Zhao
0 siblings, 1 reply; 63+ messages in thread
From: Vishal Annapurve @ 2025-05-07 1:15 UTC (permalink / raw)
To: Yan Zhao
Cc: Huang, Kai, kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com, Edgecombe, Rick P, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Tue, May 6, 2025 at 6:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, May 05, 2025 at 08:44:26PM +0800, Huang, Kai wrote:
> > On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> > > +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > > + struct list_head *pamt_pages)
> > > +{
> > > + u64 err;
> > > +
> > > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > > +
> > > + spin_lock(&pamt_lock);
> >
> > Just curious, Can the lock be per-2M-range?
> Me too.
> Could we introduce smaller locks each covering a 2M range?
>
> And could we deposit 2 pamt pages per-2M hpa range no matter if it's finally
> mapped as a huge page or not?
>
Are you suggesting to keep 2 PAMT pages allocated for each private 2M
page even if it's mapped as a hugepage? It will lead to wastage of
memory of 4 MB per 1GB of guest memory range. For large VM sizes that
will amount to high values.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-07 1:15 ` Vishal Annapurve
@ 2025-05-07 2:42 ` Yan Zhao
2025-05-08 13:19 ` kirill.shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Yan Zhao @ 2025-05-07 2:42 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Huang, Kai, kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com, Edgecombe, Rick P, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Tue, May 06, 2025 at 06:15:40PM -0700, Vishal Annapurve wrote:
> On Tue, May 6, 2025 at 6:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Mon, May 05, 2025 at 08:44:26PM +0800, Huang, Kai wrote:
> > > On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> > > > +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > > > + struct list_head *pamt_pages)
> > > > +{
> > > > + u64 err;
> > > > +
> > > > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > > > +
> > > > + spin_lock(&pamt_lock);
> > >
> > > Just curious, Can the lock be per-2M-range?
> > Me too.
> > Could we introduce smaller locks each covering a 2M range?
> >
> > And could we deposit 2 pamt pages per-2M hpa range no matter if it's finally
> > mapped as a huge page or not?
> >
>
> Are you suggesting to keep 2 PAMT pages allocated for each private 2M
> page even if it's mapped as a hugepage? It will lead to wastage of
> memory of 4 MB per 1GB of guest memory range. For large VM sizes that
> will amount to high values.
Ok. I'm thinking about the possibility of aligning the time of PAMT page
allocation with that of physical page allocation.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-05 12:44 ` Huang, Kai
2025-05-07 1:01 ` Yan Zhao
@ 2025-05-07 16:31 ` Dave Hansen
2025-05-08 2:08 ` Yan Zhao
2025-05-08 13:16 ` kirill.shutemov
2025-05-23 9:42 ` kirill.shutemov
2 siblings, 2 replies; 63+ messages in thread
From: Dave Hansen @ 2025-05-07 16:31 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On 5/5/25 05:44, Huang, Kai wrote:
>> +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
>> + struct list_head *pamt_pages)
>> +{
>> + u64 err;
>> +
>> + hpa = ALIGN_DOWN(hpa, SZ_2M);
>> +
>> + spin_lock(&pamt_lock);
> Just curious, Can the lock be per-2M-range?
Folks, please keep it simple.
If there's lock contention on this, we'll fix the lock contention, or
hash the physical address into a fixed number of locks. But having it be
per-2M-range sounds awful. Then you have to size it, and allocate it and
then resize it if there's ever hotplug, etc...
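(If contention ever does show up, the hashed-locks fallback is only a handful
of lines. A sketch, with a made-up array size and helper name, and with each
lock still needing spin_lock_init() during TDX init:)

	#define PAMT_LOCK_HASH_BITS	6
	#define NR_PAMT_LOCKS		(1 << PAMT_LOCK_HASH_BITS)

	static spinlock_t pamt_locks[NR_PAMT_LOCKS];

	static spinlock_t *pamt_lock_for(unsigned long hpa)
	{
		return &pamt_locks[hash_long(hpa >> PMD_SHIFT, PAMT_LOCK_HASH_BITS)];
	}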
Kirill, could you put together some kind of torture test for this,
please? I would imagine a workload which is sitting in a loop setting up
and tearing down VMs on a bunch of CPUs would do it.
That ^ would be the worst possible case, I think. If you don't see lock
contention there, you'll hopefully never see it on real systems.
I *suspect* that real systems will get bottlenecked somewhere in the
page conversion process rather than on this lock. But it should be a
pretty simple experiment to run.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-07 16:31 ` Dave Hansen
@ 2025-05-08 2:08 ` Yan Zhao
2025-05-08 13:21 ` kirill.shutemov
2025-05-08 13:16 ` kirill.shutemov
1 sibling, 1 reply; 63+ messages in thread
From: Yan Zhao @ 2025-05-08 2:08 UTC (permalink / raw)
To: Dave Hansen
Cc: Huang, Kai, kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com, Edgecombe, Rick P, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Wed, May 07, 2025 at 09:31:22AM -0700, Dave Hansen wrote:
> On 5/5/25 05:44, Huang, Kai wrote:
> >> +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> >> + struct list_head *pamt_pages)
> >> +{
> >> + u64 err;
> >> +
> >> + hpa = ALIGN_DOWN(hpa, SZ_2M);
> >> +
> >> + spin_lock(&pamt_lock);
> > Just curious, Can the lock be per-2M-range?
>
> Folks, please keep it simple.
>
> If there's lock contention on this, we'll fix the lock contention, or
> hash the physical address into a fixed number of locks. But having it be
> per-2M-range sounds awful. Then you have to size it, and allocate it and
> then resize it if there's ever hotplug, etc...
In patch 2, there're per-2M-range pamt_refcounts. Could the per-2M-range
lock be implemented in a similar way?
+static atomic_t *pamt_refcounts;
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
+{
+ return &pamt_refcounts[hpa / PMD_SIZE];
+}
> Kirill, could you put together some kind of torture test for this,
> please? I would imagine a workload which is sitting in a loop setting up
> and tearing down VMs on a bunch of CPUs would do it.
>
> That ^ would be the worst possible case, I think. If you don't see lock
> contention there, you'll hopefully never see it on real systems.
When one vCPU is trying to install a guest page at HPA A while another vCPU
is trying to install a guest page at HPA B, theoretically they may contend for
the global pamt_lock even if HPA A and B belong to different PAMT 2M blocks.
> I *suspect* that real systems will get bottlenecked somewhere in the
> page conversion process rather than on this lock. But it should be a
> pretty simple experiment to run.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-05 11:05 ` Huang, Kai
@ 2025-05-08 13:03 ` kirill.shutemov
2025-05-09 1:06 ` Huang, Kai
0 siblings, 1 reply; 63+ messages in thread
From: kirill.shutemov @ 2025-05-08 13:03 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Mon, May 05, 2025 at 11:05:12AM +0000, Huang, Kai wrote:
>
> > +static atomic_t *pamt_refcounts;
> > +
> > static enum tdx_module_status_t tdx_module_status;
> > static DEFINE_MUTEX(tdx_module_lock);
> >
> > @@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
> > return ret;
> > }
> >
> > +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > +{
> > + return &pamt_refcounts[hpa / PMD_SIZE];
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
>
> It's not quite clear why this function needs to be exported in this patch. IMO
> it's better to move the export to the patch which actually needs it.
>
> Looking at patch 5, tdx_pamt_get()/put() use it, and they are in KVM code. But
> I think we should just put them here in this file. tdx_alloc_page() and
> tdx_free_page() should be in this file too.
>
> And instead of exporting tdx_get_pamt_refcount(), the TDX core code here can
> export tdx_alloc_page() and tdx_free_page(), providing two high level helpers to
> allow the TDX users (e.g., KVM) to allocate/free TDX private pages. How PAMT
> pages are allocated is then hidden in the core TDX code.
We would still need tdx_get_pamt_refcount() to handle the case where we need to
bump the refcount for a page allocated elsewhere.
> > +
> > +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> > +{
> > + unsigned long vaddr;
> > + pte_t entry;
> > +
> > + if (!pte_none(ptep_get(pte)))
> > + return 0;
> > +
> > + vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!vaddr)
> > + return -ENOMEM;
> > +
> > + entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > + if (pte_none(ptep_get(pte)))
> > + set_pte_at(&init_mm, addr, pte, entry);
> > + else
> > + free_page(vaddr);
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> > + void *data)
> > +{
> > + unsigned long vaddr;
> > +
> > + vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > + if (!pte_none(ptep_get(pte))) {
> > + pte_clear(&init_mm, addr, pte);
> > + free_page(vaddr);
> > + }
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int alloc_tdmr_pamt_refcount(struct tdmr_info *tdmr)
> > +{
> > + unsigned long start, end;
> > +
> > + start = (unsigned long)tdx_get_pamt_refcount(tdmr->base);
> > + end = (unsigned long)tdx_get_pamt_refcount(tdmr->base + tdmr->size);
> > + start = round_down(start, PAGE_SIZE);
> > + end = round_up(end, PAGE_SIZE);
> > +
> > + return apply_to_page_range(&init_mm, start, end - start,
> > + pamt_refcount_populate, NULL);
> > +}
>
> IIUC, populating the refcounts based on TDMRs will slightly waste memory. The
> reason is that we don't need to populate the refcount for a 2M range that is
> completely marked as reserved in the TDMR, because it's not possible for the
> kernel to use such a range for TDX.
>
> Populating based on the list of TDX memory blocks should be better. In
> practice, the difference should be unnoticeable, but conceptually, using TDX
> memory blocks is better.
Okay, I will look into this after dealing with huge pages.
> > +
> > +static int init_pamt_metadata(void)
> > +{
> > + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> > + struct vm_struct *area;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + /*
> > + * Reserve vmalloc range for PAMT reference counters. It covers all
> > + * physical address space up to max_pfn. It is going to be populated
> > + * from init_tdmr() only for present memory that available for TDX use.
> > + */
> > + area = get_vm_area(size, VM_IOREMAP);
> > + if (!area)
> > + return -ENOMEM;
> > +
> > + pamt_refcounts = area->addr;
> > + return 0;
> > +}
> > +
> > +static void free_pamt_metadata(void)
> > +{
> > + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> > +
> > + size = round_up(size, PAGE_SIZE);
> > + apply_to_existing_page_range(&init_mm,
> > + (unsigned long)pamt_refcounts,
> > + size, pamt_refcount_depopulate,
> > + NULL);
> > + vfree(pamt_refcounts);
> > + pamt_refcounts = NULL;
> > +}
> > +
> > static int init_tdmr(struct tdmr_info *tdmr)
> > {
> > u64 next;
> > + int ret;
> > +
> > + ret = alloc_tdmr_pamt_refcount(tdmr);
> > + if (ret)
> > + return ret;
> >
> > /*
> > * Initializing a TDMR can be time consuming. To avoid long
> > @@ -1048,7 +1150,6 @@ static int init_tdmr(struct tdmr_info *tdmr)
> > struct tdx_module_args args = {
> > .rcx = tdmr->base,
> > };
> > - int ret;
> >
> > ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args);
> > if (ret)
> > @@ -1134,10 +1235,15 @@ static int init_tdx_module(void)
> > if (ret)
> > goto err_reset_pamts;
> >
> > + /* Reserve vmalloc range for PAMT reference counters */
> > + ret = init_pamt_metadata();
> > + if (ret)
> > + goto err_reset_pamts;
> > +
> > /* Initialize TDMRs to complete the TDX module initialization */
> > ret = init_tdmrs(&tdx_tdmr_list);
> > if (ret)
> > - goto err_reset_pamts;
> > + goto err_free_pamt_metadata;
> >
> > pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
> >
> > @@ -1149,6 +1255,9 @@ static int init_tdx_module(void)
> > put_online_mems();
> > return ret;
> >
> > +err_free_pamt_metadata:
> > + free_pamt_metadata();
> > +
> > err_reset_pamts:
> > /*
> > * Part of PAMTs may already have been initialized by the
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-07 16:31 ` Dave Hansen
2025-05-08 2:08 ` Yan Zhao
@ 2025-05-08 13:16 ` kirill.shutemov
1 sibling, 0 replies; 63+ messages in thread
From: kirill.shutemov @ 2025-05-08 13:16 UTC (permalink / raw)
To: Dave Hansen
Cc: Huang, Kai, pbonzini@redhat.com, seanjc@google.com,
Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Wed, May 07, 2025 at 09:31:22AM -0700, Dave Hansen wrote:
> On 5/5/25 05:44, Huang, Kai wrote:
> >> +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> >> + struct list_head *pamt_pages)
> >> +{
> >> + u64 err;
> >> +
> >> + hpa = ALIGN_DOWN(hpa, SZ_2M);
> >> +
> >> + spin_lock(&pamt_lock);
> > Just curious, Can the lock be per-2M-range?
>
> Folks, please keep it simple.
>
> If there's lock contention on this, we'll fix the lock contention, or
> hash the physical address into a fixed number of locks.
I had this idea in mind as well.
> But having it be
> per-2M-range sounds awful. Then you have to size it, and allocate it and
> then resize it if there's ever hotplug, etc...
>
> Kirill, could you put together some kind of torture test for this,
> please? I would imagine a workload which is sitting in a loop setting up
> and tearing down VMs on a bunch of CPUs would do it.
It has to be multiple parallel creation/teardown loops. With a single TD we
won't see much concurrency: most PAMT allocations come from a single vCPU.
And it makes sense to do this with huge pages, as that cuts the amount of PAMT
memory allocated on TD creation by a factor of 10 in my setup.
JFYI, booting a TD with huge pages consumes 1-2MB of PAMT memory. I doubt
any optimization here is justifiable.
> That ^ would be the worst possible case, I think. If you don't see lock
> contention there, you'll hopefully never see it on real systems.
>
> I *suspect* that real systems will get bottlenecked somewhere in the
> page conversion process rather than on this lock. But it should be a
> pretty simple experiment to run.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-07 2:42 ` Yan Zhao
@ 2025-05-08 13:19 ` kirill.shutemov
0 siblings, 0 replies; 63+ messages in thread
From: kirill.shutemov @ 2025-05-08 13:19 UTC (permalink / raw)
To: Yan Zhao
Cc: Vishal Annapurve, Huang, Kai, pbonzini@redhat.com,
seanjc@google.com, Edgecombe, Rick P, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Wed, May 07, 2025 at 10:42:25AM +0800, Yan Zhao wrote:
> On Tue, May 06, 2025 at 06:15:40PM -0700, Vishal Annapurve wrote:
> > On Tue, May 6, 2025 at 6:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Mon, May 05, 2025 at 08:44:26PM +0800, Huang, Kai wrote:
> > > > On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> > > > > +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > > > > + struct list_head *pamt_pages)
> > > > > +{
> > > > > + u64 err;
> > > > > +
> > > > > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > > > > +
> > > > > + spin_lock(&pamt_lock);
> > > >
> > > > Just curious, Can the lock be per-2M-range?
> > > Me too.
> > > Could we introduce smaller locks each covering a 2M range?
> > >
> > > And could we deposit 2 pamt pages per-2M hpa range no matter if it's finally
> > > mapped as a huge page or not?
> > >
> >
> > Are you suggesting to keep 2 PAMT pages allocated for each private 2M
> > page even if it's mapped as a hugepage? It will lead to wastage of
> > memory of 4 MB per 1GB of guest memory range. For large VM sizes that
> > will amount to high values.
> Ok. I'm thinking of the possibility to aligning the time of PAMT page allocation
> to that of physical page allocation.
No. That's mostly wasted memory. We need to aim to allocate memory only as
needed. With huge pages, the vast majority of such allocations will never be
needed.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-08 2:08 ` Yan Zhao
@ 2025-05-08 13:21 ` kirill.shutemov
0 siblings, 0 replies; 63+ messages in thread
From: kirill.shutemov @ 2025-05-08 13:21 UTC (permalink / raw)
To: Yan Zhao
Cc: Dave Hansen, Huang, Kai, pbonzini@redhat.com, seanjc@google.com,
Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Thu, May 08, 2025 at 10:08:32AM +0800, Yan Zhao wrote:
> On Wed, May 07, 2025 at 09:31:22AM -0700, Dave Hansen wrote:
> > On 5/5/25 05:44, Huang, Kai wrote:
> > >> +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > >> + struct list_head *pamt_pages)
> > >> +{
> > >> + u64 err;
> > >> +
> > >> + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > >> +
> > >> + spin_lock(&pamt_lock);
> > > Just curious, Can the lock be per-2M-range?
> >
> > Folks, please keep it simple.
> >
> > If there's lock contention on this, we'll fix the lock contention, or
> > hash the physical address into a fixed number of locks. But having it be
> > per-2M-range sounds awful. Then you have to size it, and allocate it and
> > then resize it if there's ever hotplug, etc...
> In patch 2, there're per-2M-range pamt_refcounts. Could the per-2M-range
> lock be implemented in a similar way?
>
> +static atomic_t *pamt_refcounts;
> +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> +{
> + return &pamt_refcounts[hpa / PMD_SIZE];
> +}
But why? If there is no contention, it is just wasteful.
> > Kirill, could you put together some kind of torture test for this,
> > please? I would imagine a workload which is sitting in a loop setting up
> > and tearing down VMs on a bunch of CPUs would do it.
> >
> > That ^ would be the worst possible case, I think. If you don't see lock
> > contention there, you'll hopefully never see it on real systems.
> When one vCPU is trying to install a guest page at HPA A while another vCPU
> is trying to install a guest page at HPA B, theoretically they may contend for
> the global pamt_lock even if HPA A and B belong to different PAMT 2M blocks.
This contention will be momentary, if it ever happens at all.
> > I *suspect* that real systems will get bottlenecked somewhere in the
> > page conversion process rather than on this lock. But it should be a
> > pretty simple experiment to run.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-06 11:55 ` Yan Zhao
@ 2025-05-08 13:23 ` Kirill A. Shutemov
2025-05-09 1:25 ` Yan Zhao
0 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-08 13:23 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > The functions kvm_x86_ops::link_external_spt() and
> > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > covered by PAMT.
> >
> > The new function kvm_x86_ops::phys_prepare() is called before
> > link_external_spt() and set_external_spte() to ensure that the memory is
> > ready to be assigned to the virtual machine. In the case of TDX, it
> > makes sure that the memory is covered by PAMT.
> >
> > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > is available, allowing the implementation to allocate memory from a
> > per-VCPU pool.
> >
> Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
Because the memory pool we allocated from is per-vCPU and we have lost access
to the vCPU by then. And not all callers provide a vCPU.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-08 13:03 ` kirill.shutemov
@ 2025-05-09 1:06 ` Huang, Kai
2025-05-12 9:53 ` kirill.shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Huang, Kai @ 2025-05-09 1:06 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
dave.hansen@linux.intel.com, bp@alien8.de, mingo@redhat.com,
Zhao, Yan Y, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Thu, 2025-05-08 at 16:03 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, May 05, 2025 at 11:05:12AM +0000, Huang, Kai wrote:
> >
> > > +static atomic_t *pamt_refcounts;
> > > +
> > > static enum tdx_module_status_t tdx_module_status;
> > > static DEFINE_MUTEX(tdx_module_lock);
> > >
> > > @@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
> > > return ret;
> > > }
> > >
> > > +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > > +{
> > > + return &pamt_refcounts[hpa / PMD_SIZE];
> > > +}
> > > +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
> >
> > It's not quite clear why this function needs to be exported in this patch. IMO
> > it's better to move the export to the patch which actually needs it.
> >
> > Looking at patch 5, tdx_pamt_get()/put() use it, and they are in KVM code. But
> > I think we should just put them here in this file. tdx_alloc_page() and
> > tdx_free_page() should be in this file too.
> >
> > And instead of exporting tdx_get_pamt_refcount(), the TDX core code here can
> > export tdx_alloc_page() and tdx_free_page(), providing two high level helpers to
> > allow the TDX users (e.g., KVM) to allocate/free TDX private pages. How PAMT
> > pages are allocated is then hidden in the core TDX code.
>
> We would still need tdx_get_pamt_refcount() to handle the case where we need to
> bump the refcount for a page allocated elsewhere.
Hmm I am not sure I am following this. What "page allocated" are you referring
to? I am probably missing something, but if the caller wants a TDX page then it
should just call tdx_alloc_page() which handles refcount bumping internally.
No?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-08 13:23 ` Kirill A. Shutemov
@ 2025-05-09 1:25 ` Yan Zhao
2025-05-12 9:55 ` Kirill A. Shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Yan Zhao @ 2025-05-09 1:25 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > The functions kvm_x86_ops::link_external_spt() and
> > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > covered by PAMT.
> > >
> > > The new function kvm_x86_ops::phys_prepare() is called before
> > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > makes sure that the memory is covered by PAMT.
> > >
> > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > is available, allowing the implementation to allocate memory from a
> > > per-VCPU pool.
> > >
> > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
>
> Because the memory pool we allocated from is per-vCPU and we have lost access
> to the vCPU by then. And not all callers provide a vCPU.
Maybe we can get the vcpu via kvm_get_running_vcpu(), as in [1].
Then for callers not providing a vcpu (where vcpu is NULL), we could use a
per-KVM cache?
[1] https://lore.kernel.org/all/20250424030926.554-1-yan.y.zhao@intel.com/
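Roughly, a sketch only (to_tdx()/to_kvm_tdx() are the existing KVM TDX
accessors; the pamt_pool field and tdx_pamt_get_from_pool() are hypothetical
names, not from this series):

	static int tdx_phys_prepare(struct kvm *kvm, kvm_pfn_t pfn)
	{
		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
		struct tdx_pamt_pool *pool;

		if (vcpu)
			pool = &to_tdx(vcpu)->pamt_pool;	/* per-vCPU pool */
		else
			pool = &to_kvm_tdx(kvm)->pamt_pool;	/* per-VM fallback */

		return tdx_pamt_get_from_pool(pool, pfn);
	}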
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-02 13:08 ` [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
2025-05-05 11:05 ` Huang, Kai
@ 2025-05-09 9:52 ` Chao Gao
2025-05-12 9:51 ` Kirill A. Shutemov
1 sibling, 1 reply; 63+ messages in thread
From: Chao Gao @ 2025-05-09 9:52 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
>+static int init_pamt_metadata(void)
>+{
>+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
>+ struct vm_struct *area;
>+
>+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
>+ return 0;
>+
>+ /*
>+ * Reserve vmalloc range for PAMT reference counters. It covers all
>+ * physical address space up to max_pfn. It is going to be populated
>+ * from init_tdmr() only for present memory that available for TDX use.
>+ */
>+ area = get_vm_area(size, VM_IOREMAP);
>+ if (!area)
>+ return -ENOMEM;
>+
>+ pamt_refcounts = area->addr;
>+ return 0;
>+}
>+
>+static void free_pamt_metadata(void)
>+{
>+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
>+
Shouldn't the free path also be gated by tdx_supports_dynamic_pamt()?
There is a possibility that pamt_refcounts could be NULL here, e.g., if the
TDX module doesn't support Dynamic PAMT and init_tdmrs() encountered an
error. I am assuming that apply_to_existing_page_range() below would cause
issues if pamt_refcounts is NULL, e.g., unmapping mappings set up by others.
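Something like the below, as a sketch (mirroring the check already done in
init_pamt_metadata()):

	static void free_pamt_metadata(void)
	{
		size_t size;

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return;

		size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
		size = round_up(size, PAGE_SIZE);
		apply_to_existing_page_range(&init_mm,
					     (unsigned long)pamt_refcounts,
					     size, pamt_refcount_depopulate,
					     NULL);
		vfree(pamt_refcounts);
		pamt_refcounts = NULL;
	}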
>+ size = round_up(size, PAGE_SIZE);
>+ apply_to_existing_page_range(&init_mm,
>+ (unsigned long)pamt_refcounts,
>+ size, pamt_refcount_depopulate,
>+ NULL);
>+ vfree(pamt_refcounts);
>+ pamt_refcounts = NULL;
>+}
>+
> static int init_tdmr(struct tdmr_info *tdmr)
> {
> u64 next;
>+ int ret;
>+
>+ ret = alloc_tdmr_pamt_refcount(tdmr);
>+ if (ret)
>+ return ret;
>
> /*
> * Initializing a TDMR can be time consuming. To avoid long
>@@ -1048,7 +1150,6 @@ static int init_tdmr(struct tdmr_info *tdmr)
> struct tdx_module_args args = {
> .rcx = tdmr->base,
> };
>- int ret;
>
> ret = seamcall_prerr_ret(TDH_SYS_TDMR_INIT, &args);
> if (ret)
>@@ -1134,10 +1235,15 @@ static int init_tdx_module(void)
> if (ret)
> goto err_reset_pamts;
>
>+ /* Reserve vmalloc range for PAMT reference counters */
>+ ret = init_pamt_metadata();
>+ if (ret)
>+ goto err_reset_pamts;
>+
> /* Initialize TDMRs to complete the TDX module initialization */
> ret = init_tdmrs(&tdx_tdmr_list);
> if (ret)
>- goto err_reset_pamts;
>+ goto err_free_pamt_metadata;
>
> pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
>
>@@ -1149,6 +1255,9 @@ static int init_tdx_module(void)
> put_online_mems();
> return ret;
>
>+err_free_pamt_metadata:
>+ free_pamt_metadata();
>+
> err_reset_pamts:
> /*
> * Part of PAMTs may already have been initialized by the
>--
>2.47.2
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
2025-05-02 13:08 ` [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE Kirill A. Shutemov
@ 2025-05-09 10:18 ` Chao Gao
2025-05-12 9:55 ` Kirill A. Shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Chao Gao @ 2025-05-09 10:18 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
> int tdx_guest_keyid_alloc(void);
> u32 tdx_get_nr_guest_keyids(void);
> void tdx_guest_keyid_free(unsigned int keyid);
>@@ -197,6 +202,9 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
> u64 tdh_phymem_cache_wb(bool resume);
> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
>+u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages);
>+u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages);
When these SEAMCALL wrappers were added, Dave requested that a struct page
be passed in instead of an HPA [*]. Does this apply to
tdh_phymem_pamt_add/remove()?
[*]: https://lore.kernel.org/kvm/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com/
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-09 9:52 ` Chao Gao
@ 2025-05-12 9:51 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-12 9:51 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Fri, May 09, 2025 at 05:52:16PM +0800, Chao Gao wrote:
> >+static int init_pamt_metadata(void)
> >+{
> >+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> >+ struct vm_struct *area;
> >+
> >+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> >+ return 0;
> >+
> >+ /*
> >+ * Reserve vmalloc range for PAMT reference counters. It covers all
> >+ * physical address space up to max_pfn. It is going to be populated
> >+ * from init_tdmr() only for present memory that available for TDX use.
> >+ */
> >+ area = get_vm_area(size, VM_IOREMAP);
> >+ if (!area)
> >+ return -ENOMEM;
> >+
> >+ pamt_refcounts = area->addr;
> >+ return 0;
> >+}
> >+
> >+static void free_pamt_metadata(void)
> >+{
> >+ size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> >+
>
> Shouldn't the free path also be gated by tdx_supports_dynamic_pamt()?
True. Missed this.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-09 1:06 ` Huang, Kai
@ 2025-05-12 9:53 ` kirill.shutemov
2025-05-13 23:24 ` Huang, Kai
0 siblings, 1 reply; 63+ messages in thread
From: kirill.shutemov @ 2025-05-12 9:53 UTC (permalink / raw)
To: Huang, Kai
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
dave.hansen@linux.intel.com, bp@alien8.de, mingo@redhat.com,
Zhao, Yan Y, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, May 09, 2025 at 01:06:05AM +0000, Huang, Kai wrote:
> On Thu, 2025-05-08 at 16:03 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Mon, May 05, 2025 at 11:05:12AM +0000, Huang, Kai wrote:
> > >
> > > > +static atomic_t *pamt_refcounts;
> > > > +
> > > > static enum tdx_module_status_t tdx_module_status;
> > > > static DEFINE_MUTEX(tdx_module_lock);
> > > >
> > > > @@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
> > > > return ret;
> > > > }
> > > >
> > > > +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > > > +{
> > > > + return &pamt_refcounts[hpa / PMD_SIZE];
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
> > >
> > > It's not quite clear why this function needs to be exported in this patch. IMO
> > > it's better to move the export to the patch which actually needs it.
> > >
> > > Looking at patch 5, tdx_pamt_get()/put() use it, and they are in KVM code. But
> > > I think we should just put them here in this file. tdx_alloc_page() and
> > > tdx_free_page() should be in this file too.
> > >
> > > And instead of exporting tdx_get_pamt_refcount(), the TDX core code here can
> > > export tdx_alloc_page() and tdx_free_page(), providing two high level helpers to
> > > allow the TDX users (e.g., KVM) to allocate/free TDX private pages. How PAMT
> > > pages are allocated is then hidden in the core TDX code.
> >
> > We would still need tdx_get_pamt_refcount() to handle case when we need to
> > bump refcount for page allocated elsewhere.
>
> Hmm I am not sure I am following this. What "page allocated" are you referring
> to? I am probably missing something, but if the caller wants a TDX page then it
> should just call tdx_alloc_page() which handles refcount bumping internally.
> No?
Pages that get mapped into the guest are allocated externally via
guest_memfd, and we need to bump the refcount for them.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
2025-05-09 10:18 ` Chao Gao
@ 2025-05-12 9:55 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-12 9:55 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Fri, May 09, 2025 at 06:18:01PM +0800, Chao Gao wrote:
> > int tdx_guest_keyid_alloc(void);
> > u32 tdx_get_nr_guest_keyids(void);
> > void tdx_guest_keyid_free(unsigned int keyid);
> >@@ -197,6 +202,9 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
> > u64 tdh_phymem_cache_wb(bool resume);
> > u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
> > u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
> >+u64 tdh_phymem_pamt_add(unsigned long hpa, struct list_head *pamt_pages);
> >+u64 tdh_phymem_pamt_remove(unsigned long hpa, struct list_head *pamt_pages);
>
> When these SEAMCALL wrappers were added, Dave requested that a struct page
> be passed in instead of an HPA [*]. Does this apply to
> tdh_phymem_pamt_add/remove()?
>
> [*]: https://lore.kernel.org/kvm/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com/
hpa here points to a 2M region that pamt_pages covers. We don't have a
struct page that represents it. Passing a 4k struct page would be
misleading IMO.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-09 1:25 ` Yan Zhao
@ 2025-05-12 9:55 ` Kirill A. Shutemov
2025-05-14 0:00 ` Huang, Kai
0 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-12 9:55 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > The functions kvm_x86_ops::link_external_spt() and
> > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > covered by PAMT.
> > > >
> > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > makes sure that the memory is covered by PAMT.
> > > >
> > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > is available, allowing the implementation to allocate memory from a
> > > > per-VCPU pool.
> > > >
> > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> >
> > Because the memory pool we allocated from is per-vcpu and we lost access
> > to vcpu by then. And not all callers provide vcpu.
> Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> cache?
Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
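Something like the below, perhaps (rough sketch; the per-VM fallback cache is
hypothetical and doesn't exist in this series):

	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
	struct kvm_mmu_memory_cache *cache;

	/* Prefer the per-vCPU pool; otherwise fall back to a per-VM cache */
	cache = vcpu ? &vcpu->arch.pamt_page_cache
		     : &kvm->arch.pamt_page_cache;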
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory
2025-05-12 9:53 ` kirill.shutemov
@ 2025-05-13 23:24 ` Huang, Kai
0 siblings, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-13 23:24 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com
Cc: Edgecombe, Rick P, seanjc@google.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, mingo@redhat.com,
Zhao, Yan Y, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Mon, 2025-05-12 at 12:53 +0300, kirill.shutemov@linux.intel.com wrote:
> On Fri, May 09, 2025 at 01:06:05AM +0000, Huang, Kai wrote:
> > On Thu, 2025-05-08 at 16:03 +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Mon, May 05, 2025 at 11:05:12AM +0000, Huang, Kai wrote:
> > > >
> > > > > +static atomic_t *pamt_refcounts;
> > > > > +
> > > > > static enum tdx_module_status_t tdx_module_status;
> > > > > static DEFINE_MUTEX(tdx_module_lock);
> > > > >
> > > > > @@ -1035,9 +1038,108 @@ static int config_global_keyid(void)
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > > > > +{
> > > > > + return &pamt_refcounts[hpa / PMD_SIZE];
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
> > > >
> > > > It's not quite clear why this function needs to be exported in this patch. IMO
> > > > it's better to move the export to the patch which actually needs it.
> > > >
> > > > Looking at patch 5, tdx_pamt_get()/put() use it, and they are in KVM code. But
> > > > I think we should just put them here in this file. tdx_alloc_page() and
> > > > tdx_free_page() should be in this file too.
> > > >
> > > > And instead of exporting tdx_get_pamt_refcount(), the TDX core code here can
> > > > export tdx_alloc_page() and tdx_free_page(), providing two high level helpers to
> > > > allow the TDX users (e.g., KVM) to allocate/free TDX private pages. How PAMT
> > > > pages are allocated is then hidden in the core TDX code.
> > >
> > > We would still need tdx_get_pamt_refcount() to handle case when we need to
> > > bump refcount for page allocated elsewhere.
> >
> > Hmm I am not sure I am following this. What "page allocated" are you referring
> > to? I am probably missing something, but if the caller wants a TDX page then it
> > should just call tdx_alloc_page() which handles refcount bumping internally.
> > No?
>
> Pages that get mapped to the guest is allocated externally via
> guest_memfd and we need bump refcount for them.
Oh right. TDX private pages can also be in page cache.
It's better to have a way to consolidate page allocation for TDX, but with page
cache I don't see a simple, straightforward way to do that.
For now, I think we can just export tdx_pamt_{get|put}() in the core TDX code.
We can also provide tdx_{alloc|free}_page() wrappers (e.g., static inline in
<asm/tdx.h>) for kernel TDX memory allocation so that they can be used for TDX
Connect too.
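E.g., something along these lines (untested; assumes tdx_pamt_get() takes a
struct page and returns 0 on success):

/* Possible static inline wrappers for <asm/tdx.h> */
static inline struct page *tdx_alloc_page(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	if (!page)
		return NULL;

	/* Make sure the 2M range backing the page has PAMT_4K memory */
	if (tdx_pamt_get(page)) {
		__free_page(page);
		return NULL;
	}

	return page;
}

static inline void tdx_free_page(struct page *page)
{
	if (!page)
		return;

	/* Drop the PAMT reference before returning the page */
	tdx_pamt_put(page);
	__free_page(page);
}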
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-12 9:55 ` Kirill A. Shutemov
@ 2025-05-14 0:00 ` Huang, Kai
2025-05-14 6:43 ` kirill.shutemov
2025-05-23 12:00 ` kirill.shutemov
0 siblings, 2 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-14 0:00 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, Zhao, Yan Y
Cc: Edgecombe, Rick P, seanjc@google.com, dave.hansen@linux.intel.com,
bp@alien8.de, x86@kernel.org, mingo@redhat.com,
tglx@linutronix.de, pbonzini@redhat.com, kvm@vger.kernel.org,
Yamahata, Isaku, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev
On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > covered by PAMT.
> > > > >
> > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > makes sure that the memory is covered by PAMT.
> > > > >
> > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > is available, allowing the implementation to allocate memory from a
> > > > > per-VCPU pool.
> > > > >
> > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > >
> > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > to vcpu by then. And not all callers provide vcpu.
> > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > cache?
>
> Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
I am not sure why a per-vcpu cache matters.
For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
an empty cache, and eventually __get_free_page() is used for the allocation in:
sp->external_spt =
kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
So why don't we actually create a kmem_cache for it with an actual 'ctor', and
call tdx_alloc_page() in that? This makes sure that when the "external_spt" is
allocated, the underlying PAMT entry is there.
For the last level guest memory page, similar to SEV, we can hook the
kvm_arch_gmem_prepare() to call tdx_alloc_page() to make PAMT entry ready.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path
2025-05-02 13:08 ` [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
@ 2025-05-14 0:07 ` Huang, Kai
2025-05-14 6:30 ` Chao Gao
1 sibling, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-14 0:07 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
seanjc@google.com
Cc: Edgecombe, Rick P, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> Preallocate a page to be used in the link_external_spt() and
> set_external_spte() paths.
>
> In the worst-case scenario, handling a page fault might require a
> tdx_nr_pamt_pages() pages for each page table level.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 91958c55f918..a5661499a176 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -849,6 +849,8 @@ struct kvm_vcpu_arch {
> */
> struct kvm_mmu_memory_cache mmu_external_spt_cache;
>
> + struct kvm_mmu_memory_cache pamt_page_cache;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..7bfa0dc50440 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -616,6 +616,15 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> if (r)
> return r;
> }
> +
> + if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM) {
> + int nr = tdx_nr_pamt_pages(tdx_get_sysinfo());
> + r = kvm_mmu_topup_memory_cache(&vcpu->arch.pamt_page_cache,
> + nr * PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> +
> return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> PT64_ROOT_MAX_LEVEL);
> }
> @@ -626,6 +635,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
> + kvm_mmu_free_memory_cache(&vcpu->arch.pamt_page_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
>
IIUC, this patch can be avoided if we create an actual kmem_cache for
mmu_external_spt_cache with an actual 'ctor' where we simply call
tdx_alloc_page(), as I replied to the previous patch.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory
2025-05-02 13:08 ` [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory Kirill A. Shutemov
@ 2025-05-14 1:11 ` Huang, Kai
2025-05-14 15:21 ` Vishal Annapurve
0 siblings, 1 reply; 63+ messages in thread
From: Huang, Kai @ 2025-05-14 1:11 UTC (permalink / raw)
To: Kirill A. Shutemov, pbonzini, seanjc
Cc: rick.p.edgecombe, isaku.yamahata, yan.y.zhao, tglx, mingo, bp,
dave.hansen, kvm, x86, linux-coco, linux-kernel
On 3/05/2025 1:08 am, Kirill A. Shutemov wrote:
> The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> with a few pages that cover 2M of host physical memory.
>
> PAMT memory can be reclaimed when the last user is gone. It can happen
> in a few code paths:
>
> - On TDH.PHYMEM.PAGE.RECLAIM in tdx_reclaim_td_control_pages() and
> tdx_reclaim_page().
>
> - On TDH.MEM.PAGE.REMOVE in tdx_sept_drop_private_spte().
>
> - In tdx_sept_zap_private_spte() for pages that were in the queue to be
> added with TDH.MEM.PAGE.ADD, but it never happened due to an error.
>
> Add tdx_pamt_put() in these code paths.
IMHO, instead of explicitly hooking tdx_pamt_put() into various places, we
should just do tdx_free_page() for the pages that were allocated by
tdx_alloc_page() (i.e., control pages, SEPT pages).
That means, IMHO, we should do PAMT allocation/free when we actually
*allocate* and *free* the target TDX private page(s). I.e., we should:
- For TDX private pages with normal kernel allocation (control pages,
SEPT pages etc), we use tdx_alloc_page() and tdx_free_page().
- For TDX private pages in page cache, i.e., guest_memfd, since we
cannot use tdx_{alloc|free}_page(), we hook guest_memfd code to call
tdx_pamt_{get|put}().
(I wish there's a way to unify the above two as well, but I don't have a
simple way to do that.)
I believe this can help simplifying the code.
So, ...
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> arch/x86/kvm/vmx/tdx.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 0f06ae7ff6b9..352f7b41f611 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -487,8 +487,11 @@ static int tdx_reclaim_page(struct page *page)
> int r;
>
> r = __tdx_reclaim_page(page);
> - if (!r)
> + if (!r) {
> tdx_clear_page(page);
> + tdx_pamt_put(page);
> + }
> +
> return r;
> }
>
... I think this change should be removed, and ...
[...]
> + tdx_pamt_put(kvm_tdx->td.tdr_page);
>
> __free_page(kvm_tdx->td.tdr_page);
... The above two should be just:
tdx_free_page(kvm_tdx->td.tdr_page);
and ...
> kvm_tdx->td.tdr_page = NULL;
> @@ -1768,6 +1772,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> return -EIO;
> }
> tdx_clear_page(page);
> + tdx_pamt_put(page);
> tdx_unpin(kvm, page);
> return 0;
> }
> @@ -1848,6 +1853,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> atomic64_dec(&kvm_tdx->nr_premapped);
> + tdx_pamt_put(page);
> tdx_unpin(kvm, page);
> return 0;
> }
... the above should be removed too.
For PAMT associated with sp->external_spt, we can call tdx_pamt_put()
when we free sp->external_spt.
For PAMT associated with TDX memory in guest_memfd, we can have a
guest_memfd-specific a_ops->folio_invalidate() in which we have a
hook opposite to kvm_gmem_prepare_folio() to do tdx_pamt_put(). That
should cover all the cases, right?
Or is there anything I missed?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
2025-05-05 12:44 ` Huang, Kai
@ 2025-05-14 5:25 ` Chao Gao
2025-05-23 10:46 ` Kirill A. Shutemov
2025-05-14 5:33 ` Chao Gao
2 siblings, 1 reply; 63+ messages in thread
From: Chao Gao @ 2025-05-14 5:25 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
>+static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
>+ struct list_head *pamt_pages)
>+{
>+ u64 err;
>+
>+ hpa = ALIGN_DOWN(hpa, SZ_2M);
Nit: it is better to use SZ_2M or PMD_SIZE consistently.
e.g., patch 2 uses PMD_SIZE:
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
+{
+ return &pamt_refcounts[hpa / PMD_SIZE];
+}
+EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
>+
>+ spin_lock(&pamt_lock);
>+
>+ /* Lost race to other tdx_pamt_add() */
>+ if (atomic_read(pamt_refcount) != 0) {
>+ atomic_inc(pamt_refcount);
>+ spin_unlock(&pamt_lock);
>+ tdx_free_pamt_pages(pamt_pages);
>+ return 0;
>+ }
>+
>+ err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pages);
>+
>+ if (err)
>+ tdx_free_pamt_pages(pamt_pages);
>+
>+ /*
>+ * tdx_hpa_range_not_free() is true if current task won race
>+ * against tdx_pamt_put().
>+ */
>+ if (err && !tdx_hpa_range_not_free(err)) {
>+ spin_unlock(&pamt_lock);
>+ pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
>+ return -EIO;
>+ }
IIUC, this chunk is needed because tdx_pamt_put() decreases the refcount
without holding the pamt_lock. Why not move that decrease inside the lock?
And I suggest that all accesses to the pamt_refcount should be performed with
the pamt_lock held. This can make the code much clearer. It's similar to how
kvm_usage_count is managed, where transitions from 0 to 1 or 1 to 0 require
extra work, but other cases simply increase or decrease the refcount.
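Roughly what I have in mind (untested sketch, reusing the names from this
patch; the tdx_supports_dynamic_pamt() check is omitted for brevity):

static int tdx_pamt_get(struct page *page)
{
	unsigned long hpa = ALIGN_DOWN(page_to_phys(page), SZ_2M);
	atomic_t *pamt_refcount = tdx_get_pamt_refcount(hpa);
	LIST_HEAD(pamt_pages);
	u64 err;

	/* Allocate outside the lock; freed below if it turns out to be unneeded */
	if (tdx_alloc_pamt_pages(&pamt_pages))
		return -ENOMEM;

	spin_lock(&pamt_lock);

	if (atomic_read(pamt_refcount)) {
		/* Not a 0->1 transition: just take another reference */
		atomic_inc(pamt_refcount);
		spin_unlock(&pamt_lock);
		tdx_free_pamt_pages(&pamt_pages);
		return 0;
	}

	err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, &pamt_pages);
	if (err) {
		spin_unlock(&pamt_lock);
		tdx_free_pamt_pages(&pamt_pages);
		pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
		return -EIO;
	}

	atomic_set(pamt_refcount, 1);
	spin_unlock(&pamt_lock);
	return 0;
}

tdx_pamt_put() would then do the mirror image under the same lock and only
call TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition, so the
TDX_HPA_RANGE_NOT_FREE special case goes away.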
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
2025-05-05 12:44 ` Huang, Kai
2025-05-14 5:25 ` Chao Gao
@ 2025-05-14 5:33 ` Chao Gao
2025-05-14 6:25 ` Kirill A. Shutemov
2 siblings, 1 reply; 63+ messages in thread
From: Chao Gao @ 2025-05-14 5:33 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
>+static void tdx_pamt_put(struct page *page)
>+{
>+ unsigned long hpa = page_to_phys(page);
>+ atomic_t *pamt_refcount;
>+ LIST_HEAD(pamt_pages);
>+ u64 err;
>+
>+ if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
>+ return;
>+
>+ hpa = ALIGN_DOWN(hpa, SZ_2M);
>+
>+ pamt_refcount = tdx_get_pamt_refcount(hpa);
>+ if (!atomic_dec_and_test(pamt_refcount))
>+ return;
>+
>+ spin_lock(&pamt_lock);
>+
>+ /* Lost race against tdx_pamt_add()? */
>+ if (atomic_read(pamt_refcount) != 0) {
>+ spin_unlock(&pamt_lock);
>+ return;
>+ }
>+
>+ err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
>+ spin_unlock(&pamt_lock);
>+
>+ if (err) {
>+ pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
Should the refcount be increased here, since the PAMT pages are not removed?
>+ return;
>+ }
>+
>+ tdx_free_pamt_pages(&pamt_pages);
>+}
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-02 13:08 ` [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops Kirill A. Shutemov
2025-05-06 11:55 ` Yan Zhao
@ 2025-05-14 6:15 ` Chao Gao
1 sibling, 0 replies; 63+ messages in thread
From: Chao Gao @ 2025-05-14 6:15 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
>+static int tdp_mmu_install_spte(struct kvm_vcpu *vcpu,
>+ struct tdp_iter *iter,
>+ u64 spte)
>+{
>+ kvm_pfn_t pfn = 0;
>+ int ret = 0;
>+
>+ if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(spte)) {
>+ pfn = spte_to_pfn(spte);
>+ ret = static_call(kvm_x86_phys_prepare)(vcpu, pfn);
nit: KVM uses kvm_x86_call() in most cases, e.g.,
ret = kvm_x86_call(phys_prepare)(vcpu, pfn);
>+ }
>+ if (ret)
>+ return ret;
fold this chunk into the if() statement above to align with tdp_mmu_link_sp()
below?
I'm concerned about handling phys_prepare() failures. Such failures may not be
recoverable. ...
>+ ret = tdp_mmu_set_spte_atomic(vcpu->kvm, iter, spte);
>+ if (pfn && ret)
>+ static_call(kvm_x86_phys_cleanup)(pfn);
>+
>+ return ret;
>+}
>+
> /*
> * Installs a last-level SPTE to handle a TDP page fault.
> * (NPT/EPT violation/misconfiguration)
>@@ -1170,7 +1190,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
>- else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>+ else if (tdp_mmu_install_spte(vcpu, iter, new_spte))
> return RET_PF_RETRY;
If RET_PF_RETRY is returned here, it could potentially cause an infinite loop.
I think we need a KVM_BUG_ON() somewhere.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-14 5:33 ` Chao Gao
@ 2025-05-14 6:25 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-14 6:25 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Wed, May 14, 2025 at 01:33:38PM +0800, Chao Gao wrote:
> >+static void tdx_pamt_put(struct page *page)
> >+{
> >+ unsigned long hpa = page_to_phys(page);
> >+ atomic_t *pamt_refcount;
> >+ LIST_HEAD(pamt_pages);
> >+ u64 err;
> >+
> >+ if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
> >+ return;
> >+
> >+ hpa = ALIGN_DOWN(hpa, SZ_2M);
> >+
> >+ pamt_refcount = tdx_get_pamt_refcount(hpa);
> >+ if (!atomic_dec_and_test(pamt_refcount))
> >+ return;
> >+
> >+ spin_lock(&pamt_lock);
> >+
> >+ /* Lost race against tdx_pamt_add()? */
> >+ if (atomic_read(pamt_refcount) != 0) {
> >+ spin_unlock(&pamt_lock);
> >+ return;
> >+ }
> >+
> >+ err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
> >+ spin_unlock(&pamt_lock);
> >+
> >+ if (err) {
> >+ pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
>
> Should the refcount be increased here, since the PAMT pages are not removed?
Right. Thanks.
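I.e., something like this for the next version (sketch):

	err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
	if (err) {
		/* PAMT pages stay in place, so restore the reference */
		atomic_inc(pamt_refcount);
		spin_unlock(&pamt_lock);
		pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
		return;
	}
	spin_unlock(&pamt_lock);

	tdx_free_pamt_pages(&pamt_pages);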
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path
2025-05-02 13:08 ` [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
2025-05-14 0:07 ` Huang, Kai
@ 2025-05-14 6:30 ` Chao Gao
2025-05-30 10:28 ` Kirill A. Shutemov
1 sibling, 1 reply; 63+ messages in thread
From: Chao Gao @ 2025-05-14 6:30 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Fri, May 02, 2025 at 04:08:25PM +0300, Kirill A. Shutemov wrote:
>Preallocate a page to be used in the link_external_spt() and
>set_external_spte() paths.
>
>In the worst-case scenario, handling a page fault might require a
>tdx_nr_pamt_pages() pages for each page table level.
>
>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> 2 files changed, 12 insertions(+)
>
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 91958c55f918..a5661499a176 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -849,6 +849,8 @@ struct kvm_vcpu_arch {
> */
> struct kvm_mmu_memory_cache mmu_external_spt_cache;
>
>+ struct kvm_mmu_memory_cache pamt_page_cache;
>+
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index a284dce227a0..7bfa0dc50440 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -616,6 +616,15 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> if (r)
> return r;
> }
>+
>+ if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM) {
The vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM check is identical to the
kvm_has_mirrored_tdp() check a few lines above.
>+ int nr = tdx_nr_pamt_pages(tdx_get_sysinfo());
Since you're already accessing tdx_sysinfo, you can check if dynamic PAMT is
enabled and allocate the pamt page cache accordingly.
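Something like (untested; assumes tdx_supports_dynamic_pamt() accepts the
pointer returned by tdx_get_sysinfo()):

	if (kvm_has_mirrored_tdp(vcpu->kvm) &&
	    tdx_supports_dynamic_pamt(tdx_get_sysinfo())) {
		int nr = tdx_nr_pamt_pages(tdx_get_sysinfo());

		r = kvm_mmu_topup_memory_cache(&vcpu->arch.pamt_page_cache,
					       nr * PT64_ROOT_MAX_LEVEL);
		if (r)
			return r;
	}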
>+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.pamt_page_cache,
>+ nr * PT64_ROOT_MAX_LEVEL);
>+ if (r)
>+ return r;
>+ }
>+
> return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> PT64_ROOT_MAX_LEVEL);
> }
>@@ -626,6 +635,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
>+ kvm_mmu_free_memory_cache(&vcpu->arch.pamt_page_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
>
>--
>2.47.2
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-14 0:00 ` Huang, Kai
@ 2025-05-14 6:43 ` kirill.shutemov
2025-05-19 5:00 ` Huang, Kai
2025-05-23 12:00 ` kirill.shutemov
1 sibling, 1 reply; 63+ messages in thread
From: kirill.shutemov @ 2025-05-14 6:43 UTC (permalink / raw)
To: Huang, Kai
Cc: Zhao, Yan Y, Edgecombe, Rick P, seanjc@google.com,
dave.hansen@linux.intel.com, bp@alien8.de, x86@kernel.org,
mingo@redhat.com, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > covered by PAMT.
> > > > > >
> > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > makes sure that the memory is covered by PAMT.
> > > > > >
> > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > per-VCPU pool.
> > > > > >
> > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > >
> > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > to vcpu by then. And not all callers provide vcpu.
> > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > cache?
> >
> > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
>
> I am not sure why per-vcpu cache matters.
>
> For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> an empty cache, and eventually __get_free_page() is used to allocate in:
>
> sp->external_spt =
> kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
>
> So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> allocated, the underneath PAMT entry is there.
This would make it hard to debug PAMT memory leaks. external_spt pages in the
pool will have PAMT memory tied to them, so we will have non-zero PAMT
memory usage with zero TDs running.
> For the last level guest memory page, similar to SEV, we can hook the
> kvm_arch_gmem_prepare() to call tdx_alloc_page() to make PAMT entry ready.
I don't think kvm_arch_gmem_prepare() is the right place to allocate PAMT
memory. THPs are dynamic and the page order can change due to split or
collapse between the time the page is allocated and the time it gets mapped
into the EPT. I am not sure if the SEV code is correct in this regard.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (11 preceding siblings ...)
2025-05-02 13:08 ` [RFC, PATCH 12/12] x86/virt/tdx: Enable Dynamic PAMT Kirill A. Shutemov
@ 2025-05-14 13:41 ` Sean Christopherson
2025-05-15 14:22 ` Kirill A. Shutemov
2025-05-14 20:33 ` Zhi Wang
13 siblings, 1 reply; 63+ messages in thread
From: Sean Christopherson @ 2025-05-14 13:41 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> applied, but rather to receive early feedback on the feature design and
> enabling.
In that case, please describe the design, and specifically *why* you chose this
particular design, along with the constraints and rules of dynamic PAMTs that
led to that decision. It would also be very helpful to know what options you
considered and discarded, so that others don't waste time coming up with solutions
that you already rejected.
> >From our perspective, this feature has a lower priority compared to huge
> page support. I will rebase this patchset on top of Yan's huge page
> enabling at a later time, as it requires additional work.
>
> Any feedback is welcome. We are open to ideas.
>
> =========================================================================
>
> The Physical Address Metadata Table (PAMT) holds TDX metadata for
> physical memory and must be allocated by the kernel during TDX module
> initialization.
>
> The exact size of the required PAMT memory is determined by the TDX
> module and may vary between TDX module versions, but currently it is
> approximately 0.4% of the system memory. This is a significant
> commitment, especially if it is not known upfront whether the machine
> will run any TDX guests.
>
> The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> PAMT_2M levels are still allocated on TDX module initialization, but the
> PAMT_4K level is allocated dynamically, reducing static allocations to
> approximately 0.004% of the system memory.
>
> PAMT memory is dynamically allocated as pages gain TDX protections.
> It is reclaimed when TDX protections have been removed from all
> pages in a contiguous area.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory
2025-05-14 1:11 ` Huang, Kai
@ 2025-05-14 15:21 ` Vishal Annapurve
2025-05-19 5:06 ` Huang, Kai
0 siblings, 1 reply; 63+ messages in thread
From: Vishal Annapurve @ 2025-05-14 15:21 UTC (permalink / raw)
To: Huang, Kai
Cc: Kirill A. Shutemov, pbonzini, seanjc, rick.p.edgecombe,
isaku.yamahata, yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm,
x86, linux-coco, linux-kernel
On Tue, May 13, 2025 at 6:12 PM Huang, Kai <kai.huang@intel.com> wrote:
>
>
>
> On 3/05/2025 1:08 am, Kirill A. Shutemov wrote:
> > The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> > PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> > with a few pages that cover 2M of host physical memory.
> >
> > PAMT memory can be reclaimed when the last user is gone. It can happen
> > in a few code paths:
> >
> > - On TDH.PHYMEM.PAGE.RECLAIM in tdx_reclaim_td_control_pages() and
> > tdx_reclaim_page().
> >
> > - On TDH.MEM.PAGE.REMOVE in tdx_sept_drop_private_spte().
> >
> > - In tdx_sept_zap_private_spte() for pages that were in the queue to be
> > added with TDH.MEM.PAGE.ADD, but it never happened due to an error.
> >
> > Add tdx_pamt_put() in these code paths.
>
> IMHO, instead of explicitly hooking tdx_pamt_put() to various places, we
> should just do tdx_free_page() for the pages that were allocated by
> tdx_alloc_page() (i.e., control pages, SEPT pages).
>
> That means, IMHO, we should do PAMT allocation/free when we actually
> *allocate* and *free* the target TDX private page(s). I.e., we should:
I think it's important to ensure that PAMT pages are *only* allocated
for a 2M range if it's getting mapped in EPT at 4K granularity.
Physical memory allocation order can be different from the EPT mapping
granularity.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
` (12 preceding siblings ...)
2025-05-14 13:41 ` [RFC, PATCH 00/12] TDX: " Sean Christopherson
@ 2025-05-14 20:33 ` Zhi Wang
2025-05-15 9:17 ` Kirill A. Shutemov
13 siblings, 1 reply; 63+ messages in thread
From: Zhi Wang @ 2025-05-14 20:33 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Fri, 2 May 2025 16:08:16 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to
> be applied, but rather to receive early feedback on the feature
> design and enabling.
>
> From our perspective, this feature has a lower priority compared to
> huge page support. I will rebase this patchset on top of Yan's huge
> page enabling at a later time, as it requires additional work.
>
> Any feedback is welcome. We are open to ideas.
>
Do we have any estimate of how much extra cost TVM creation/destruction
incurs when the PAMT allocation/de-allocation is tightly coupled to the
private page allocation/de-allocation? It has become common on many
platforms for metadata pages to have to be given to the TSM when doing
anything with private pages. When the pool of metadata pages is
extensible/shrinkable, there are always ideas about batch pre-charging the
pool and lazily batch-reclaiming it on some path, for performance
considerations or to match VM characteristics. That might turn into a
vendor-agnostic path in KVM with tunable configurations.
Z.
> =========================================================================
>
> The Physical Address Metadata Table (PAMT) holds TDX metadata for
> physical memory and must be allocated by the kernel during TDX module
> initialization.
>
> The exact size of the required PAMT memory is determined by the TDX
> module and may vary between TDX module versions, but currently it is
> approximately 0.4% of the system memory. This is a significant
> commitment, especially if it is not known upfront whether the machine
> will run any TDX guests.
>
> The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> PAMT_2M levels are still allocated on TDX module initialization, but
> the PAMT_4K level is allocated dynamically, reducing static
> allocations to approximately 0.004% of the system memory.
>
> PAMT memory is dynamically allocated as pages gain TDX protections.
> It is reclaimed when TDX protections have been removed from all
> pages in a contiguous area.
>
> TODO:
> - Rebase on top of Yan's huge page support series. Demotion requires
> additional handling with Dynamic PAMT;
> - Get better vmalloc API from core-mm and simplify patch 02/12.
>
> Kirill A. Shutemov (12):
> x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
> x86/virt/tdx: Allocate reference counters for PAMT memory
> x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
> x86/virt/tdx: Account PAMT memory and print if in /proc/meminfo
> KVM: TDX: Add tdx_pamt_get()/put() helpers
> KVM: TDX: Allocate PAMT memory in __tdx_td_init()
> KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
> KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
> KVM: TDX: Preallocate PAMT pages to be used in page fault path
> KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
> KVM: TDX: Reclaim PAMT memory
> x86/virt/tdx: Enable Dynamic PAMT
>
> arch/x86/include/asm/kvm-x86-ops.h | 2 +
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/include/asm/set_memory.h | 2 +
> arch/x86/include/asm/tdx.h | 22 ++
> arch/x86/include/asm/tdx_global_metadata.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 10 +
> arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++-
> arch/x86/kvm/vmx/main.c | 2 +
> arch/x86/kvm/vmx/tdx.c | 215 ++++++++++++++++++--
> arch/x86/kvm/vmx/tdx_errno.h | 1 +
> arch/x86/kvm/vmx/x86_ops.h | 9 +
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/meminfo.c | 11 +
> arch/x86/mm/pat/set_memory.c | 2 +-
> arch/x86/virt/vmx/tdx/tdx.c | 211 ++++++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 5 +-
> arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +
> virt/kvm/kvm_main.c | 1 +
> 18 files changed, 522 insertions(+), 29 deletions(-)
> create mode 100644 arch/x86/mm/meminfo.c
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-14 20:33 ` Zhi Wang
@ 2025-05-15 9:17 ` Kirill A. Shutemov
2025-05-15 14:03 ` Dave Hansen
0 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 9:17 UTC (permalink / raw)
To: Zhi Wang
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Wed, May 14, 2025 at 11:33:17PM +0300, Zhi Wang wrote:
> On Fri, 2 May 2025 16:08:16 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to
> > be applied, but rather to receive early feedback on the feature
> > design and enabling.
> >
> > From our perspective, this feature has a lower priority compared to
> > huge page support. I will rebase this patchset on top of Yan's huge
> > page enabling at a later time, as it requires additional work.
> >
> > Any feedback is welcome. We are open to ideas.
> >
>
> Do we have any estimation on how much extra cost on TVM creation/destroy
> when tightly couple the PAMT allocation/de-allocation to the private
> page allocation/de-allocation? It has been trendy nowadays that
> meta pages are required to be given to the TSM when doing stuff with
> private page in many platforms. When the pool of the meta page is
> extensible/shrinkable, there are always ideas about batch pre-charge the
> pool and lazy batch reclaim it at a certain path for performance
> considerations or VM characteristics. That might turn into a
> vendor-agnostic path in KVM with tunable configurations.
It depends on the pages that the page allocator gives to the TD. If memory is
not fragmented and the TD receives memory from the same 2M chunks, we do not
need much PAMT memory and we do not need to make additional SEAMCALLs to
add it. It also depends on the availability of huge pages.
From my tests, a typical TD boot takes about 20 MiB of PAMT memory if no
huge pages are allocated and about 2 MiB with huge pages. The overhead of
managing it is negligible, especially with huge pages: approximately
256 SEAMCALLs to add PAMT pages and the same number to remove them.
The consumption of PAMT memory for booting does not increase significantly
with the size of the TD as the guest accepts memory lazily. However, it will
increase as more memory is accepted if huge pages are not used.
I don't think we can justify any batching here.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-15 9:17 ` Kirill A. Shutemov
@ 2025-05-15 14:03 ` Dave Hansen
0 siblings, 0 replies; 63+ messages in thread
From: Dave Hansen @ 2025-05-15 14:03 UTC (permalink / raw)
To: Kirill A. Shutemov, Zhi Wang
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On 5/15/25 02:17, Kirill A. Shutemov wrote:
> I don't think we can justify any batching here.
There is one primary goal here:
Reduce TDX overhead when not running TDX guests.
It has the side effect of being _able_ to reduce the amount of memory that
TDX guests consume for PAMT when they use >=2M pages only. It has the
theoretical capability to do the same for 4k users, but only when the pages
are quite contiguous.
Right?
The "not running TDX guest" and ">=2M pages" benefits are relatively
easy. The 4k one is hard and is going to take a lot more work.
Could we please focus on the easy one for now and not get distracted by
the hard one that might not even be worth it in the end?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-14 13:41 ` [RFC, PATCH 00/12] TDX: " Sean Christopherson
@ 2025-05-15 14:22 ` Kirill A. Shutemov
2025-05-15 15:03 ` Dave Hansen
0 siblings, 1 reply; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 14:22 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On Wed, May 14, 2025 at 06:41:10AM -0700, Sean Christopherson wrote:
> On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> > applied, but rather to receive early feedback on the feature design and
> > enabling.
>
> In that case, please describe the design, and specifically *why* you chose this
> particular design, along with the constraints and rules of dynamic PAMTs that
> led to that decision. It would also be very helpful to know what options you
> considered and discarded, so that others don't waste time coming up with solutions
> that you already rejected.
Dynamic PAMT support in TDX module
==================================
Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, the allocation of PAMT_4K is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4k page.
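(For scale: one bit per 4k page is 1/32768 of the covered memory, i.e. about
0.003%; together with the still-static PAMT_2M and PAMT_1G levels this is
roughly where the ~0.004% figure for static allocations comes from.)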
VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
add/remove PAMT memory in form of page pair. There's no requirement for
these pages to be contiguous.
A page pair supplied via TDH.PHYMEM.PAMT.ADD covers a specified 2M region.
It allows any 4K page from the region to be usable by the TDX module.
With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):
- TDH.MNG.CREATE
- TDH.MNG.ADDCX
- TDH.VP.ADDCX
- TDH.VP.CREATE
- TDH.MEM.PAGE.ADD
- TDH.MEM.PAGE.AUG
- TDH.MEM.PAGE.DEMOTE
- TDH.MEM.PAGE.RELOCATE
Basically, if you supply memory to a TD, this memory has to be backed by PAMT
memory.
Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.
The TDX module tracks PAMT memory usage and can give the VMM a hint that PAMT
memory can be removed. Such a hint is provided by all SEAMCALLs that
remove memory from a TD:
- TDH.MEM.SEPT.REMOVE
- TDH.MEM.PAGE.REMOVE
- TDH.MEM.PAGE.PROMOTE
- TDH.MEM.PAGE.RELOCATE
- TDH.PHYMEM.PAGE.RECLAIM
With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as additional
input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE returns the PAMT
page pair that is no longer needed.
PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical address.
Each entry in the tree can be locked with a shared or exclusive lock. Any
modification of the tree requires an exclusive lock.
Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking shared locks on entries. This is required to make sure that the page
pointed to by the HPA is of a type compatible with the usage.
TDCALLs don't take PAMT locks as none of them take an HPA as an argument.
Dynamic PAMT enabling in kernel
===============================
The kernel maintains a refcount for every 2M region with two helpers,
tdx_pamt_get() and tdx_pamt_put().
The refcount represents the number of users of the PAMT memory in the region.
The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
PAMT memory gets allocated as part of TD init and VCPU init, on populating
the SEPT tree, and on adding guest memory (both during TD build and via AUG
on accept).
PAMT memory is removed on reclaim of control pages and guest memory.
Populating PAMT memory on fault is tricky as we cannot allocate memory
from the context where it is needed. I introduced a pair of kvm_x86_ops to
allocate PAMT memory from a per-VCPU pool in a context where the VCPU is
still around, and to free it on failure. This flow will likely be reworked
in the next versions.
Previous attempt on Dynamic PAMT enabling
=========================================
My initial kernel enabling attempt was quite different. I wanted to make
PAMT allocation lazy: only try to add a PAMT page pair if a SEAMCALL fails
due to missing PAMT, and reclaim it back based on the hint provided by the
TDX module.
The motivation was to avoid duplicating on the kernel side the PAMT memory
refcounting that the TDX module already does.
This approach is inherently more racy as we don't serialize PAMT memory
add/remove against the SEAMCALLs that add/remove memory for a TD. Such
serialization would require global locking, which is a no-go.
I made this approach work, but at some point I realized that it cannot be
robust as long as we want to avoid TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will pop up as a result of the races I mentioned above.
I gave up on this approach and went with the current one, which uses
explicit refcounting.
Brain dumped.
Let me know if anything is unclear.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-15 14:22 ` Kirill A. Shutemov
@ 2025-05-15 15:03 ` Dave Hansen
2025-05-15 16:02 ` Kirill A. Shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2025-05-15 15:03 UTC (permalink / raw)
To: Kirill A. Shutemov, Sean Christopherson
Cc: pbonzini, rick.p.edgecombe, isaku.yamahata, kai.huang, yan.y.zhao,
tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco, linux-kernel
On 5/15/25 07:22, Kirill A. Shutemov wrote:
> VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
> new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
> add/remove PAMT memory in form of page pair. There's no requirement for
> these pages to be contiguous.
BTW, that second sentence is a little goofy. Is it talking about
ADD/REMOVE being a matched pair? Or that there needs to be 8k of
metadata storage provided to each ADD/REMOVE call?
One thing I've noticed in writing changelogs and so forth is that
repetition can hurt understanding if the concepts aren't the same. Like
saying there is a "pair" of calls and a "pair" of pages when the fact
that both are pairs is a coincidence rather than an intentional and
important part of the design.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
2025-05-15 15:03 ` Dave Hansen
@ 2025-05-15 16:02 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 16:02 UTC (permalink / raw)
To: Dave Hansen
Cc: Sean Christopherson, pbonzini, rick.p.edgecombe, isaku.yamahata,
kai.huang, yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86,
linux-coco, linux-kernel
On Thu, May 15, 2025 at 08:03:28AM -0700, Dave Hansen wrote:
> On 5/15/25 07:22, Kirill A. Shutemov wrote:
> > VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
> > new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
> > add/remove PAMT memory in form of page pair. There's no requirement for
> > these pages to be contiguous.
>
> BTW, that second sentence is a little goofy. Is it talking about
> ADD/REMOVE being a matched pair? Or that there needs to be 8k of
> metadata storage provided to each ADD/REMOVE call?
Both :P
The pair of SEAMCALLs operates on pairs of pages.
> One thing I've noticed in writing changelogs and so forth is that
> repetition can hurt understanding if the concepts aren't the same. Like
> saying there is a "pair" of calls and a "pair" of pages when the fact
> that both are pairs is a coincidence rather than an intentional and
> important part of the design.
Yeah, I see it.
I will try to avoid "pair" for SEAMCALLs in the Dynamic PAMT context.
Maybe that will clear up the confusion.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-14 6:43 ` kirill.shutemov
@ 2025-05-19 5:00 ` Huang, Kai
0 siblings, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-19 5:00 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
bp@alien8.de, dave.hansen@linux.intel.com, mingo@redhat.com,
tglx@linutronix.de, Zhao, Yan Y, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Wed, 2025-05-14 at 09:43 +0300, kirill.shutemov@linux.intel.com wrote:
> On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> > On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > > covered by PAMT.
> > > > > > >
> > > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > > makes sure that the memory is covered by PAMT.
> > > > > > >
> > > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > > per-VCPU pool.
> > > > > > >
> > > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > > >
> > > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > > to vcpu by then. And not all callers provide vcpu.
> > > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > > cache?
> > >
> > > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
> >
> > I am not sure why per-vcpu cache matters.
> >
> > For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> > an empty cache, and eventually __get_free_page() is used to allocate in:
> >
> > sp->external_spt =
> > kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> >
> > So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> > can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> > allocated, the underneath PAMT entry is there.
>
> This would make hard to debug PAMT memory leaks. external_spt pages in the
> pool will have PAMT memory tied to them, so we will have non-zero PAMT
> memory usage with zero TDs running.
Why is that? AFAICT all 'external_spt' pages are freed when the TD is gone.
>
> > For the last level guest memory page, similar to SEV, we can hook the
> > kvm_arch_gmem_prepare() to call tdx_alloc_page() to make PAMT entry ready.
>
> I don't think kvm_arch_gmem_prepare() is right place to allocate PAMT
> memory. THPs are dynamic and page order can change due to split or
> collapse between the time the page is allocated and gets mapped into EPT.
> I am not sure if SEV code is correct in this regard.
Yeah, agreed. Not sure how SEV-SNP handles large page split/merge either.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory
2025-05-14 15:21 ` Vishal Annapurve
@ 2025-05-19 5:06 ` Huang, Kai
0 siblings, 0 replies; 63+ messages in thread
From: Huang, Kai @ 2025-05-19 5:06 UTC (permalink / raw)
To: Annapurve, Vishal
Cc: Edgecombe, Rick P, seanjc@google.com, bp@alien8.de,
dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
x86@kernel.org, mingo@redhat.com, kirill.shutemov@linux.intel.com,
tglx@linutronix.de, pbonzini@redhat.com, Zhao, Yan Y,
Yamahata, Isaku, kvm@vger.kernel.org, linux-coco@lists.linux.dev
On Wed, 2025-05-14 at 08:21 -0700, Vishal Annapurve wrote:
> On Tue, May 13, 2025 at 6:12 PM Huang, Kai <kai.huang@intel.com> wrote:
> >
> >
> >
> > On 3/05/2025 1:08 am, Kirill A. Shutemov wrote:
> > > The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> > > PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> > > with a few pages that cover 2M of host physical memory.
> > >
> > > PAMT memory can be reclaimed when the last user is gone. It can happen
> > > in a few code paths:
> > >
> > > - On TDH.PHYMEM.PAGE.RECLAIM in tdx_reclaim_td_control_pages() and
> > > tdx_reclaim_page().
> > >
> > > - On TDH.MEM.PAGE.REMOVE in tdx_sept_drop_private_spte().
> > >
> > > - In tdx_sept_zap_private_spte() for pages that were in the queue to be
> > > added with TDH.MEM.PAGE.ADD, but it never happened due to an error.
> > >
> > > Add tdx_pamt_put() in these code paths.
> >
> > IMHO, instead of explicitly hooking tdx_pamt_put() to various places, we
> > should just do tdx_free_page() for the pages that were allocated by
> > tdx_alloc_page() (i.e., control pages, SEPT pages).
> >
> > That means, IMHO, we should do PAMT allocation/free when we actually
> > *allocate* and *free* the target TDX private page(s). I.e., we should:
>
> I think it's important to ensure that PAMT pages are *only* allocated
> for a 2M range if it's getting mapped in EPT at 4K granularity.
> Physical memory allocation order can be different from the EPT mapping
> granularity.
Agreed. Thanks.
I still think all control pages and secure EPT pages can just use
tdx_{alloc|free}_page() though (because we always allocate and use them at 4K
granularity).
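Roughly something like the below, as an illustration only (a sketch built on
the tdx_pamt_get()/tdx_pamt_put() helpers from patch 05, not the actual
patchset code):

static struct page *tdx_alloc_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return NULL;

	/* Make sure the PAMT_4K backing for the 2M range is in place. */
	if (tdx_pamt_get(page)) {
		__free_page(page);
		return NULL;
	}

	return page;
}

static void tdx_free_page(struct page *page)
{
	if (!page)
		return;

	/* Drop the PAMT reference together with the page itself. */
	tdx_pamt_put(page);
	__free_page(page);
}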
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-05 12:44 ` Huang, Kai
2025-05-07 1:01 ` Yan Zhao
2025-05-07 16:31 ` Dave Hansen
@ 2025-05-23 9:42 ` kirill.shutemov
2 siblings, 0 replies; 63+ messages in thread
From: kirill.shutemov @ 2025-05-23 9:42 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
mingo@redhat.com, Zhao, Yan Y, tglx@linutronix.de,
kvm@vger.kernel.org, linux-coco@lists.linux.dev, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Mon, May 05, 2025 at 12:44:26PM +0000, Huang, Kai wrote:
> On Fri, 2025-05-02 at 16:08 +0300, Kirill A. Shutemov wrote:
> > Introduce a pair of helpers to allocate and free memory for a given 2M
> > range. The range is represented by struct page for any memory in the
> > range and the PAMT memory by a list of pages.
> >
> > Use per-2M refcounts to detect when PAMT memory has to be allocated and
> > when it can be freed.
> >
> > pamt_lock spinlock serializes against races between multiple
> > tdx_pamt_add() as well as tdx_pamt_add() vs tdx_pamt_put().
>
> Maybe elaborate a little bit on _why_ a spinlock is used?
>
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> > arch/x86/include/asm/tdx.h | 2 +
> > arch/x86/kvm/vmx/tdx.c | 123 +++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_errno.h | 1 +
> > 3 files changed, 126 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 8091bf5b43cc..42449c054938 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -135,6 +135,8 @@ static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
> > return sysinfo->tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
> > }
> >
> > +atomic_t *tdx_get_pamt_refcount(unsigned long hpa);
> > +
>
> This at least needs to be in the same patch which exports it. But as replied to
> patch 2, I think we should just move the code in this patch to TDX core code.
>
> > int tdx_guest_keyid_alloc(void);
> > u32 tdx_get_nr_guest_keyids(void);
> > void tdx_guest_keyid_free(unsigned int keyid);
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index b952bc673271..ea7e2d93fb44 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -207,6 +207,10 @@ static bool tdx_operand_busy(u64 err)
> > return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
> > }
> >
> > +static bool tdx_hpa_range_not_free(u64 err)
> > +{
> > + return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_HPA_RANGE_NOT_FREE;
> > +}
> >
> > /*
> > * A per-CPU list of TD vCPUs associated with a given CPU.
> > @@ -276,6 +280,125 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > vcpu->cpu = -1;
> > }
> >
> > +static DEFINE_SPINLOCK(pamt_lock);
> > +
> > +static void tdx_free_pamt_pages(struct list_head *pamt_pages)
> > +{
> > + struct page *page;
> > +
> > + while ((page = list_first_entry_or_null(pamt_pages, struct page, lru))) {
> > + list_del(&page->lru);
> > + __free_page(page);
> > + }
> > +}
> > +
> > +static int tdx_alloc_pamt_pages(struct list_head *pamt_pages)
> > +{
> > + for (int i = 0; i < tdx_nr_pamt_pages(tdx_sysinfo); i++) {
> > + struct page *page = alloc_page(GFP_KERNEL);
> > + if (!page)
> > + goto fail;
> > + list_add(&page->lru, pamt_pages);
> > + }
> > + return 0;
> > +fail:
> > + tdx_free_pamt_pages(pamt_pages);
> > + return -ENOMEM;
> > +}
> > +
> > +static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> > + struct list_head *pamt_pages)
> > +{
> > + u64 err;
> > +
> > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > +
> > + spin_lock(&pamt_lock);
>
> Just curious, can the lock be per-2M-range?
>
> > +
> > + /* Lost race to other tdx_pamt_add() */
> > + if (atomic_read(pamt_refcount) != 0) {
> > + atomic_inc(pamt_refcount);
> > + spin_unlock(&pamt_lock);
> > + tdx_free_pamt_pages(pamt_pages);
>
> It's unfortunate that each caller of tdx_pamt_add() needs to allocate the
> PAMT pages outside the spinlock first and then free them here.
>
> I am thinking if we make tdx_pamt_add() return:
>
> * > 0: PAMT pages already added (another tdx_pamt_add() won)
> * = 0: PAMT pages added successfully
> * < 0: error code
>
> .. then we at least could move tdx_free_pamt_pages() to the caller too.
>
> > + return 0;
> > + }
> > +
> > + err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pages);
> > +
> > + if (err)
> > + tdx_free_pamt_pages(pamt_pages);
>
> Seems we are calling tdx_free_pamt_pages() within the spinlock, which is not
> consistent with the case above where another tdx_pamt_add() has won the race.
>
> > +
> > + /*
> > + * tdx_hpa_range_not_free() is true if current task won race
> > + * against tdx_pamt_put().
> > + */
> > + if (err && !tdx_hpa_range_not_free(err)) {
> > + spin_unlock(&pamt_lock);
> > + pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
> > + return -EIO;
> > + }
>
> I had a hard time figuring out why we need to handle tdx_hpa_range_not_free()
> explicitly. IIUC, it is because atomic_dec_and_test() is used in
> tdx_pamt_put(), in which case the atomic_t can reach 0 outside of the
> spinlock, thus tdh_phymem_pamt_add() can be called while there are still PAMT
> pages populated.
>
> But ...
>
> > +
> > + atomic_set(pamt_refcount, 1);
> > + spin_unlock(&pamt_lock);
> > + return 0;
> > +}
> > +
> > +static int tdx_pamt_get(struct page *page)
> > +{
> > + unsigned long hpa = page_to_phys(page);
> > + atomic_t *pamt_refcount;
> > + LIST_HEAD(pamt_pages);
> > +
> > + if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
> > + return 0;
> > +
> > + pamt_refcount = tdx_get_pamt_refcount(hpa);
> > + WARN_ON_ONCE(atomic_read(pamt_refcount) < 0);
> > +
> > + if (atomic_inc_not_zero(pamt_refcount))
> > + return 0;
>
> ... if we set the initial value of pamt_refcount to -1, and use
> atomic_inc_unless_negative() here:
>
> if (atomic_inc_unless_negative(pamt_refcount))
> return 0;
>
> if (tdx_alloc_pamt_pages(&pamt_pages))
> return -ENOMEM;
>
> spin_lock(&pamt_lock);
> ret = tdx_pamt_add(hpa, &pamt_pages);
> if (ret >= 0)
> atomic_inc(pamt_refcount, 0);
> spin_unlock(&pamt_lock);
>
> /*
> * If another tdx_pamt_get() won the race, or in case of
> * error, PAMT pages are not used and can be freed.
> */
> if (ret)
> tdx_free_pamt_pages(&pamt_pages);
>
> return ret >= 0 ? 0 : ret;
>
> and ...
>
> > +
> > + if (tdx_alloc_pamt_pages(&pamt_pages))
> > + return -ENOMEM;
> > +
> > + return tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);
> > +}
> > +
> > +static void tdx_pamt_put(struct page *page)
> > +{
> > + unsigned long hpa = page_to_phys(page);
> > + atomic_t *pamt_refcount;
> > + LIST_HEAD(pamt_pages);
> > + u64 err;
> > +
> > + if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
> > + return;
> > +
> > + hpa = ALIGN_DOWN(hpa, SZ_2M);
> > +
> > + pamt_refcount = tdx_get_pamt_refcount(hpa);
> > + if (!atomic_dec_and_test(pamt_refcount))
> > + return;
>
> ... use atomic_dec_if_positive() here, we should be able to avoid the special
> handling of tdx_hpa_range_not_free() in tdx_pamt_get(). Something like:
>
> if (atomic_dec_if_positive(pamt_refcount) >= 0)
> return;
>
> spin_lock(&pamt_lock);
>
> /* tdx_pamt_get() called more than once */
> if (atomic_read(pamt_refcount) > 0) {
This check does nothing to protect against a parallel increment of the
counter: we get here with pamt_refcount == 0, and a parallel
atomic_inc_unless_negative() on the fast path is free to bump the counter
without taking the lock just after this check. For example, one CPU in
tdx_pamt_put() sees pamt_refcount == 0 under the lock and proceeds to remove
the PAMT pages, while another CPU in tdx_pamt_get() bumps the refcount from 0
to 1 and returns success, assuming the PAMT memory is still there.
So the code below will free PAMT memory while there is still a user.
> spin_unlock(&pamt_lock);
> return;
> }
>
> err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
> atomic_set(pamt_refcount, -1);
> spin_unlock(&pamt_lock);
>
> tdx_free_pamt_pages(&pamt_pages);
>
> Hmm.. am I missing anything?
>
> > +
> > + spin_lock(&pamt_lock);
> > +
> > + /* Lost race against tdx_pamt_add()? */
> > + if (atomic_read(pamt_refcount) != 0) {
> > + spin_unlock(&pamt_lock);
> > + return;
> > + }
> > +
> > + err = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, &pamt_pages);
> > + spin_unlock(&pamt_lock);
> > +
> > + if (err) {
> > + pr_tdx_error(TDH_PHYMEM_PAMT_REMOVE, err);
> > + return;
> > + }
> > +
> > + tdx_free_pamt_pages(&pamt_pages);
> > +}
> > +
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers
2025-05-14 5:25 ` Chao Gao
@ 2025-05-23 10:46 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-23 10:46 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Wed, May 14, 2025 at 01:25:37PM +0800, Chao Gao wrote:
> >+static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
> >+ struct list_head *pamt_pages)
> >+{
> >+ u64 err;
> >+
> >+ hpa = ALIGN_DOWN(hpa, SZ_2M);
>
> Nit: it is better to use SZ_2M or PMD_SIZE consistently.
>
> e.g., patch 2 uses PMD_SIZE:
>
> +atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> +{
> + return &pamt_refcounts[hpa / PMD_SIZE];
> +}
> +EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
>
> >+
> >+ spin_lock(&pamt_lock);
> >+
> >+ /* Lost race to other tdx_pamt_add() */
> >+ if (atomic_read(pamt_refcount) != 0) {
> >+ atomic_inc(pamt_refcount);
> >+ spin_unlock(&pamt_lock);
> >+ tdx_free_pamt_pages(pamt_pages);
> >+ return 0;
> >+ }
> >+
> >+ err = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pages);
> >+
> >+ if (err)
> >+ tdx_free_pamt_pages(pamt_pages);
> >+
>
> >+ /*
> >+ * tdx_hpa_range_not_free() is true if current task won race
> >+ * against tdx_pamt_put().
> >+ */
> >+ if (err && !tdx_hpa_range_not_free(err)) {
> >+ spin_unlock(&pamt_lock);
> >+ pr_tdx_error(TDH_PHYMEM_PAMT_ADD, err);
> >+ return -EIO;
> >+ }
>
> IIUC, this chunk is needed because tdx_pamt_put() decreases the refcount
> without holding the pamt_lock. Why not move that decrease inside the lock?
>
> And I suggest that all accesses to the pamt_refcount should be performed with
> the pamt_lock held. This can make the code much clearer. It's similar to how
> kvm_usage_count is managed, where transitions from 0 to 1 or 1 to 0 require
> extra work, but other cases simply increase or decrease the refcount.
The vast majority of cases will take the fast path, which requires a single
atomic operation. We could move it under the lock, but that would double the
number of atomics. I don't see a strong reason to do this.
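For comparison, the "every refcount update under pamt_lock" variant of
tdx_pamt_get() would look roughly like this (just a sketch, not the patch
code); the common case then pays for a spin_lock()/spin_unlock() pair on top
of the counter update instead of a single atomic_inc_not_zero():

static int tdx_pamt_get(struct page *page)
{
	unsigned long hpa = page_to_phys(page);
	atomic_t *pamt_refcount;
	LIST_HEAD(pamt_pages);

	if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
		return 0;

	pamt_refcount = tdx_get_pamt_refcount(hpa);

	/* Fast path, now behind the lock. */
	spin_lock(&pamt_lock);
	if (atomic_read(pamt_refcount)) {
		atomic_inc(pamt_refcount);
		spin_unlock(&pamt_lock);
		return 0;
	}
	spin_unlock(&pamt_lock);

	if (tdx_alloc_pamt_pages(&pamt_pages))
		return -ENOMEM;

	return tdx_pamt_add(pamt_refcount, hpa, &pamt_pages);
}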
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-14 0:00 ` Huang, Kai
2025-05-14 6:43 ` kirill.shutemov
@ 2025-05-23 12:00 ` kirill.shutemov
2025-06-05 13:01 ` kirill.shutemov
1 sibling, 1 reply; 63+ messages in thread
From: kirill.shutemov @ 2025-05-23 12:00 UTC (permalink / raw)
To: Huang, Kai
Cc: Zhao, Yan Y, Edgecombe, Rick P, seanjc@google.com,
dave.hansen@linux.intel.com, bp@alien8.de, x86@kernel.org,
mingo@redhat.com, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > covered by PAMT.
> > > > > >
> > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > makes sure that the memory is covered by PAMT.
> > > > > >
> > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > per-VCPU pool.
> > > > > >
> > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > >
> > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > to vcpu by then. And not all callers provide vcpu.
> > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > cache?
> >
> > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
>
> I am not sure why per-vcpu cache matters.
>
> For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> an empty cache, and eventually __get_free_page() is used to allocate in:
>
> sp->external_spt =
> kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
>
> So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> allocated, the underneath PAMT entry is there.
I looked closer at this, and while it is a good idea, the ctor in kmem_cache
cannot fail, which makes this approach not viable.
I guess we can add a constructor directly into struct kvm_mmu_memory_cache.
Let me play with this.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path
2025-05-14 6:30 ` Chao Gao
@ 2025-05-30 10:28 ` Kirill A. Shutemov
0 siblings, 0 replies; 63+ messages in thread
From: Kirill A. Shutemov @ 2025-05-30 10:28 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, rick.p.edgecombe, isaku.yamahata, kai.huang,
yan.y.zhao, tglx, mingo, bp, dave.hansen, kvm, x86, linux-coco,
linux-kernel
On Wed, May 14, 2025 at 02:30:34PM +0800, Chao Gao wrote:
> On Fri, May 02, 2025 at 04:08:25PM +0300, Kirill A. Shutemov wrote:
> >Preallocate a page to be used in the link_external_spt() and
> >set_external_spte() paths.
> >
> >In the worst-case scenario, handling a page fault might require
> > tdx_nr_pamt_pages() pages for each page table level.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
> > 2 files changed, 12 insertions(+)
> >
> >diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >index 91958c55f918..a5661499a176 100644
> >--- a/arch/x86/include/asm/kvm_host.h
> >+++ b/arch/x86/include/asm/kvm_host.h
> >@@ -849,6 +849,8 @@ struct kvm_vcpu_arch {
> > */
> > struct kvm_mmu_memory_cache mmu_external_spt_cache;
> >
> >+ struct kvm_mmu_memory_cache pamt_page_cache;
> >+
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > * In vcpu_run, we switch between the user and guest FPU contexts.
> >diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >index a284dce227a0..7bfa0dc50440 100644
> >--- a/arch/x86/kvm/mmu/mmu.c
> >+++ b/arch/x86/kvm/mmu/mmu.c
> >@@ -616,6 +616,15 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > if (r)
> > return r;
> > }
> >+
> >+ if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM) {
>
> The check for vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM is identical to
> kvm_has_mirrored_tdp() a few lines above.
Well, yes. But I think it is conceptually different. There can be a
different virtualization mode that has mirrored TDP but is not TDX.
>
> >+ int nr = tdx_nr_pamt_pages(tdx_get_sysinfo());
>
> Since you're already accessing tdx_sysinfo, you can check if dynamic PAMT is
> enabled and allocate the pamt page cache accordingly.
I will hide it in tdx_nr_pamt_pages(), which would return 0 if Dynamic PAMT
is disabled.
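Something like the below (a sketch of the reworked helper, based on its
current definition):

static inline int tdx_nr_pamt_pages(const struct tdx_sys_info *sysinfo)
{
	if (!tdx_supports_dynamic_pamt(sysinfo))
		return 0;

	return sysinfo->tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
}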
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-05-23 12:00 ` kirill.shutemov
@ 2025-06-05 13:01 ` kirill.shutemov
2025-06-05 22:21 ` Huang, Kai
0 siblings, 1 reply; 63+ messages in thread
From: kirill.shutemov @ 2025-06-05 13:01 UTC (permalink / raw)
To: Huang, Kai
Cc: Zhao, Yan Y, Edgecombe, Rick P, seanjc@google.com,
dave.hansen@linux.intel.com, bp@alien8.de, x86@kernel.org,
mingo@redhat.com, tglx@linutronix.de, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Fri, May 23, 2025 at 03:00:56PM +0300, kirill.shutemov@linux.intel.com wrote:
> On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> > On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > > covered by PAMT.
> > > > > > >
> > > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > > makes sure that the memory is covered by PAMT.
> > > > > > >
> > > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > > per-VCPU pool.
> > > > > > >
> > > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > > >
> > > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > > to vcpu by then. And not all callers provide vcpu.
> > > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > > cache?
> > >
> > > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
> >
> > I am not sure why per-vcpu cache matters.
> >
> > For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> > an empty cache, and eventually __get_free_page() is used to allocate in:
> >
> > sp->external_spt =
> > kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> >
> > So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> > can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> > allocated, the underneath PAMT entry is there.
>
> I looked closer at this, and while it is a good idea, the ctor in kmem_cache
> cannot fail, which makes this approach not viable.
>
> I guess we can add a constructor directly into struct kvm_mmu_memory_cache.
> Let me play with this.
I failed to make it work.
We would need a destructor paired with the constructor to do PAMT-aware
freeing, and all free paths would need to be redirected to it. That requires
substantial rework, and I don't think it is worth the effort.
Will do manual PAMT management for SPT in TDX code.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-06-05 13:01 ` kirill.shutemov
@ 2025-06-05 22:21 ` Huang, Kai
2025-06-06 10:20 ` kirill.shutemov
0 siblings, 1 reply; 63+ messages in thread
From: Huang, Kai @ 2025-06-05 22:21 UTC (permalink / raw)
To: kirill.shutemov@linux.intel.com
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
bp@alien8.de, dave.hansen@linux.intel.com, mingo@redhat.com,
tglx@linutronix.de, Zhao, Yan Y, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Thu, 2025-06-05 at 16:01 +0300, kirill.shutemov@linux.intel.com wrote:
> On Fri, May 23, 2025 at 03:00:56PM +0300, kirill.shutemov@linux.intel.com wrote:
> > On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> > > On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > > > covered by PAMT.
> > > > > > > >
> > > > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > > > makes sure that the memory is covered by PAMT.
> > > > > > > >
> > > > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > > > per-VCPU pool.
> > > > > > > >
> > > > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > > > >
> > > > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > > > to vcpu by then. And not all callers provide vcpu.
> > > > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > > > cache?
> > > >
> > > > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
> > >
> > > I am not sure why per-vcpu cache matters.
> > >
> > > For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> > > an empty cache, and eventually __get_free_page() is used to allocate in:
> > >
> > > sp->external_spt =
> > > kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> > >
> > > So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> > > can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> > > allocated, the underneath PAMT entry is there.
> >
> > I looked closer at this, and while it is a good idea, the ctor in kmem_cache
> > cannot fail, which makes this approach not viable.
> >
> > I guess we can add a constructor directly into struct kvm_mmu_memory_cache.
> > Let me play with this.
>
> I failed to make it work.
>
> We would need a destructor paired with the constructor to do PAMT-aware
> freeing, and all free paths would need to be redirected to it. That requires
> substantial rework, and I don't think it is worth the effort.
>
> Will do manual PAMT management for SPT in TDX code.
Thanks for the effort.
Maybe something like the below?
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..48732270bff0 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -164,8 +164,10 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
 	return sp->role.is_mirror;
 }
-static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static inline int kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
+	int r;
+
 	/*
 	 * external_spt is allocated for TDX module to hold private EPT mappings,
 	 * TDX module will initialize the page by itself.
@@ -173,6 +175,12 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
 	 * KVM only interacts with sp->spt for private EPT operations.
 	 */
 	sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
+
+	r = tdx_pamt_get(virt_to_page(sp->external_spt));
+	if (r)
+		free_page((unsigned long)sp->external_spt);
+
+	return r;
 }
 static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..2d3a716d9195 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -55,7 +55,10 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	free_page((unsigned long)sp->external_spt);
+	if (sp->external_spt) {
+		free_page((unsigned long)sp->external_spt);
+		tdx_pamt_put(virt_to_page(sp->external_spt));
+	}
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
@@ -1277,8 +1280,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	 */
 	sp = tdp_mmu_alloc_sp(vcpu);
 	tdp_mmu_init_child_sp(sp, &iter);
-	if (is_mirror_sp(sp))
-		kvm_mmu_alloc_external_spt(vcpu, sp);
+	if (is_mirror_sp(sp)) {
+		r = kvm_mmu_alloc_external_spt(vcpu, sp);
+		if (r) {
+			tdp_mmu_free_sp(sp);
+			goto retry;
+		}
+	}
 	sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
2025-06-05 22:21 ` Huang, Kai
@ 2025-06-06 10:20 ` kirill.shutemov
0 siblings, 0 replies; 63+ messages in thread
From: kirill.shutemov @ 2025-06-06 10:20 UTC (permalink / raw)
To: Huang, Kai
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
bp@alien8.de, dave.hansen@linux.intel.com, mingo@redhat.com,
tglx@linutronix.de, Zhao, Yan Y, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
On Thu, Jun 05, 2025 at 10:21:46PM +0000, Huang, Kai wrote:
> On Thu, 2025-06-05 at 16:01 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Fri, May 23, 2025 at 03:00:56PM +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Wed, May 14, 2025 at 12:00:17AM +0000, Huang, Kai wrote:
> > > > On Mon, 2025-05-12 at 12:55 +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, May 09, 2025 at 09:25:58AM +0800, Yan Zhao wrote:
> > > > > > On Thu, May 08, 2025 at 04:23:56PM +0300, Kirill A. Shutemov wrote:
> > > > > > > On Tue, May 06, 2025 at 07:55:17PM +0800, Yan Zhao wrote:
> > > > > > > > On Fri, May 02, 2025 at 04:08:24PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > The functions kvm_x86_ops::link_external_spt() and
> > > > > > > > > kvm_x86_ops::set_external_spte() are used to assign new memory to a VM.
> > > > > > > > > When using TDX with Dynamic PAMT enabled, the assigned memory must be
> > > > > > > > > covered by PAMT.
> > > > > > > > >
> > > > > > > > > The new function kvm_x86_ops::phys_prepare() is called before
> > > > > > > > > link_external_spt() and set_external_spte() to ensure that the memory is
> > > > > > > > > ready to be assigned to the virtual machine. In the case of TDX, it
> > > > > > > > > makes sure that the memory is covered by PAMT.
> > > > > > > > >
> > > > > > > > > kvm_x86_ops::phys_prepare() is called in a context where struct kvm_vcpu
> > > > > > > > > is available, allowing the implementation to allocate memory from a
> > > > > > > > > per-VCPU pool.
> > > > > > > > >
> > > > > > > > Why not invoke phys_prepare() and phys_cleanup() in set_external_spte_present()?
> > > > > > > > Or in tdx_sept_set_private_spte()/tdx_sept_link_private_spt()?
> > > > > > >
> > > > > > > Because the memory pool we allocated from is per-vcpu and we lost access
> > > > > > > to vcpu by then. And not all callers provide vcpu.
> > > > > > Maybe we can get vcpu via kvm_get_running_vcpu(), as in [1].
> > > > > > Then for callers not providing vcpu (where vcpu is NULL), we can use per-KVM
> > > > > > cache?
> > > > >
> > > > > Hm. I was not aware of kvm_get_running_vcpu(). Will play with it, thanks.
> > > >
> > > > I am not sure why per-vcpu cache matters.
> > > >
> > > > For non-leaf SEPT pages, AFAICT the "vcpu->arch.mmu_external_spt_cache" is just
> > > > an empty cache, and eventually __get_free_page() is used to allocate in:
> > > >
> > > > sp->external_spt =
> > > > kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> > > >
> > > > So why not we actually create a kmem_cache for it with an actual 'ctor', and we
> > > > can call tdx_alloc_page() in that. This makes sure when the "external_spt" is
> > > > allocated, the underneath PAMT entry is there.
> > >
> > > I looked closer at this, and while it is a good idea, the ctor in kmem_cache
> > > cannot fail, which makes this approach not viable.
> > >
> > > I guess we can add a constructor directly into struct kvm_mmu_memory_cache.
> > > Let me play with this.
> >
> > I failed to make it work.
> >
> > We would need a destructor paired with the constructor to do PAMT-aware
> > freeing, and all free paths would need to be redirected to it. That requires
> > substantial rework, and I don't think it is worth the effort.
> >
> > Will do manual PAMT management for SPT in TDX code.
>
> Thanks for the effort.
>
> Maybe something below?
With the help of kvm_get_running_vcpu(), I localized these manipulations to
the internals of the TDX code. No need to leak this into the TDP MMU.
phys_prepare()/phys_cleanup() are gone now.
https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=tdx/dpamt-huge&id=72394699b5454aac6c027accab6d94a52d88819b
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2025-06-06 10:20 UTC | newest]
Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-02 13:08 [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 01/12] x86/virt/tdx: Allocate page bitmap for " Kirill A. Shutemov
2025-05-05 10:08 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 02/12] x86/virt/tdx: Allocate reference counters for PAMT memory Kirill A. Shutemov
2025-05-05 11:05 ` Huang, Kai
2025-05-08 13:03 ` kirill.shutemov
2025-05-09 1:06 ` Huang, Kai
2025-05-12 9:53 ` kirill.shutemov
2025-05-13 23:24 ` Huang, Kai
2025-05-09 9:52 ` Chao Gao
2025-05-12 9:51 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 03/12] x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE Kirill A. Shutemov
2025-05-09 10:18 ` Chao Gao
2025-05-12 9:55 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 04/12] x86/virt/tdx: Account PAMT memory and print if in /proc/meminfo Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 05/12] KVM: TDX: Add tdx_pamt_get()/put() helpers Kirill A. Shutemov
2025-05-05 12:44 ` Huang, Kai
2025-05-07 1:01 ` Yan Zhao
2025-05-07 1:15 ` Vishal Annapurve
2025-05-07 2:42 ` Yan Zhao
2025-05-08 13:19 ` kirill.shutemov
2025-05-07 16:31 ` Dave Hansen
2025-05-08 2:08 ` Yan Zhao
2025-05-08 13:21 ` kirill.shutemov
2025-05-08 13:16 ` kirill.shutemov
2025-05-23 9:42 ` kirill.shutemov
2025-05-14 5:25 ` Chao Gao
2025-05-23 10:46 ` Kirill A. Shutemov
2025-05-14 5:33 ` Chao Gao
2025-05-14 6:25 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 06/12] KVM: TDX: Allocate PAMT memory in __tdx_td_init() Kirill A. Shutemov
2025-05-05 12:46 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 07/12] KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init() Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 08/12] KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops Kirill A. Shutemov
2025-05-06 11:55 ` Yan Zhao
2025-05-08 13:23 ` Kirill A. Shutemov
2025-05-09 1:25 ` Yan Zhao
2025-05-12 9:55 ` Kirill A. Shutemov
2025-05-14 0:00 ` Huang, Kai
2025-05-14 6:43 ` kirill.shutemov
2025-05-19 5:00 ` Huang, Kai
2025-05-23 12:00 ` kirill.shutemov
2025-06-05 13:01 ` kirill.shutemov
2025-06-05 22:21 ` Huang, Kai
2025-06-06 10:20 ` kirill.shutemov
2025-05-14 6:15 ` Chao Gao
2025-05-02 13:08 ` [RFC, PATCH 09/12] KVM: TDX: Preallocate PAMT pages to be used in page fault path Kirill A. Shutemov
2025-05-14 0:07 ` Huang, Kai
2025-05-14 6:30 ` Chao Gao
2025-05-30 10:28 ` Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 10/12] KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops Kirill A. Shutemov
2025-05-02 13:08 ` [RFC, PATCH 11/12] KVM: TDX: Reclaim PAMT memory Kirill A. Shutemov
2025-05-14 1:11 ` Huang, Kai
2025-05-14 15:21 ` Vishal Annapurve
2025-05-19 5:06 ` Huang, Kai
2025-05-02 13:08 ` [RFC, PATCH 12/12] x86/virt/tdx: Enable Dynamic PAMT Kirill A. Shutemov
2025-05-14 13:41 ` [RFC, PATCH 00/12] TDX: " Sean Christopherson
2025-05-15 14:22 ` Kirill A. Shutemov
2025-05-15 15:03 ` Dave Hansen
2025-05-15 16:02 ` Kirill A. Shutemov
2025-05-14 20:33 ` Zhi Wang
2025-05-15 9:17 ` Kirill A. Shutemov
2025-05-15 14:03 ` Dave Hansen