From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	pbonzini@redhat.com, seanjc@google.com,
	dave.hansen@linux.intel.com
Cc: rick.p.edgecombe@intel.com, isaku.yamahata@intel.com,
	kai.huang@intel.com, yan.y.zhao@intel.com, chao.gao@intel.com,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	kvm@vger.kernel.org, x86@kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCHv2 03/12] x86/virt/tdx: Allocate reference counters for PAMT memory
Date: Wed, 25 Jun 2025 12:26:09 -0700	[thread overview]
Message-ID: <fb5addcb-1cfc-45be-978c-e7cee4126b38@intel.com> (raw)
In-Reply-To: <20250609191340.2051741-4-kirill.shutemov@linux.intel.com>

On 6/9/25 12:13, Kirill A. Shutemov wrote:
> The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> with a page pair that covers 2M of host physical memory.
> 
> The kernel must provide this page pair before using pages from the range
> for TDX. If this is not done, any SEAMCALL that attempts to use the
> memory will fail.
> 
> Allocate reference counters for every 2M range to track PAMT memory
> usage. This is necessary to accurately determine when PAMT memory needs
> to be allocated and when it can be freed.
> 
> This allocation will consume 2MiB for every 1TiB of physical memory.

... and yes, this is another boot-time allocation that seems to run
counter to the goal of reducing the boot-time TDX memory footprint.

Please mention the 0.4%=>0.0004% overhead here in addition to the cover
letter. It's important.

> Tracking PAMT memory usage on the kernel side duplicates what TDX module
> does.  It is possible to avoid this by lazily allocating PAMT memory on
> SEAMCALL failure and freeing it based on hints provided by the TDX
> module when the last user of PAMT memory is no longer present.
> 
> However, this approach complicates serialization.
> 
> The TDX module takes locks when dealing with PAMT: a shared lock on any
> SEAMCALL that uses explicit HPA and an exclusive lock on PAMT.ADD and
> PAMT.REMOVE. Any SEAMCALL that uses explicit HPA as an operand may fail
> if it races with PAMT.ADD/REMOVE.
> 
> Since PAMT is a global resource, to prevent failure the kernel would
> need global locking (per-TD is not sufficient). Or, it has to retry on
> TDX_OPERATOR_BUSY.
> 
> Both options are not ideal, and tracking PAMT usage on the kernel side
> seems like a reasonable alternative.

Just a nit on changelog formatting: It would be ideal if you could make
it totally clear where you transition from "what this patch does" to
"alternative designs that were considered".

> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -29,6 +29,7 @@
>  #include <linux/acpi.h>
>  #include <linux/suspend.h>
>  #include <linux/idr.h>
> +#include <linux/vmalloc.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -50,6 +51,8 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
>  
>  static struct tdmr_info_list tdx_tdmr_list;
>  
> +static atomic_t *pamt_refcounts;

Comments, please. How big is this? When is it allocated?

In this case, it's even sparse, right? That's *SUPER* unusual for a
kernel data structure.
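
Something like this is what I'm looking for (just a sketch; the numbers
come from your changelog):

	/*
	 * One atomic_t per 2M range of physical address space.  Sized
	 * for max_pfn, but only backed with pages for memory that is
	 * actually handed to the TDX module (see init_pamt_metadata()
	 * and alloc_pamt_refcount()).  Costs ~2MiB per 1TiB of
	 * physical memory.
	 */
	static atomic_t *pamt_refcounts;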

>  static enum tdx_module_status_t tdx_module_status;
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> @@ -182,6 +185,102 @@ int tdx_cpu_enable(void)
>  }
>  EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>  
> +static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> +{
> +	return &pamt_refcounts[hpa / PMD_SIZE];
> +}

"get refcount" usually means "get a reference". This is looking up the
location of the refcount.

I think this needs a better name.
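
Maybe something like this (the name is just a suggestion):

	/* Look up (but don't take) the refcount for the 2M range containing @hpa: */
	static atomic_t *tdx_find_pamt_refcount(unsigned long hpa)
	{
		return &pamt_refcounts[hpa / PMD_SIZE];
	}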

> +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> +{

This is getting to be severely under-commented.

I also got this far into the patch, had forgotten about the sparse
allocation, and was scratching my head about what PTEs have to do with
dynamically allocating part of the PAMT.

That points to a pretty severe deficit in the cover letter, changelogs
and comments leading up to this point.

> +	unsigned long vaddr;
> +	pte_t entry;
> +
> +	if (!pte_none(ptep_get(pte)))
> +		return 0;

This ^ is an optimization, right? Could it be commented appropriately, please?
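
Something along these lines, maybe (a sketch; wording is yours to improve):

	/*
	 * Lockless check: skip the allocation if this refcount page is
	 * already populated.  Re-checked under init_mm.page_table_lock
	 * before the PTE is actually installed below.
	 */
	if (!pte_none(ptep_get(pte)))
		return 0;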

> +	vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!vaddr)
> +		return -ENOMEM;
> +
> +	entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	if (pte_none(ptep_get(pte)))
> +		set_pte_at(&init_mm, addr, pte, entry);
> +	else
> +		free_page(vaddr);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return 0;
> +}
> +
> +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> +				    void *data)
> +{
> +	unsigned long vaddr;
> +
> +	vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));

Gah, we really need a kpte_to_vaddr() helper here. This is really ugly.
How many of these are in the tree?
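
Something like this, for instance (untested sketch):

	/* Translate a kernel PTE for a direct-mapped page back to its vaddr: */
	static inline unsigned long kpte_to_vaddr(pte_t *pte)
	{
		return (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
	}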

> +	spin_lock(&init_mm.page_table_lock);
> +	if (!pte_none(ptep_get(pte))) {

Is there really a case where this gets called on unpopulated PTEs? How?

> +		pte_clear(&init_mm, addr, pte);
> +		free_page(vaddr);
> +	}
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return 0;
> +}
> +
> +static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned long start, end;
> +
> +	start = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> +	end = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> +	start = round_down(start, PAGE_SIZE);
> +	end = round_up(end, PAGE_SIZE);
> +

Please try to vertically align these:

	start = (...)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
	end   = (...)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
	start = round_down(start, PAGE_SIZE);
	end   = round_up(    end, PAGE_SIZE);

> +	return apply_to_page_range(&init_mm, start, end - start,
> +				   pamt_refcount_populate, NULL);
> +}

But, I've been staring at these for maybe 5 minutes. I think I've made
sense of it.

alloc_pamt_refcount() is taking a relatively arbitrary range of PFNs.
Those PFNs come from the memory map and NUMA layout, so they don't have
any real alignment guarantees.

This code translates the memory range into a range of virtual addresses
in the *virtual* refcount table. That table is sparse and might not be
allocated. It is populated 4k at a time, and since start_pfn/end_pfn
don't have any alignment guarantees, there's no telling which pages of
the refcount table they land on. This has to be conservative and round
'start' down and 'end' up. This might overlap with previous refcount
table populations.

Is that all correct?

That seems ... medium to high complexity to me. Is there some reason
none of it is documented or commented? Like, I think it's not been
mentioned a single time anywhere.
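
Even a comment along these lines on alloc_pamt_refcount() would go a
long way (a sketch, based on my reading above):

	/*
	 * Populate the refcount table for an arbitrary, possibly
	 * unaligned pfn range.  The table is virtually contiguous but
	 * only backed one 4k page at a time, so round the covered range
	 * of table addresses out to page boundaries.  The rounded range
	 * may overlap pages that an earlier call already populated;
	 * pamt_refcount_populate() skips those.
	 */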

> +static int init_pamt_metadata(void)
> +{
> +	size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> +	struct vm_struct *area;
> +
> +	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> +		return 0;
> +
> +	/*
> +	 * Reserve vmalloc range for PAMT reference counters. It covers all
> +	 * physical address space up to max_pfn. It is going to be populated
> +	 * from init_tdmr() only for present memory that available for TDX use.
> +	 */
> +	area = get_vm_area(size, VM_IOREMAP);
> +	if (!area)
> +		return -ENOMEM;
> +
> +	pamt_refcounts = area->addr;
> +	return 0;
> +}

Finally, we get to a description of what's actually going on. But
still, nothing has directly told me why this is necessary.

If it were me, I'd probably split this up into two patches. The first
would just do:

	area = vmalloc(size);

The second would do all the fancy sparse population.
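
i.e. the first patch would look something like this (sketch; vzalloc()
to keep the zeroing that __GFP_ZERO provides today):

	static int init_pamt_metadata(void)
	{
		size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return 0;

		/* Simple, fully-backed allocation; sparse backing comes later: */
		pamt_refcounts = vzalloc(size);
		if (!pamt_refcounts)
			return -ENOMEM;

		return 0;
	}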

But either way, I've hit a wall on this. This is too impenetrable as it
stands to review further. I'll eagerly await a more approachable v3.
