public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes
@ 2024-11-13 11:57 Kai Huang
  2024-11-13 11:57 ` [PATCH v8 1/9] x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better Kai Huang
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

This series replaces the existing TDX module metadata reading code with
a new auto-generated global metadata infrastructure to:

1) address two issues in the current TDX module initialization code, and
2) have an extendable infrastructure which is super easy to read more
   metadata and share with KVM for KVM TDX support (and other kernel
   components for TDX Connect in the future).

And the reason that we need a new global metadata infrastructure is the
current one can only read TDMR related metadata fields and it is not
extendable to read more metadata fields, which is required to address
both 1) and 2) above.

Specifically, below two issues in the current module initialization code
need to be addressed:

1) Module initialization may fail on some large systems (e.g., with 4 or
   more sockets) [1].
2) Some old modules can clobber host's RBP when existing from the TDX
   guest, and currently they can be initialized successfully.  We don't
   want to use such modules thus we should just fail to initialize them
   to avoid memory/cpu cycle cost of initializing TDX module [2].

The first 6 patches introduce the new auto-generated global metadata
infrastructure (which is auto-generated using a script [3]), and the
rest patches address the above two issues.

Hi Dave,

This series targets x86 tip.  This is also a pre-work of the "quite near
future" KVM TDX support.  I appreciate if you can review, comment and
take this series if the patches look good to you.

The script used to auto-generate the metadata reading code in patch 3
can be found in [3].

Also cc Dan for TDX Connect, and cc Paolo/Sean for KVM TDX (but I
removed KVM list since this series doesn't touch KVM code).

History:

v7 -> v8:
 - Address Dave's comments to remove the code to print module version
   and CMRs:
   - Remove the code which reads module version in the auto-generated
     code.
   - Remove the patch which prints module version (patch 10 in v7)
   - Remove the code which prints CMRs in patch 7.
   - Update the changelog of some patches that mentioned "reading module
     version" and "print CMRs".
 - Collect Nikolay's tag.

Previous versions and more background info please see:

 - https://lore.kernel.org/kvm/6ab90fd332bccdec7b64e5909cb4637732d6bb01.1731318868.git.kai.huang@intel.com/T/

[1]: https://github.com/canonical/tdx/issues/135
[2]: https://lore.kernel.org/fc0e8ab7-86d4-4428-be31-82e1ece6dd21@intel.com/
[3]: https://lore.kernel.org/d5aed06ae4b46df5db97fdbac9c01843920a2f96.camel@intel.com/


Kai Huang (8):
  x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec
    better
  x86/virt/tdx: Start to track all global metadata in one structure
  x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
  x86/virt/tdx: Add missing header file inclusion to local tdx.h
  x86/virt/tdx: Switch to use auto-generated global metadata reading
    code
  x86/virt/tdx: Trim away tail null CMRs
  x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find
    memory holes
  x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD
    mitigation

Paolo Bonzini (1):
  x86/virt/tdx: Use auto-generated code to read global metadata

 arch/x86/virt/vmx/tdx/tdx.c                 | 146 +++++++++++---------
 arch/x86/virt/vmx/tdx/tdx.h                 |  43 +-----
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |  67 +++++++++
 arch/x86/virt/vmx/tdx/tdx_global_metadata.h |  32 +++++
 4 files changed, 183 insertions(+), 105 deletions(-)
 create mode 100644 arch/x86/virt/vmx/tdx/tdx_global_metadata.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx_global_metadata.h


base-commit: 7ae15e2f69bad06527668b478dff7c099ad2e6ae
-- 
2.46.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v8 1/9] x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 2/9] x86/virt/tdx: Start to track all global metadata in one structure Kai Huang
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

TDX organizes those metadata fields by "Classes" based on the meaning of
those fields.  E.g., for now the kernel only reads "TD Memory Region"
(TDMR) related fields for module initialization.  Those fields are
defined under class "TDMR Info".

There are both immediate needs to read more metadata fields for module
initialization and near-future needs for other kernel components like
KVM to run TDX guests.  To meet all those requirements, the idea is the
TDX host core-kernel to provide a centralized, canonical, and read-only
structure for the global metadata that comes out from the TDX module for
all kernel components to use.

More specifically, the target is to end up with something like:

       struct tdx_sys_info {
	       struct tdx_sys_info_classA a;
	       struct tdx_sys_info_classB b;
	       ...
       };

Currently the kernel organizes all fields under "TDMR Info" class in
'struct tdx_tdmr_sysinfo'.  To prepare for the above target, rename the
structure to 'struct tdx_sys_info_tdmr' to follow the class name better.

No functional change intended.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 36 ++++++++++++++++++------------------
 arch/x86/virt/vmx/tdx/tdx.h |  2 +-
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4e2b2e2ac9f9..e979bf442929 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -272,7 +272,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
 
 static int read_sys_metadata_field16(u64 field_id,
 				     int offset,
-				     struct tdx_tdmr_sysinfo *ts)
+				     struct tdx_sys_info_tdmr *ts)
 {
 	u16 *ts_member = ((void *)ts) + offset;
 	u64 tmp;
@@ -298,9 +298,9 @@ struct field_mapping {
 
 #define TD_SYSINFO_MAP(_field_id, _offset) \
 	{ .field_id = MD_FIELD_ID_##_field_id,	   \
-	  .offset   = offsetof(struct tdx_tdmr_sysinfo, _offset) }
+	  .offset   = offsetof(struct tdx_sys_info_tdmr, _offset) }
 
-/* Map TD_SYSINFO fields into 'struct tdx_tdmr_sysinfo': */
+/* Map TD_SYSINFO fields into 'struct tdx_sys_info_tdmr': */
 static const struct field_mapping fields[] = {
 	TD_SYSINFO_MAP(MAX_TDMRS,	      max_tdmrs),
 	TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
@@ -309,16 +309,16 @@ static const struct field_mapping fields[] = {
 	TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE,    pamt_entry_size[TDX_PS_1G]),
 };
 
-static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
+static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
 	int ret;
 	int i;
 
-	/* Populate 'tdmr_sysinfo' fields using the mapping structure above: */
+	/* Populate 'sysinfo_tdmr' fields using the mapping structure above: */
 	for (i = 0; i < ARRAY_SIZE(fields); i++) {
 		ret = read_sys_metadata_field16(fields[i].field_id,
 						fields[i].offset,
-						tdmr_sysinfo);
+						sysinfo_tdmr);
 		if (ret)
 			return ret;
 	}
@@ -342,13 +342,13 @@ static int tdmr_size_single(u16 max_reserved_per_tdmr)
 }
 
 static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
-			   struct tdx_tdmr_sysinfo *tdmr_sysinfo)
+			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
 	size_t tdmr_sz, tdmr_array_sz;
 	void *tdmr_array;
 
-	tdmr_sz = tdmr_size_single(tdmr_sysinfo->max_reserved_per_tdmr);
-	tdmr_array_sz = tdmr_sz * tdmr_sysinfo->max_tdmrs;
+	tdmr_sz = tdmr_size_single(sysinfo_tdmr->max_reserved_per_tdmr);
+	tdmr_array_sz = tdmr_sz * sysinfo_tdmr->max_tdmrs;
 
 	/*
 	 * To keep things simple, allocate all TDMRs together.
@@ -367,7 +367,7 @@ static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
 	 * at a given index in the TDMR list.
 	 */
 	tdmr_list->tdmr_sz = tdmr_sz;
-	tdmr_list->max_tdmrs = tdmr_sysinfo->max_tdmrs;
+	tdmr_list->max_tdmrs = sysinfo_tdmr->max_tdmrs;
 	tdmr_list->nr_consumed_tdmrs = 0;
 
 	return 0;
@@ -921,11 +921,11 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
- * TDMR global information in @tdmr_sysinfo.
+ * TDMR global information in @sysinfo_tdmr.
  */
 static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
-			   struct tdx_tdmr_sysinfo *tdmr_sysinfo)
+			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
 	int ret;
 
@@ -934,12 +934,12 @@ static int construct_tdmrs(struct list_head *tmb_list,
 		return ret;
 
 	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
-			tdmr_sysinfo->pamt_entry_size);
+			sysinfo_tdmr->pamt_entry_size);
 	if (ret)
 		return ret;
 
 	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
-			tdmr_sysinfo->max_reserved_per_tdmr);
+			sysinfo_tdmr->max_reserved_per_tdmr);
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_list);
 
@@ -1098,7 +1098,7 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
 
 static int init_tdx_module(void)
 {
-	struct tdx_tdmr_sysinfo tdmr_sysinfo;
+	struct tdx_sys_info_tdmr sysinfo_tdmr;
 	int ret;
 
 	/*
@@ -1117,17 +1117,17 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_put_tdxmem;
 
-	ret = get_tdx_tdmr_sysinfo(&tdmr_sysinfo);
+	ret = get_tdx_sys_info_tdmr(&sysinfo_tdmr);
 	if (ret)
 		goto err_free_tdxmem;
 
 	/* Allocate enough space for constructing TDMRs */
-	ret = alloc_tdmr_list(&tdx_tdmr_list, &tdmr_sysinfo);
+	ret = alloc_tdmr_list(&tdx_tdmr_list, &sysinfo_tdmr);
 	if (ret)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &tdmr_sysinfo);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo_tdmr);
 	if (ret)
 		goto err_free_tdmrs;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b701f69485d3..148f9b4d1140 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -100,7 +100,7 @@ struct tdx_memblock {
 };
 
 /* "TDMR info" part of "Global Scope Metadata" for constructing TDMRs */
-struct tdx_tdmr_sysinfo {
+struct tdx_sys_info_tdmr {
 	u16 max_tdmrs;
 	u16 max_reserved_per_tdmr;
 	u16 pamt_entry_size[TDX_PS_NR];
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 2/9] x86/virt/tdx: Start to track all global metadata in one structure
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
  2024-11-13 11:57 ` [PATCH v8 1/9] x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata Kai Huang
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

Currently the kernel only reads "TD Memory Region" (TDMR) related fields
for module initialization.  There are both immediate needs to read more
metadata fields for module initialization and near-future needs for
other kernel components like KVM to run TDX guests.

To meet all those requirements, the idea is the TDX host core-kernel to
to provide a centralized, canonical, and read-only structure for the
global metadata that comes out from the TDX module for all kernel
components to use.

As the first step, introduce a new 'struct tdx_sys_info' to track all
global metadata fields.

TDX categories global metadata fields into different "Classes".  E.g.,
the TDMR related fields are under class "TDMR Info".  Instead of making
'struct tdx_sys_info' a plain structure to contain all metadata fields,
organize them in smaller structures based on the "Class".

This allows those metadata fields to be used in finer granularity thus
makes the code more clear.  E.g., the construct_tdmr() can just take the
structure which contains "TDMR Info" metadata fields.

Add a new function get_tdx_sys_info() as the placeholder to read all
metadata fields, and call it at the beginning of init_tdx_module().  For
now it only calls get_tdx_sys_info_tdmr() to read TDMR related fields.

Note there is a functional change: get_tdx_sys_info_tdmr() is moved from
after build_tdx_memlist() to before it, but it is fine to do so.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 19 ++++++++++++-------
 arch/x86/virt/vmx/tdx/tdx.h | 19 ++++++++++++-------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e979bf442929..7a2f979092e7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -326,6 +326,11 @@ static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
 	return 0;
 }
 
+static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
+{
+	return get_tdx_sys_info_tdmr(&sysinfo->tdmr);
+}
+
 /* Calculate the actual TDMR size */
 static int tdmr_size_single(u16 max_reserved_per_tdmr)
 {
@@ -1098,9 +1103,13 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
 
 static int init_tdx_module(void)
 {
-	struct tdx_sys_info_tdmr sysinfo_tdmr;
+	struct tdx_sys_info sysinfo;
 	int ret;
 
+	ret = get_tdx_sys_info(&sysinfo);
+	if (ret)
+		return ret;
+
 	/*
 	 * To keep things simple, assume that all TDX-protected memory
 	 * will come from the page allocator.  Make sure all pages in the
@@ -1117,17 +1126,13 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_put_tdxmem;
 
-	ret = get_tdx_sys_info_tdmr(&sysinfo_tdmr);
-	if (ret)
-		goto err_free_tdxmem;
-
 	/* Allocate enough space for constructing TDMRs */
-	ret = alloc_tdmr_list(&tdx_tdmr_list, &sysinfo_tdmr);
+	ret = alloc_tdmr_list(&tdx_tdmr_list, &sysinfo.tdmr);
 	if (ret)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo_tdmr);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
 	if (ret)
 		goto err_free_tdmrs;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 148f9b4d1140..2600ec3752f5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -80,6 +80,18 @@ struct tdmr_info {
 	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
 } __packed __aligned(TDMR_INFO_ALIGNMENT);
 
+/* Class "TDMR info" */
+struct tdx_sys_info_tdmr {
+	u16 max_tdmrs;
+	u16 max_reserved_per_tdmr;
+	u16 pamt_entry_size[TDX_PS_NR];
+};
+
+/* Kernel used global metadata fields */
+struct tdx_sys_info {
+	struct tdx_sys_info_tdmr tdmr;
+};
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
@@ -99,13 +111,6 @@ struct tdx_memblock {
 	int nid;
 };
 
-/* "TDMR info" part of "Global Scope Metadata" for constructing TDMRs */
-struct tdx_sys_info_tdmr {
-	u16 max_tdmrs;
-	u16 max_reserved_per_tdmr;
-	u16 pamt_entry_size[TDX_PS_NR];
-};
-
 /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
 #define TDMR_NR_WARN 4
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
  2024-11-13 11:57 ` [PATCH v8 1/9] x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better Kai Huang
  2024-11-13 11:57 ` [PATCH v8 2/9] x86/virt/tdx: Start to track all global metadata in one structure Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-12-13 11:17   ` Huang, Kai
  2024-11-13 11:57 ` [PATCH v8 4/9] x86/virt/tdx: Use dedicated struct members for PAMT entry sizes Kai Huang
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

From: Paolo Bonzini <pbonzini@redhat.com>

The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

Currently the kernel only reads "TD Memory Region" (TDMR) related fields
for module initialization.  There are needs to read more global metadata
fields to address two issues in the current module initialization code:

 - "Convertible Memory Regions" (CMRs) to fix a potential module
   initialization failure on some large systems [1].
 - Supported features ("TDX_FEATURES0") to fail module initialization
   when the module doesn't support "not clobbering host RBP when exiting
   from TDX guest" feature [2].

Future changes to support KVM TDX and other features like TDX Connect
will need to read more.

The current global metadata reading code has limitations (e.g., it only
has a primitive helper to read metadata field with 16-bit element size,
while TDX supports 8/16/32/64 bits metadata element sizes).  It needs
tweaks in order to read more metadata fields.

But even with the tweaks, when new code is added to read a new field,
the reviewers will still need to review against the spec to make sure
the new code doesn't screw up things like using the wrong metadata
field ID (each metadata field is associated with a unique field ID,
which is a TDX-defined u64 constant) etc.

TDX documents all global metadata fields in a 'global_metadata.json'
file as part of TDX spec [3].  JSON format is machine readable.  Instead
of tweaking the metadata reading code, use a script to generate the code
so that:

  1) Using the generated C is simple.
  2) Adding a field is simple, e.g., the script just pulls the field ID
     out of the JSON for a given field thus no manual review is needed.

Specifically, to match the layout of the 'struct tdx_sys_info' and its
sub-structures, the script uses a table with each entry containing the
the name of the sub-structures (which reflects the "Class") and the
"Field Name" of all its fields, and auto-generate:

  1) The 'struct tdx_sys_info' and all 'struct tdx_sys_info_xx'
     sub-structures in 'tdx_global_metadata.h'

  2) The main function 'get_tdx_sys_info()' which reads all metadata to
     'struct tdx_sys_info' and the 'get_tdx_sys_info_xx()' functions
     which read 'struct tdx_sys_info_xx()' in 'tdx_global_metadata.c'.

Using the generated C is simple: 1) include "tdx_global_metadata.h" to
the local "tdx.h"; 2) explicitly include "tdx_global_metadata.c" to the
local "tdx.c" after the read_sys_metadata_field() primitive (which is a
wrapper of TDH.SYS.RD SEAMCALL to read global metadata).

Adding a field is also simple: 1) just add the new field to an existing
structure, or add it with a new structure; 2) re-run the script to
generate the new code; 3) update the existing tdx_global_metadata.{hc}
with the new ones.

For now, use the auto-generated code to read the TDMR related fields and
the aforesaid metadata fields: "TDX_FEATURES0" and CMRs.

Reading CMRs is more complicated than reading a simple field, since
there are two arrays containing the "CMR_BASE" and "CMR_SIZE" for each
CMR respectively.

TDX spec [3] section "Metadata Access Interface", sub-section "Arrays of
Metadata Fields" defines the way to read metadata fields in an array.
There's a "Base field ID" (say, X) for the array and the field ID for
entry array[i] is X + i.

For CMRs, the field "NUM_CMRS" reports the number of CMR entries that
can be read, and the code needs to use the value reported via "NUM_CMRS"
to loop despite the JSON file says the "Num Fields" of both "CMR_BASE"
and "CMR_SIZE" are 32.

The tdx_global_metadata.{hc} can be generated by running below:

 #python tdx_global_metadata.py global_metadata.json \
	tdx_global_metadata.h tdx_global_metadata.c

.. where the 'tdx_global_metadata.py' can be found in [4] and the
'global_metadata.json' can be fetched from [3].

Link: https://github.com/canonical/tdx/issues/135 [1]
Link: https://lore.kernel.org/fc0e8ab7-86d4-4428-be31-82e1ece6dd21@intel.com/ [2]
Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [3]
Link: https://lore.kernel.org/d5aed06ae4b46df5db97fdbac9c01843920a2f96.camel@intel.com/ [4]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 67 +++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx_global_metadata.h | 32 ++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 arch/x86/virt/vmx/tdx/tdx_global_metadata.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx_global_metadata.h

diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
new file mode 100644
index 000000000000..42ad62762b14
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Automatically generated functions to read TDX global metadata.
+ *
+ * This file doesn't compile on its own as it lacks of inclusion
+ * of SEAMCALL wrapper primitive which reads global metadata.
+ * Include this file to other C file instead.
+ */
+
+static int get_tdx_sys_info_features(struct tdx_sys_info_features *sysinfo_features)
+{
+	int ret = 0;
+	u64 val;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x0A00000300000008, &val)))
+		sysinfo_features->tdx_features0 = val;
+
+	return ret;
+}
+
+static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
+{
+	int ret = 0;
+	u64 val;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x9100000100000008, &val)))
+		sysinfo_tdmr->max_tdmrs = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9100000100000009, &val)))
+		sysinfo_tdmr->max_reserved_per_tdmr = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9100000100000010, &val)))
+		sysinfo_tdmr->pamt_4k_entry_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9100000100000011, &val)))
+		sysinfo_tdmr->pamt_2m_entry_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9100000100000012, &val)))
+		sysinfo_tdmr->pamt_1g_entry_size = val;
+
+	return ret;
+}
+
+static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+	int ret = 0;
+	u64 val;
+	int i;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x9000000100000000, &val)))
+		sysinfo_cmr->num_cmrs = val;
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++)
+		if (!ret && !(ret = read_sys_metadata_field(0x9000000300000080 + i, &val)))
+			sysinfo_cmr->cmr_base[i] = val;
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++)
+		if (!ret && !(ret = read_sys_metadata_field(0x9000000300000100 + i, &val)))
+			sysinfo_cmr->cmr_size[i] = val;
+
+	return ret;
+}
+
+static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
+{
+	int ret = 0;
+
+	ret = ret ?: get_tdx_sys_info_features(&sysinfo->features);
+	ret = ret ?: get_tdx_sys_info_tdmr(&sysinfo->tdmr);
+	ret = ret ?: get_tdx_sys_info_cmr(&sysinfo->cmr);
+
+	return ret;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.h b/arch/x86/virt/vmx/tdx/tdx_global_metadata.h
new file mode 100644
index 000000000000..1fe4b513021f
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Automatically generated TDX global metadata structures. */
+#ifndef _X86_VIRT_TDX_AUTO_GENERATED_TDX_GLOBAL_METADATA_H
+#define _X86_VIRT_TDX_AUTO_GENERATED_TDX_GLOBAL_METADATA_H
+
+#include <linux/types.h>
+
+struct tdx_sys_info_features {
+	u64 tdx_features0;
+};
+
+struct tdx_sys_info_tdmr {
+	u16 max_tdmrs;
+	u16 max_reserved_per_tdmr;
+	u16 pamt_4k_entry_size;
+	u16 pamt_2m_entry_size;
+	u16 pamt_1g_entry_size;
+};
+
+struct tdx_sys_info_cmr {
+	u16 num_cmrs;
+	u64 cmr_base[32];
+	u64 cmr_size[32];
+};
+
+struct tdx_sys_info {
+	struct tdx_sys_info_features features;
+	struct tdx_sys_info_tdmr tdmr;
+	struct tdx_sys_info_cmr cmr;
+};
+
+#endif
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 4/9] x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (2 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 5/9] x86/virt/tdx: Add missing header file inclusion to local tdx.h Kai Huang
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

Currently, the 'struct tdmr_sys_info_tdmr' which includes TDMR related
fields defines the PAMT entry sizes for TDX supported page sizes (4KB,
2MB and 1GB) as an array:

	struct tdx_sys_info_tdmr {
		...
		u16 pamt_entry_sizes[TDX_PS_NR];
	};

PAMT entry sizes are needed when allocating PAMTs for each TDMR.  Using
the array to contain PAMT entry sizes reduces the number of arguments
that need to be passed when calling tdmr_set_up_pamt().  It also makes
the code pattern like below clearer:

	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
					pamt_entry_size[pgsz]);
		tdmr_pamt_size += pamt_size[pgsz];
	}

However, the auto-generated metadata reading code generates a structure
member for each field.  The 'global_metadata.json' has a dedicated field
for each PAMT entry size, and the new 'struct tdx_sys_info_tdmr' looks
like:

	struct tdx_sys_info_tdmr {
		...
		u16 pamt_4k_entry_size;
		u16 pamt_2m_entry_size;
		u16 pamt_1g_entry_size;
	};

To prepare to use the auto-generated code, make the existing 'struct
tdx_sys_info_tdmr' look like the generated one.  But when passing to
tdmrs_set_up_pamt_all(), build a local array of PAMT entry sizes from
the structure so the code to allocate PAMTs can stay the same.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 14 +++++++++-----
 arch/x86/virt/vmx/tdx/tdx.h |  4 +++-
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7a2f979092e7..28537a6c47fc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -304,9 +304,9 @@ struct field_mapping {
 static const struct field_mapping fields[] = {
 	TD_SYSINFO_MAP(MAX_TDMRS,	      max_tdmrs),
 	TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
-	TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE,    pamt_entry_size[TDX_PS_4K]),
-	TD_SYSINFO_MAP(PAMT_2M_ENTRY_SIZE,    pamt_entry_size[TDX_PS_2M]),
-	TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE,    pamt_entry_size[TDX_PS_1G]),
+	TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE,    pamt_4k_entry_size),
+	TD_SYSINFO_MAP(PAMT_2M_ENTRY_SIZE,    pamt_2m_entry_size),
+	TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE,    pamt_1g_entry_size),
 };
 
 static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
@@ -932,14 +932,18 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
 			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
 {
+	u16 pamt_entry_size[TDX_PS_NR] = {
+		sysinfo_tdmr->pamt_4k_entry_size,
+		sysinfo_tdmr->pamt_2m_entry_size,
+		sysinfo_tdmr->pamt_1g_entry_size,
+	};
 	int ret;
 
 	ret = fill_out_tdmrs(tmb_list, tdmr_list);
 	if (ret)
 		return ret;
 
-	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
-			sysinfo_tdmr->pamt_entry_size);
+	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, pamt_entry_size);
 	if (ret)
 		return ret;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2600ec3752f5..ec879d54eb5c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -84,7 +84,9 @@ struct tdmr_info {
 struct tdx_sys_info_tdmr {
 	u16 max_tdmrs;
 	u16 max_reserved_per_tdmr;
-	u16 pamt_entry_size[TDX_PS_NR];
+	u16 pamt_4k_entry_size;
+	u16 pamt_2m_entry_size;
+	u16 pamt_1g_entry_size;
 };
 
 /* Kernel used global metadata fields */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 5/9] x86/virt/tdx: Add missing header file inclusion to local tdx.h
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (3 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 4/9] x86/virt/tdx: Use dedicated struct members for PAMT entry sizes Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 6/9] x86/virt/tdx: Switch to use auto-generated global metadata reading code Kai Huang
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

Compiler attributes __packed and __aligned, and DECLARE_FLEX_ARRAY() are
currently used in arch/x86/virt/vmx/tdx/tdx.h, but the relevant headers
are not included explicitly.

There's no build issue in the current code since this "tdx.h" is only
included by arch/x86/virt/vmx/tdx/tdx.c and it includes bunch of other
<linux/xxx.h> before including "tdx.h".  But for the better explicitly
include the relevant headers to "tdx.h".  Also include <linux/types.h>
for basic variable types like u16.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index ec879d54eb5c..b1d705c3ab2a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,6 +2,9 @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H
 
+#include <linux/types.h>
+#include <linux/compiler_attributes.h>
+#include <linux/stddef.h>
 #include <linux/bits.h>
 
 /*
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 6/9] x86/virt/tdx: Switch to use auto-generated global metadata reading code
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (4 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 5/9] x86/virt/tdx: Add missing header file inclusion to local tdx.h Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 7/9] x86/virt/tdx: Trim away tail null CMRs Kai Huang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

Now the caller to read global metadata has been tweaked to be ready to
use auto-generated metadata reading code.  Switch to use it.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 61 +------------------------------------
 arch/x86/virt/vmx/tdx/tdx.h | 45 +--------------------------
 2 files changed, 2 insertions(+), 104 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 28537a6c47fc..43ec56db5084 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -270,66 +270,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
 	return 0;
 }
 
-static int read_sys_metadata_field16(u64 field_id,
-				     int offset,
-				     struct tdx_sys_info_tdmr *ts)
-{
-	u16 *ts_member = ((void *)ts) + offset;
-	u64 tmp;
-	int ret;
-
-	if (WARN_ON_ONCE(MD_FIELD_ID_ELE_SIZE_CODE(field_id) !=
-			MD_FIELD_ID_ELE_SIZE_16BIT))
-		return -EINVAL;
-
-	ret = read_sys_metadata_field(field_id, &tmp);
-	if (ret)
-		return ret;
-
-	*ts_member = tmp;
-
-	return 0;
-}
-
-struct field_mapping {
-	u64 field_id;
-	int offset;
-};
-
-#define TD_SYSINFO_MAP(_field_id, _offset) \
-	{ .field_id = MD_FIELD_ID_##_field_id,	   \
-	  .offset   = offsetof(struct tdx_sys_info_tdmr, _offset) }
-
-/* Map TD_SYSINFO fields into 'struct tdx_sys_info_tdmr': */
-static const struct field_mapping fields[] = {
-	TD_SYSINFO_MAP(MAX_TDMRS,	      max_tdmrs),
-	TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
-	TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE,    pamt_4k_entry_size),
-	TD_SYSINFO_MAP(PAMT_2M_ENTRY_SIZE,    pamt_2m_entry_size),
-	TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE,    pamt_1g_entry_size),
-};
-
-static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
-{
-	int ret;
-	int i;
-
-	/* Populate 'sysinfo_tdmr' fields using the mapping structure above: */
-	for (i = 0; i < ARRAY_SIZE(fields); i++) {
-		ret = read_sys_metadata_field16(fields[i].field_id,
-						fields[i].offset,
-						sysinfo_tdmr);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
-
-static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
-{
-	return get_tdx_sys_info_tdmr(&sysinfo->tdmr);
-}
+#include "tdx_global_metadata.c"
 
 /* Calculate the actual TDMR size */
 static int tdmr_size_single(u16 max_reserved_per_tdmr)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b1d705c3ab2a..0128b963b723 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -5,7 +5,7 @@
 #include <linux/types.h>
 #include <linux/compiler_attributes.h>
 #include <linux/stddef.h>
-#include <linux/bits.h>
+#include "tdx_global_metadata.h"
 
 /*
  * This file contains both macros and data structures defined by the TDX
@@ -29,35 +29,6 @@
 #define	PT_NDA		0x0
 #define	PT_RSVD		0x1
 
-/*
- * Global scope metadata field ID.
- *
- * See Table "Global Scope Metadata", TDX module 1.5 ABI spec.
- */
-#define MD_FIELD_ID_MAX_TDMRS			0x9100000100000008ULL
-#define MD_FIELD_ID_MAX_RESERVED_PER_TDMR	0x9100000100000009ULL
-#define MD_FIELD_ID_PAMT_4K_ENTRY_SIZE		0x9100000100000010ULL
-#define MD_FIELD_ID_PAMT_2M_ENTRY_SIZE		0x9100000100000011ULL
-#define MD_FIELD_ID_PAMT_1G_ENTRY_SIZE		0x9100000100000012ULL
-
-/*
- * Sub-field definition of metadata field ID.
- *
- * See Table "MD_FIELD_ID (Metadata Field Identifier / Sequence Header)
- * Definition", TDX module 1.5 ABI spec.
- *
- *  - Bit 33:32: ELEMENT_SIZE_CODE -- size of a single element of metadata
- *
- *	0: 8 bits
- *	1: 16 bits
- *	2: 32 bits
- *	3: 64 bits
- */
-#define MD_FIELD_ID_ELE_SIZE_CODE(_field_id)	\
-		(((_field_id) & GENMASK_ULL(33, 32)) >> 32)
-
-#define MD_FIELD_ID_ELE_SIZE_16BIT	1
-
 struct tdmr_reserved_area {
 	u64 offset;
 	u64 size;
@@ -83,20 +54,6 @@ struct tdmr_info {
 	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
 } __packed __aligned(TDMR_INFO_ALIGNMENT);
 
-/* Class "TDMR info" */
-struct tdx_sys_info_tdmr {
-	u16 max_tdmrs;
-	u16 max_reserved_per_tdmr;
-	u16 pamt_4k_entry_size;
-	u16 pamt_2m_entry_size;
-	u16 pamt_1g_entry_size;
-};
-
-/* Kernel used global metadata fields */
-struct tdx_sys_info {
-	struct tdx_sys_info_tdmr tdmr;
-};
-
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 7/9] x86/virt/tdx: Trim away tail null CMRs
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (5 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 6/9] x86/virt/tdx: Switch to use auto-generated global metadata reading code Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

TDX architecturally supports up to 32 CMRs.  The global metadata field
"NUM_CMRS" reports the number of CMR entries that can be read by the
kernel.  However, that field may just report the maximum number of CMRs
albeit the actual number of CMRs is smaller, in which case there are
tail null CMRs (size is 0).

A future fix to a module initialization failure issue will need to loop
over all CMRs.  Trim away those null CMRs once for all here so that the
kernel doesn't need to explicitly skip them each time when it uses CMRs
at later times.

More information about CMR can be found at "Intel TDX ISA Background:
Convertible Memory Ranges (CMRs)" in TDX 1.5 base spec [1], and
"CMR_INFO" in TDX 1.5 ABI spec [2].

Now get_tdx_sys_info() just reads kernel-needed global metadata to
kernel structure, and it is auto-generated.  Add a wrapper function
init_tdx_sys_info() to invoke get_tdx_sys_info() and provide room to do
additional things like dealing with CMRs.

Link: https://cdrdv2.intel.com/v1/dl/getContent/733575 [1]
Link: https://cdrdv2.intel.com/v1/dl/getContent/733579 [2]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 43ec56db5084..5e6d8021681d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -272,6 +272,46 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
 
 #include "tdx_global_metadata.c"
 
+/* Update the @sysinfo_cmr->num_cmrs to trim tail null CMRs */
+static void trim_null_tail_cmrs(struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+	int i;
+
+	/*
+	 * The TDX module may report the maximum number of CMRs that
+	 * TDX architecturally supports as the actual number of CMRs,
+	 * despite the latter is smaller.  In this case some tail
+	 * CMR(s) will be null (size is 0).  Trim them away.
+	 *
+	 * Note the CMRs are generated by the BIOS, but the MCHECK
+	 * verifies CMRs before enabling TDX on hardware.  Skip other
+	 * sanity checks (e.g., verify CMR is 4KB aligned) but trust
+	 * MCHECK to work properly.
+	 *
+	 * The spec doesn't say whether it's legal to have null CMRs
+	 * in the middle of valid CMRs.  For now assume no sane BIOS
+	 * would do that (even MCHECK allows).
+	 */
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++)
+		if (!sysinfo_cmr->cmr_size[i])
+			break;
+
+	sysinfo_cmr->num_cmrs = i;
+}
+
+static int init_tdx_sys_info(struct tdx_sys_info *sysinfo)
+{
+	int ret;
+
+	ret = get_tdx_sys_info(sysinfo);
+	if (ret)
+		return ret;
+
+	trim_null_tail_cmrs(&sysinfo->cmr);
+
+	return 0;
+}
+
 /* Calculate the actual TDMR size */
 static int tdmr_size_single(u16 max_reserved_per_tdmr)
 {
@@ -1051,7 +1091,7 @@ static int init_tdx_module(void)
 	struct tdx_sys_info sysinfo;
 	int ret;
 
-	ret = get_tdx_sys_info(&sysinfo);
+	ret = init_tdx_sys_info(&sysinfo);
 	if (ret)
 		return ret;
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (6 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 7/9] x86/virt/tdx: Trim away tail null CMRs Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-12-04 14:22   ` Huang, Kai
                     ` (2 more replies)
  2024-11-13 11:57 ` [PATCH v8 9/9] x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation Kai Huang
  2024-11-13 22:25 ` [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Edgecombe, Rick P
  9 siblings, 3 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

A TDX module initialization failure was reported on a Emerald Rapids
platform [*]:

  virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
  virt/tdx: module initialization failed (-28)

As part of initializing the TDX module, the kernel informs the TDX
module of all "TDX-usable memory regions" using an array of TDX defined
structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB aligned
and in 1GB granularity, and all "non-TDX-usable memory holes" within a
given TDMR are marked as "reserved areas".  The TDX module reports a
maximum number of reserved areas that can be supported per TDMR (16).

The kernel builds the "TDX-usable memory regions" based on memblocks
(which reflects e820), and uses this list to find all "reserved areas"
for each TDMR.

It turns out that the kernel's view of memory holes is too fine grained
and sometimes exceeds the number of holes that the TDX module can track
per TDMR [1], resulting in the above failure.

Thankfully the module also lists memory that is potentially convertible
in a list of "Convertible Memory Regions" (CMRs).  That coarser grained
CMR list tends to track usable memory in the memory map even if it might
be reserved for host usage like 'ACPI data' [2].

Use that list to relax what the kernel considers unusable memory.  If it
falls in a CMR no need to instantiate a hole, and rely on the fact that
kernel will keep what it considers 'reserved' out of the page allocator.

[1] BIOS-E820 table of the problematic platform:

  BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
  BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
  BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
  BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
  BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
  BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
  BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
  BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
  BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
  BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
  BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
  BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
  BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
  BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
  BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
  BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
  BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
  BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
  BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
  BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
  BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
  BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
  BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
  BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
  BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
  BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
  BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
  BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
  BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
  BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
  BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
  BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
  BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
  BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
  BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
  BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
  ......

[2] Convertible Memory Regions of the problematic platform:

  virt/tdx: CMR: [0x100000, 0x6f800000)
  virt/tdx: CMR: [0x100000000, 0x107a000000)
  virt/tdx: CMR: [0x1080000000, 0x207c000000)
  virt/tdx: CMR: [0x2080000000, 0x307c000000)
  virt/tdx: CMR: [0x3080000000, 0x407c000000)

Link: https://github.com/canonical/tdx/issues/135 [*]
Fixes: dde3b60d572c ("x86/virt/tdx: Designate reserved areas for all TDMRs")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5e6d8021681d..b14089f784bf 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -733,29 +733,28 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
 }
 
 /*
- * Go through @tmb_list to find holes between memory areas.  If any of
+ * Go through all CMRs in @sysinfo_cmr to find memory holes.  If any of
  * those holes fall within @tdmr, set up a TDMR reserved area to cover
  * the hole.
  */
-static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+static int tdmr_populate_rsvd_holes(struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info *tdmr,
 				    int *rsvd_idx,
 				    u16 max_reserved_per_tdmr)
 {
-	struct tdx_memblock *tmb;
 	u64 prev_end;
-	int ret;
+	int i, ret;
 
 	/*
 	 * Start looking for reserved blocks at the
 	 * beginning of the TDMR.
 	 */
 	prev_end = tdmr->base;
-	list_for_each_entry(tmb, tmb_list, list) {
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
 		u64 start, end;
 
-		start = PFN_PHYS(tmb->start_pfn);
-		end   = PFN_PHYS(tmb->end_pfn);
+		start = sysinfo_cmr->cmr_base[i];
+		end   = start + sysinfo_cmr->cmr_size[i];
 
 		/* Break if this region is after the TDMR */
 		if (start >= tdmr_end(tdmr))
@@ -856,16 +855,16 @@ static int rsvd_area_cmp_func(const void *a, const void *b)
 
 /*
  * Populate reserved areas for the given @tdmr, including memory holes
- * (via @tmb_list) and PAMTs (via @tdmr_list).
+ * (via @sysinfo_cmr) and PAMTs (via @tdmr_list).
  */
 static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
-				    struct list_head *tmb_list,
+				    struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info_list *tdmr_list,
 				    u16 max_reserved_per_tdmr)
 {
 	int ret, rsvd_idx = 0;
 
-	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+	ret = tdmr_populate_rsvd_holes(sysinfo_cmr, tdmr, &rsvd_idx,
 			max_reserved_per_tdmr);
 	if (ret)
 		return ret;
@@ -884,10 +883,10 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
 
 /*
  * Populate reserved areas for all TDMRs in @tdmr_list, including memory
- * holes (via @tmb_list) and PAMTs.
+ * holes (via @sysinfo_cmr) and PAMTs.
  */
 static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
-					 struct list_head *tmb_list,
+					 struct tdx_sys_info_cmr *sysinfo_cmr,
 					 u16 max_reserved_per_tdmr)
 {
 	int i;
@@ -896,7 +895,7 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 		int ret;
 
 		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
-				tmb_list, tdmr_list, max_reserved_per_tdmr);
+				sysinfo_cmr, tdmr_list, max_reserved_per_tdmr);
 		if (ret)
 			return ret;
 	}
@@ -911,7 +910,8 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
  */
 static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
-			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
+			   struct tdx_sys_info_tdmr *sysinfo_tdmr,
+			   struct tdx_sys_info_cmr *sysinfo_cmr)
 {
 	u16 pamt_entry_size[TDX_PS_NR] = {
 		sysinfo_tdmr->pamt_4k_entry_size,
@@ -928,7 +928,7 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		return ret;
 
-	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
 			sysinfo_tdmr->max_reserved_per_tdmr);
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_list);
@@ -1117,7 +1117,8 @@ static int init_tdx_module(void)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
+			&sysinfo.cmr);
 	if (ret)
 		goto err_free_tdmrs;
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 9/9] x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (7 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
@ 2024-11-13 11:57 ` Kai Huang
  2024-11-13 22:25 ` [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Edgecombe, Rick P
  9 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2024-11-13 11:57 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

Old TDX modules can clobber RBP in the TDH.VP.ENTER SEAMCALL.  However
RBP is used as frame pointer in the x86_64 calling convention, and
clobbering RBP could result in bad things like being unable to unwind
the stack if any non-maskable exceptions (NMI, #MC etc) happens in that
gap.

A new "NO_RBP_MOD" feature was introduced to more recent TDX modules to
not clobber RBP.  This feature is reported in the TDX_FEATURES0 global
metadata field via bit 18.

Don't initialize the TDX module if this feature is not supported [1].

Note the bit definitions of TDX_FEATURES0 are not auto-generated in
tdx_global_metadata.h.  Manually define a macro for it in "tdx.h".

Link: https://lore.kernel.org/fc0e8ab7-86d4-4428-be31-82e1ece6dd21@intel.com/ [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 17 +++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  4 ++++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b14089f784bf..e4a7e0e17812 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -312,6 +312,18 @@ static int init_tdx_sys_info(struct tdx_sys_info *sysinfo)
 	return 0;
 }
 
+static int check_features(struct tdx_sys_info *sysinfo)
+{
+	u64 tdx_features0 = sysinfo->features.tdx_features0;
+
+	if (!(tdx_features0 & TDX_FEATURES0_NO_RBP_MOD)) {
+		pr_err("frame pointer (RBP) clobber bug present, upgrade TDX module\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /* Calculate the actual TDMR size */
 static int tdmr_size_single(u16 max_reserved_per_tdmr)
 {
@@ -1095,6 +1107,11 @@ static int init_tdx_module(void)
 	if (ret)
 		return ret;
 
+	/* Check whether the kernel can support this module */
+	ret = check_features(&sysinfo);
+	if (ret)
+		return ret;
+
 	/*
 	 * To keep things simple, assume that all TDX-protected memory
 	 * will come from the page allocator.  Make sure all pages in the
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 0128b963b723..c8be00f6b15a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/compiler_attributes.h>
 #include <linux/stddef.h>
+#include <linux/bits.h>
 #include "tdx_global_metadata.h"
 
 /*
@@ -54,6 +55,9 @@ struct tdmr_info {
 	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
 } __packed __aligned(TDMR_INFO_ALIGNMENT);
 
+/* Bit definitions of TDX_FEATURES0 metadata field */
+#define TDX_FEATURES0_NO_RBP_MOD	BIT(18)
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes
  2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
                   ` (8 preceding siblings ...)
  2024-11-13 11:57 ` [PATCH v8 9/9] x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation Kai Huang
@ 2024-11-13 22:25 ` Edgecombe, Rick P
  2024-11-13 22:40   ` Huang, Kai
  9 siblings, 1 reply; 28+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 22:25 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, Huang, Kai, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, x86@kernel.org, Yamahata, Isaku

On Thu, 2024-11-14 at 00:57 +1300, Kai Huang wrote:
> This series replaces the existing TDX module metadata reading code with
> a new auto-generated global metadata infrastructure to:
> 
> 1) address two issues in the current TDX module initialization code, and
> 2) have an extendable infrastructure which is super easy to read more
>    metadata and share with KVM for KVM TDX support (and other kernel
>    components for TDX Connect in the future).
> 
> And the reason that we need a new global metadata infrastructure is the
> current one can only read TDMR related metadata fields and it is not
> extendable to read more metadata fields, which is required to address
> both 1) and 2) above.
> 
> Specifically, below two issues in the current module initialization code
> need to be addressed:
> 
> 1) Module initialization may fail on some large systems (e.g., with 4 or
>    more sockets) [1].

Earlier Dave asked:
"What is your goal?  What is the bare minimum amount of code to get there?"

I think the goal right now is to get the critical dependencies for KVM basic TD
boot support upstream. Basically the bare minimum to do something useful. Since
unlike the rest of the pending pre-reqs, this series will need to go through
Linus' tree and back down to KVM's before we can drop the kvm-coco-queue copy,
this part is a little more urgent. At least from an order of operations
perspective.

I looked at this bug, and it seems like if this is not fixed then TDX will not
be usable in some TDX supporting platforms. Is that right? So the consequence of
not having this would be that when KVM support lands not all TDX capable HW can
use TDX. But it won't make the TDX KVM support totally useless. And otherwise
the error condition is handled cleanly in the upstream code.

So I think it is not part of the "bare minimum". I don't have any objection with
it going upstream with rest of this series if it doesn't delay it. But I want to
make sure we don't have any more confusion that will cause further delays.

> 2) Some old modules can clobber host's RBP when existing from the TDX
>    guest, and currently they can be initialized successfully.  We don't
>    want to use such modules thus we should just fail to initialize them
>    to avoid memory/cpu cycle cost of initializing TDX module [2].

I think we need RBP MOD for basic support, or it may cause crashes when we start
booting TDs.

Does all that seem correct?

> 
> The first 6 patches introduce the new auto-generated global metadata
> infrastructure (which is auto-generated using a script [3]), and the
> rest patches address the above two issues.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes
  2024-11-13 22:25 ` [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Edgecombe, Rick P
@ 2024-11-13 22:40   ` Huang, Kai
  2024-11-13 22:53     ` Edgecombe, Rick P
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2024-11-13 22:40 UTC (permalink / raw)
  To: Edgecombe, Rick P, Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, x86@kernel.org, Yamahata, Isaku


> So I think it is not part of the "bare minimum". I don't have any objection with
> it going upstream with rest of this series if it doesn't delay it. But I want to
> make sure we don't have any more confusion that will cause further delays.

We have two issues that need to be addressed.  Addressing them could 
bring the infrastructure that is needed for KVM TDX as well, so this is 
the "minimal code" given the goal I want to achieve here.

> 
>> 2) Some old modules can clobber host's RBP when existing from the TDX
>>     guest, and currently they can be initialized successfully.  We don't
>>     want to use such modules thus we should just fail to initialize them
>>     to avoid memory/cpu cycle cost of initializing TDX module [2].
> 
> I think we need RBP MOD for basic support, or it may cause crashes when we start
> booting TDs.
> 
> Does all that seem correct?

We will need additional patch to save/restore RBP.  The more important 
thing is it's naturally bad, due to the thing that I mentioned in that 
patch:

"...clobbering RBP could result in bad things like being unable to 
unwind the stack if any non-maskable exceptions (NMI, #MC etc) happens 
in that gap."



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes
  2024-11-13 22:40   ` Huang, Kai
@ 2024-11-13 22:53     ` Edgecombe, Rick P
  2024-11-13 23:35       ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 22:53 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de, Huang, Kai,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, x86@kernel.org, Yamahata, Isaku

On Thu, 2024-11-14 at 11:40 +1300, Huang, Kai wrote:
> > So I think it is not part of the "bare minimum". I don't have any objection
> > with
> > it going upstream with rest of this series if it doesn't delay it. But I
> > want to
> > make sure we don't have any more confusion that will cause further delays.
> 
> We have two issues that need to be addressed.  Addressing them could 
> bring the infrastructure that is needed for KVM TDX as well, so this is 
> the "minimal code" given the goal I want to achieve here.

I'm confused by this. "could bring the infrastructure"? What is the "goal I want
to achieve"?

Let me ask it another way, if we drop patches 7 and 8 and pushed them in a later
series. (say after TDX gets upstream for the sake of this question, but I'm not
suggesting a schedule). Then what is the consequence to the goal of booting a TD
on some HW?

I'm not questioning that the patches are in order, or that they should be fixed
urgently. I'm just trying to make sure we are clear to Dave on why this is all
needed.

> 
> > 
> > > 2) Some old modules can clobber host's RBP when existing from the TDX
> > >     guest, and currently they can be initialized successfully.  We don't
> > >     want to use such modules thus we should just fail to initialize them
> > >     to avoid memory/cpu cycle cost of initializing TDX module [2].
> > 
> > I think we need RBP MOD for basic support, or it may cause crashes when we
> > start
> > booting TDs.
> > 
> > Does all that seem correct?
> 
> We will need additional patch to save/restore RBP.  

I think you are talking about the workaround patch that was NAKed and we have
now dropped from the dev branch since we can now rely on NO_RBP_MOD? So we
*don't* need that patch...?

> The more important 
> thing is it's naturally bad, due to the thing that I mentioned in that 
> patch:
> 
> "...clobbering RBP could result in bad things like being unable to 
> unwind the stack if any non-maskable exceptions (NMI, #MC etc) happens 
> in that gap."
> 
> 

Yea I think NO_RBP_MOD is required.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes
  2024-11-13 22:53     ` Edgecombe, Rick P
@ 2024-11-13 23:35       ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-11-13 23:35 UTC (permalink / raw)
  To: Edgecombe, Rick P, Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, x86@kernel.org, Yamahata, Isaku



On 14/11/2024 11:53 am, Edgecombe, Rick P wrote:
> On Thu, 2024-11-14 at 11:40 +1300, Huang, Kai wrote:
>>> So I think it is not part of the "bare minimum". I don't have any objection
>>> with
>>> it going upstream with rest of this series if it doesn't delay it. But I
>>> want to
>>> make sure we don't have any more confusion that will cause further delays.
>>
>> We have two issues that need to be addressed.  Addressing them could
>> bring the infrastructure that is needed for KVM TDX as well, so this is
>> the "minimal code" given the goal I want to achieve here.
> 
> I'm confused by this. "could bring the infrastructure"? What is the "goal I want
> to achieve"?

The goal is mentioned in beginning of the cover letter:

1) address two issues in the current TDX module initialization code, and
2) have an extendable infrastructure which is super easy to read more
    metadata and share with KVM for KVM TDX support (and other kernel
    components for TDX Connect in the future).

> 
> Let me ask it another way, if we drop patches 7 and 8 and pushed them in a later
> series. (say after TDX gets upstream for the sake of this question, but I'm not
> suggesting a schedule). Then what is the consequence to the goal of booting a TD
> on some HW?

The patch to fix the bug is not a dependency of KVM TDX, but it is a 
real issue.  Customers have reported this issue, requested the fix and 
integrated the fix to their environment while we are still in middle of 
upstreaming KVM TDX.

So I don't see anything wrong to include it here.  It also gives us an 
additional user of the new metadata infrastructure.

> 
> I'm not questioning that the patches are in order, or that they should be fixed
> urgently. I'm just trying to make sure we are clear to Dave on why this is all
> needed.

Yeah I'll certainly follow Dave's request.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
@ 2024-12-04 14:22   ` Huang, Kai
  2024-12-05 12:45     ` Huang, Kai
  2024-12-05 12:40   ` [PATCH v8 8.1/9] " Kai Huang
  2024-12-09  6:50   ` [PATCH v8 8.2/9] " Kai Huang
  2 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2024-12-04 14:22 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

On Thu, 2024-11-14 at 00:57 +1300, Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform [*]:
> 
>   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
>   virt/tdx: module initialization failed (-28)
> 
> As part of initializing the TDX module, the kernel informs the TDX
> module of all "TDX-usable memory regions" using an array of TDX defined
> structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB aligned
> and in 1GB granularity, and all "non-TDX-usable memory holes" within a
> given TDMR are marked as "reserved areas".  The TDX module reports a
> maximum number of reserved areas that can be supported per TDMR (16).
> 
> The kernel builds the "TDX-usable memory regions" based on memblocks
> (which reflects e820), and uses this list to find all "reserved areas"
> for each TDMR.
> 
> It turns out that the kernel's view of memory holes is too fine grained
> and sometimes exceeds the number of holes that the TDX module can track
> per TDMR [1], resulting in the above failure.
> 
> Thankfully the module also lists memory that is potentially convertible
> in a list of "Convertible Memory Regions" (CMRs).  That coarser grained
> CMR list tends to track usable memory in the memory map even if it might
> be reserved for host usage like 'ACPI data' [2].
> 
> Use that list to relax what the kernel considers unusable memory.  If it
> falls in a CMR no need to instantiate a hole, and rely on the fact that
> kernel will keep what it considers 'reserved' out of the page allocator.
> 

Hi Dave,

I realized there's one issue if we change to use CMRs to fill up reserved areas
for TDMRs after some internal discussion with Dan:

Currently we requires all memory regions in memblocks.memory to be TDX
convertible memory at the time of initializing the TDX module to make module
initialization successful.  We build a list of those memory regions as a list of
"TDX memory blocks", and use them to construct the TDMRs to configure the
module.  But we don't explicitly check those memory regions against CMRs to make
sure they are truly TDX convertible, instead we depend on TDH.SYS.CONFIG to
catch any non-CMR memory regions that end up to the TDX memory blocks.

This works fine because currently we use those TDX memory blocks to fill up the
reserved areas of TDMRs, i.e., any non-CMR regions in TDX memory blocks will end
up to "non-reserved areas" in TDMR(s), and this will cause TDH.SYS.CONFIG to
fail.

After we change to using CMRs to fill up reserved areas of TDMRs, then all non-
CMR regions will be marked as "reserved areas", regardless whether there is any
non-CMR memory region in the TDX memory blocks.  This will result in TDX module
initialization being successful while there are non-CMR pages going into page
allocator.

To avoid this, we need to explicitly check all the TDX memory blocks against
CMRs to make sure they are actually TDX convertible, before using CMRs to fill
up reserved areas.

I ended up with below incremental diff with some additional text for the
changelog.

Do you have any comments?

------

    Currently, the kernel does not explicitly check all TDX-usable memory
    regions (that come from memblock) to be truly TDX convertible (not
    feasible w/o CMRs anyway) but depends on the TDH.SYS.CONFIG to fail if
    there's any non-CMR memory region ended up as TDX-usable memory.
    
    After changing to using CMRs to fill up reserved areas, unfortunately
    the TDH.SYS.CONFIG will no longer be able to catch such non-CMR regions.
    This is because any non-CMR region will always end up with reserved
    area even if it is included in TDX-usable memory regions.
    
    To still make sure of that, explicitly check all TDX-usable memory
    regions are truly TDX convertible against CMRs when stashing them from
    the memblocks.
    
    Also print an error message if any region is non-CMR so the users can
    easily know the reason of the failure.  This is nice to have anyway
    because a clear message is better than having to decipher the error code
    of the TDH.SYS.CONFIG.
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e4a7e0e17812..d25bdf8ca136 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -176,6 +176,23 @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);
 
+/* Check whether a given memory region is a sub-region of any CMR. */
+static bool is_cmr_sub_region(unsigned long start_pfn, unsigned long end_pfn,
+                             struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+       int i;
+
+       for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
+               u64 cmr_base_pfn = sysinfo_cmr->cmr_base[i] >> PAGE_SHIFT;
+               u64 cmr_npages = sysinfo_cmr->cmr_size[i] >> PAGE_SHIFT;
+
+               if (start_pfn >= cmr_base_pfn &&
+                               end_pfn <= (cmr_base_pfn + cmr_npages))
+                       return true;
+       }
+
+       return false;
+}
 /*
  * Add a memory region as a TDX memory block.  The caller must make sure
  * all memory regions are added in address ascending order and don't
@@ -218,7 +235,8 @@ static void free_tdx_memlist(struct list_head *tmb_list)
  * ranges off in a secondary structure because memblock is modified
  * in memory hotplug while TDX memory regions are fixed.
  */
-static int build_tdx_memlist(struct list_head *tmb_list)
+static int build_tdx_memlist(struct list_head *tmb_list,
+                            struct tdx_sys_info_cmr *sysinfo_cmr)
 {
        unsigned long start_pfn, end_pfn;
        int i, nid, ret;
@@ -234,6 +252,27 @@ static int build_tdx_memlist(struct list_head *tmb_list)
                if (start_pfn >= end_pfn)
                        continue;
 
+               /*
+                * Make sure the to-be-added memory region is truly TDX
+                * convertible memory.
+                *
+                * Note:
+                *
+                * The to-be-added memory region here doesn't cross NUMA
+                * nodes.  The check here assumes the memory region does
+                * not cross any adjacent CMRs which are contiguous
+                * (i.e., the end of the first CMR is the start of the
+                * next) _AND_ are in the same NUMA node.  A sane BIOS
+                * should never report memory regions and CMRs in such
+                * way.
+                */
+               if (!is_cmr_sub_region(start_pfn, end_pfn, sysinfo_cmr)) {
+                       pr_err("memory region [0x%lx, 0x%lx) is not TDX
convertible memory.\n",
+                                       PHYS_PFN(start_pfn), PHYS_PFN(end_pfn));
+                       ret = -EINVAL;
+                       goto err;
+               }
+
                /*
                 * Add the memory regions as TDX memory.  The regions in
                 * memblock has already guaranteed they are in address
@@ -940,6 +979,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
        if (ret)
                return ret;
 
+       /*
+        * Use CMRs instead of the TDX memory blocks to populate the
+        * reserved areas to reduce consumption of reserved areas for
+        * each TDMR.  On some large systems (e.g., a machine with 4 or
+        * more sockets), the BIOS may report many usable memory regions
+        * in the first 1GB in e820.  This may result in reserved areas
+        * of the first TDMR being exhausted if TDX memory blocks are
+        * used to fill up reserved areas.
+        */
        ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
                        sysinfo_tdmr->max_reserved_per_tdmr);
        if (ret)
@@ -1124,7 +1172,7 @@ static int init_tdx_module(void)
         */
        get_online_mems();
 
-       ret = build_tdx_memlist(&tdx_memlist);
+       ret = build_tdx_memlist(&tdx_memlist, &sysinfo.cmr);
        if (ret)
                goto out_put_tdxmem;


 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 8.1/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
  2024-12-04 14:22   ` Huang, Kai
@ 2024-12-05 12:40   ` Kai Huang
  2024-12-05 18:10     ` Dave Hansen
  2024-12-09  6:50   ` [PATCH v8 8.2/9] " Kai Huang
  2 siblings, 1 reply; 28+ messages in thread
From: Kai Huang @ 2024-12-05 12:40 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

A TDX module initialization failure was reported on a Emerald Rapids
platform [*]:

  virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
  virt/tdx: module initialization failed (-28)

As part of initializing the TDX module, the kernel informs the TDX
module of all "TDX-usable memory regions" using an array of TDX defined
structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB aligned
and in 1GB granularity, and all "non-TDX-usable memory holes" within a
given TDMR are marked as "reserved areas".  The TDX module reports a
maximum number of reserved areas that can be supported per TDMR (16).

The kernel builds the "TDX-usable memory regions" based on memblocks
(which reflects e820), and uses this list to find all "reserved areas"
for each TDMR.

It turns out that the kernel's view of memory holes is too fine grained
and sometimes exceeds the number of holes that the TDX module can track
per TDMR [1], resulting in the above failure.

Thankfully the module also lists memory that is potentially convertible
in a list of "Convertible Memory Regions" (CMRs).  That coarser grained
CMR list tends to track usable memory in the memory map even if it might
be reserved for host usage like 'ACPI data' [2].

Use that list to relax what the kernel considers unusable memory.  If it
falls in a CMR no need to instantiate a hole, and rely on the kernel to
keep what it considers 'reserved' out of the page allocator.

Currently, the kernel does not explicitly check all TDX-usable memory
regions (that come from memblock) to be truly TDX convertible (not
feasible w/o CMRs anyway) but depends on the TDH.SYS.CONFIG to fail if
there's any non-CMR memory region ended up as TDX-usable memory.

After changing to using CMRs to fill up reserved areas, unfortunately
the TDH.SYS.CONFIG will no longer be able to catch such non-CMR regions.
This is because any non-CMR region will always end up with reserved
area even if it is included in TDX-usable memory regions.

To still make sure of that, explicitly check all TDX-usable memory
regions are truly TDX convertible against CMRs when stashing them from
the memblocks.

Also print an error message if any region is non-CMR so the users can
easily know the reason of the failure.  This is nice to have anyway
because a clear message is better than having to decipher the error code
of the TDH.SYS.CONFIG.

[1] BIOS-E820 table of the problematic platform:

  BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
  BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
  BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
  BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
  BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
  BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
  BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
  BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
  BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
  BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
  BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
  BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
  BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
  BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
  BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
  BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
  BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
  BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
  BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
  BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
  BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
  BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
  BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
  BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
  BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
  BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
  BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
  BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
  BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
  BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
  BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
  BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
  BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
  BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
  BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
  BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
  ......

[2] Convertible Memory Regions of the problematic platform:

  virt/tdx: CMR: [0x100000, 0x6f800000)
  virt/tdx: CMR: [0x100000000, 0x107a000000)
  virt/tdx: CMR: [0x1080000000, 0x207c000000)
  virt/tdx: CMR: [0x2080000000, 0x307c000000)
  virt/tdx: CMR: [0x3080000000, 0x407c000000)

Link: https://github.com/canonical/tdx/issues/135 [*]
Fixes: dde3b60d572c ("x86/virt/tdx: Designate reserved areas for all TDMRs")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---

v8 -> v8.1:
 - Add code to explicitly check TDX memory blocks against CMRs at early
   stage since TDH.SYS.CONFIG can no longer catch any non-CMR region in
   TDX memory blocks.

---
 arch/x86/virt/vmx/tdx/tdx.c | 85 +++++++++++++++++++++++++++++--------
 1 file changed, 67 insertions(+), 18 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5e6d8021681d..687a0d520785 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -176,6 +176,23 @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);
 
+/* Check whether a given memory region is a sub-region of any CMR. */
+static bool is_cmr_sub_region(unsigned long start_pfn, unsigned long end_pfn,
+			      struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+	int i;
+
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
+		u64 cmr_base_pfn = sysinfo_cmr->cmr_base[i] >> PAGE_SHIFT;
+		u64 cmr_npages = sysinfo_cmr->cmr_size[i] >> PAGE_SHIFT;
+
+		if (start_pfn >= cmr_base_pfn &&
+				end_pfn <= (cmr_base_pfn + cmr_npages))
+			return true;
+	}
+
+	return false;
+}
 /*
  * Add a memory region as a TDX memory block.  The caller must make sure
  * all memory regions are added in address ascending order and don't
@@ -218,7 +235,8 @@ static void free_tdx_memlist(struct list_head *tmb_list)
  * ranges off in a secondary structure because memblock is modified
  * in memory hotplug while TDX memory regions are fixed.
  */
-static int build_tdx_memlist(struct list_head *tmb_list)
+static int build_tdx_memlist(struct list_head *tmb_list,
+			     struct tdx_sys_info_cmr *sysinfo_cmr)
 {
 	unsigned long start_pfn, end_pfn;
 	int i, nid, ret;
@@ -234,6 +252,27 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 		if (start_pfn >= end_pfn)
 			continue;
 
+		/*
+		 * Make sure the to-be-added memory region is truly TDX
+		 * convertible memory.
+		 *
+		 * Note:
+		 *
+		 * The to-be-added memory region here doesn't cross NUMA
+		 * nodes.  The check here assumes the memory region does
+		 * not cross any adjacent CMRs which are contiguous
+		 * (i.e., the end of the first CMR is the start of the
+		 * next) _AND_ are in the same NUMA node.  A sane BIOS
+		 * should never report memory regions and CMRs in such
+		 * way.
+		 */
+		if (!is_cmr_sub_region(start_pfn, end_pfn, sysinfo_cmr)) {
+			pr_err("memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
+					PHYS_PFN(start_pfn), PHYS_PFN(end_pfn));
+			ret = -EINVAL;
+			goto err;
+		}
+
 		/*
 		 * Add the memory regions as TDX memory.  The regions in
 		 * memblock has already guaranteed they are in address
@@ -733,29 +772,28 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
 }
 
 /*
- * Go through @tmb_list to find holes between memory areas.  If any of
+ * Go through all CMRs in @sysinfo_cmr to find memory holes.  If any of
  * those holes fall within @tdmr, set up a TDMR reserved area to cover
  * the hole.
  */
-static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+static int tdmr_populate_rsvd_holes(struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info *tdmr,
 				    int *rsvd_idx,
 				    u16 max_reserved_per_tdmr)
 {
-	struct tdx_memblock *tmb;
 	u64 prev_end;
-	int ret;
+	int i, ret;
 
 	/*
 	 * Start looking for reserved blocks at the
 	 * beginning of the TDMR.
 	 */
 	prev_end = tdmr->base;
-	list_for_each_entry(tmb, tmb_list, list) {
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
 		u64 start, end;
 
-		start = PFN_PHYS(tmb->start_pfn);
-		end   = PFN_PHYS(tmb->end_pfn);
+		start = sysinfo_cmr->cmr_base[i];
+		end   = start + sysinfo_cmr->cmr_size[i];
 
 		/* Break if this region is after the TDMR */
 		if (start >= tdmr_end(tdmr))
@@ -856,16 +894,16 @@ static int rsvd_area_cmp_func(const void *a, const void *b)
 
 /*
  * Populate reserved areas for the given @tdmr, including memory holes
- * (via @tmb_list) and PAMTs (via @tdmr_list).
+ * (via @sysinfo_cmr) and PAMTs (via @tdmr_list).
  */
 static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
-				    struct list_head *tmb_list,
+				    struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info_list *tdmr_list,
 				    u16 max_reserved_per_tdmr)
 {
 	int ret, rsvd_idx = 0;
 
-	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+	ret = tdmr_populate_rsvd_holes(sysinfo_cmr, tdmr, &rsvd_idx,
 			max_reserved_per_tdmr);
 	if (ret)
 		return ret;
@@ -884,10 +922,10 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
 
 /*
  * Populate reserved areas for all TDMRs in @tdmr_list, including memory
- * holes (via @tmb_list) and PAMTs.
+ * holes (via @sysinfo_cmr) and PAMTs.
  */
 static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
-					 struct list_head *tmb_list,
+					 struct tdx_sys_info_cmr *sysinfo_cmr,
 					 u16 max_reserved_per_tdmr)
 {
 	int i;
@@ -896,7 +934,7 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 		int ret;
 
 		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
-				tmb_list, tdmr_list, max_reserved_per_tdmr);
+				sysinfo_cmr, tdmr_list, max_reserved_per_tdmr);
 		if (ret)
 			return ret;
 	}
@@ -911,7 +949,8 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
  */
 static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
-			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
+			   struct tdx_sys_info_tdmr *sysinfo_tdmr,
+			   struct tdx_sys_info_cmr *sysinfo_cmr)
 {
 	u16 pamt_entry_size[TDX_PS_NR] = {
 		sysinfo_tdmr->pamt_4k_entry_size,
@@ -928,7 +967,16 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		return ret;
 
-	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+	/*
+	 * Use CMRs instead of the TDX memory blocks to populate the
+	 * reserved areas to reduce consumption of reserved areas for
+	 * each TDMR.  On some large systems (e.g., a machine with 4 or
+	 * more sockets), the BIOS may report many usable memory regions
+	 * in the first 1GB in e820.  This may result in reserved areas
+	 * of the first TDMR being exhausted if TDX memory blocks are
+	 * used to fill up reserved areas.
+	 */
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
 			sysinfo_tdmr->max_reserved_per_tdmr);
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_list);
@@ -1107,7 +1155,7 @@ static int init_tdx_module(void)
 	 */
 	get_online_mems();
 
-	ret = build_tdx_memlist(&tdx_memlist);
+	ret = build_tdx_memlist(&tdx_memlist, &sysinfo.cmr);
 	if (ret)
 		goto out_put_tdxmem;
 
@@ -1117,7 +1165,8 @@ static int init_tdx_module(void)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
+			&sysinfo.cmr);
 	if (ret)
 		goto err_free_tdmrs;
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-04 14:22   ` Huang, Kai
@ 2024-12-05 12:45     ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-12-05 12:45 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, Williams, Dan J
  Cc: Yamahata, Isaku, nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org

On Wed, 2024-12-04 at 14:22 +0000, Huang, Kai wrote:
> On Thu, 2024-11-14 at 00:57 +1300, Kai Huang wrote:
> > A TDX module initialization failure was reported on a Emerald Rapids
> > platform [*]:
> > 
> >   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> >   virt/tdx: module initialization failed (-28)
> > 
> > As part of initializing the TDX module, the kernel informs the TDX
> > module of all "TDX-usable memory regions" using an array of TDX defined
> > structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB aligned
> > and in 1GB granularity, and all "non-TDX-usable memory holes" within a
> > given TDMR are marked as "reserved areas".  The TDX module reports a
> > maximum number of reserved areas that can be supported per TDMR (16).
> > 
> > The kernel builds the "TDX-usable memory regions" based on memblocks
> > (which reflects e820), and uses this list to find all "reserved areas"
> > for each TDMR.
> > 
> > It turns out that the kernel's view of memory holes is too fine grained
> > and sometimes exceeds the number of holes that the TDX module can track
> > per TDMR [1], resulting in the above failure.
> > 
> > Thankfully the module also lists memory that is potentially convertible
> > in a list of "Convertible Memory Regions" (CMRs).  That coarser grained
> > CMR list tends to track usable memory in the memory map even if it might
> > be reserved for host usage like 'ACPI data' [2].
> > 
> > Use that list to relax what the kernel considers unusable memory.  If it
> > falls in a CMR no need to instantiate a hole, and rely on the fact that
> > kernel will keep what it considers 'reserved' out of the page allocator.
> > 
> 
> Hi Dave,
> 
> I realized there's one issue if we change to use CMRs to fill up reserved areas
> for TDMRs after some internal discussion with Dan:
> 
> Currently we requires all memory regions in memblocks.memory to be TDX
> convertible memory at the time of initializing the TDX module to make module
> initialization successful.  We build a list of those memory regions as a list of
> "TDX memory blocks", and use them to construct the TDMRs to configure the
> module.  But we don't explicitly check those memory regions against CMRs to make
> sure they are truly TDX convertible, instead we depend on TDH.SYS.CONFIG to
> catch any non-CMR memory regions that end up to the TDX memory blocks.
> 
> This works fine because currently we use those TDX memory blocks to fill up the
> reserved areas of TDMRs, i.e., any non-CMR regions in TDX memory blocks will end
> up to "non-reserved areas" in TDMR(s), and this will cause TDH.SYS.CONFIG to
> fail.
> 
> After we change to using CMRs to fill up reserved areas of TDMRs, then all non-
> CMR regions will be marked as "reserved areas", regardless whether there is any
> non-CMR memory region in the TDX memory blocks.  This will result in TDX module
> initialization being successful while there are non-CMR pages going into page
> allocator.
> 
> To avoid this, we need to explicitly check all the TDX memory blocks against
> CMRs to make sure they are actually TDX convertible, before using CMRs to fill
> up reserved areas.
> 
> I ended up with below incremental diff with some additional text for the
> changelog.
> 
> Do you have any comments?
> 

Hi Dave,

I sent out the updated patch as "PATCH v8 8.1/9".  I appreciate if you can
review.

Hi Dan,

I didn't remove your Reviewed-by because I think the additional code added
should be straightforward.  But please let me know if you have any concern.

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.1/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-05 12:40   ` [PATCH v8 8.1/9] " Kai Huang
@ 2024-12-05 18:10     ` Dave Hansen
  2024-12-06  2:45       ` Huang, Kai
  2024-12-09  6:57       ` Huang, Kai
  0 siblings, 2 replies; 28+ messages in thread
From: Dave Hansen @ 2024-12-05 18:10 UTC (permalink / raw)
  To: Kai Huang, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov

On 12/5/24 04:40, Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform [*]:
> 
>   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
>   virt/tdx: module initialization failed (-28)

There's a *LOT* of changelog here, but I'm not sure how much of it is
actually relevant to the problem at hand. I also think it's wrong to to
present the problem as one of being too fine-grained.

Could you please rework the changelog and the comments to make this more
succinct?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.1/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-05 18:10     ` Dave Hansen
@ 2024-12-06  2:45       ` Huang, Kai
  2024-12-09  6:57       ` Huang, Kai
  1 sibling, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-12-06  2:45 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

On Thu, 2024-12-05 at 10:10 -0800, Dave Hansen wrote:
> On 12/5/24 04:40, Kai Huang wrote:
> > A TDX module initialization failure was reported on a Emerald Rapids
> > platform [*]:
> > 
> >   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> >   virt/tdx: module initialization failed (-28)
> 
> There's a *LOT* of changelog here, but I'm not sure how much of it is
> actually relevant to the problem at hand. I also think it's wrong to to
> present the problem as one of being too fine-grained.
> 
> Could you please rework the changelog and the comments to make this more
> succinct?

Thanks for the feedback.  Yeah I agree they can be more concise.  I'll update
and send out the updated patch.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
  2024-12-04 14:22   ` Huang, Kai
  2024-12-05 12:40   ` [PATCH v8 8.1/9] " Kai Huang
@ 2024-12-09  6:50   ` Kai Huang
  2024-12-09 22:54     ` Dave Hansen
  2 siblings, 1 reply; 28+ messages in thread
From: Kai Huang @ 2024-12-09  6:50 UTC (permalink / raw)
  To: dave.hansen, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov, kai.huang

A TDX module initialization failure was reported on an Emerald Rapids
platform [*]:

  virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
  virt/tdx: module initialization failed (-28)

The kernel informs the TDX module of "TDX-usable memory regions" via the
structure "TD Memory Region" (TDMR).  Each TDMR contains a limited
number of "reserved areas" to inform the TDX module of the regions that
cannot be used by TDX.

The kernel builds the list of "TDX-usable memory regions" from memblock
(which reflects e820) and marks all memory holes as "reserved areas" in
TDMRs.  It turns out on some large systems the holes in memblock can be
too fine-grained [1] and exceed the number of reserved areas that the
module can track per TDMR, resulting in the failure mentioned above.

The TDX module also reports TDX-capable memory as "Convertible Memory
Regions" (CMRs).  CMRs tend to be coarser-grained [2] than the e820.
Use CMRs to find memory holes when populating reserved areas to reduce
their consumption.

Note the kernel does not prevent non-CMR memory from being added to
"TDX-usable memory regions" but depends on the TDX module to catch in
the TDH.SYS.CONFIG.  After switching to using CMRs to populate reserved
areas this will no longer work.  To ensure no non-CMR memory is included
in the TDMRs, verify that the memory region is truly TDX convertible
before adding it as a TDX-usable memory region at early stage.

[1] BIOS-E820 table of the problematic platform:

  BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
  BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
  BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
  BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
  BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
  BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
  BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
  BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
  BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
  BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
  BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
  BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
  BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
  BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
  BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
  BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
  BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
  BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
  BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
  BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
  BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
  BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
  BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
  BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
  BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
  BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
  BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
  BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
  BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
  BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
  BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
  BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
  BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
  BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
  BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
  BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
  ......

[2] Convertible Memory Regions of the problematic platform:

  virt/tdx: CMR: [0x100000, 0x6f800000)
  virt/tdx: CMR: [0x100000000, 0x107a000000)
  virt/tdx: CMR: [0x1080000000, 0x207c000000)
  virt/tdx: CMR: [0x2080000000, 0x307c000000)
  virt/tdx: CMR: [0x3080000000, 0x407c000000)

Link: https://github.com/canonical/tdx/issues/135 [*]
Fixes: dde3b60d572c ("x86/virt/tdx: Designate reserved areas for all TDMRs")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---

v8.1 -> v8.2:
 - Trim down the changelog and comments. (Dave)

---
 arch/x86/virt/vmx/tdx/tdx.c | 74 ++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 18 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5e6d8021681d..5c6a743de333 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -176,6 +176,23 @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);
 
+/* Check whether a given memory region is a sub-region of any CMR. */
+static bool is_cmr_sub_region(unsigned long start_pfn, unsigned long end_pfn,
+			      struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+	int i;
+
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
+		u64 cmr_base_pfn = sysinfo_cmr->cmr_base[i] >> PAGE_SHIFT;
+		u64 cmr_npages = sysinfo_cmr->cmr_size[i] >> PAGE_SHIFT;
+
+		if (start_pfn >= cmr_base_pfn &&
+				end_pfn <= (cmr_base_pfn + cmr_npages))
+			return true;
+	}
+
+	return false;
+}
 /*
  * Add a memory region as a TDX memory block.  The caller must make sure
  * all memory regions are added in address ascending order and don't
@@ -218,7 +235,8 @@ static void free_tdx_memlist(struct list_head *tmb_list)
  * ranges off in a secondary structure because memblock is modified
  * in memory hotplug while TDX memory regions are fixed.
  */
-static int build_tdx_memlist(struct list_head *tmb_list)
+static int build_tdx_memlist(struct list_head *tmb_list,
+			     struct tdx_sys_info_cmr *sysinfo_cmr)
 {
 	unsigned long start_pfn, end_pfn;
 	int i, nid, ret;
@@ -234,6 +252,18 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 		if (start_pfn >= end_pfn)
 			continue;
 
+		/*
+		 * Make sure the to-be-added memory region is truly TDX
+		 * convertible.  This ensures no non-TDX convertible
+		 * memory can end up to page allocator.
+		 */
+		if (!is_cmr_sub_region(start_pfn, end_pfn, sysinfo_cmr)) {
+			pr_err("memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
+					PHYS_PFN(start_pfn), PHYS_PFN(end_pfn));
+			ret = -EINVAL;
+			goto err;
+		}
+
 		/*
 		 * Add the memory regions as TDX memory.  The regions in
 		 * memblock has already guaranteed they are in address
@@ -733,29 +763,28 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
 }
 
 /*
- * Go through @tmb_list to find holes between memory areas.  If any of
+ * Go through all CMRs in @sysinfo_cmr to find memory holes.  If any of
  * those holes fall within @tdmr, set up a TDMR reserved area to cover
  * the hole.
  */
-static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+static int tdmr_populate_rsvd_holes(struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info *tdmr,
 				    int *rsvd_idx,
 				    u16 max_reserved_per_tdmr)
 {
-	struct tdx_memblock *tmb;
 	u64 prev_end;
-	int ret;
+	int i, ret;
 
 	/*
 	 * Start looking for reserved blocks at the
 	 * beginning of the TDMR.
 	 */
 	prev_end = tdmr->base;
-	list_for_each_entry(tmb, tmb_list, list) {
+	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
 		u64 start, end;
 
-		start = PFN_PHYS(tmb->start_pfn);
-		end   = PFN_PHYS(tmb->end_pfn);
+		start = sysinfo_cmr->cmr_base[i];
+		end   = start + sysinfo_cmr->cmr_size[i];
 
 		/* Break if this region is after the TDMR */
 		if (start >= tdmr_end(tdmr))
@@ -856,16 +885,16 @@ static int rsvd_area_cmp_func(const void *a, const void *b)
 
 /*
  * Populate reserved areas for the given @tdmr, including memory holes
- * (via @tmb_list) and PAMTs (via @tdmr_list).
+ * (via @sysinfo_cmr) and PAMTs (via @tdmr_list).
  */
 static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
-				    struct list_head *tmb_list,
+				    struct tdx_sys_info_cmr *sysinfo_cmr,
 				    struct tdmr_info_list *tdmr_list,
 				    u16 max_reserved_per_tdmr)
 {
 	int ret, rsvd_idx = 0;
 
-	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+	ret = tdmr_populate_rsvd_holes(sysinfo_cmr, tdmr, &rsvd_idx,
 			max_reserved_per_tdmr);
 	if (ret)
 		return ret;
@@ -884,10 +913,10 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
 
 /*
  * Populate reserved areas for all TDMRs in @tdmr_list, including memory
- * holes (via @tmb_list) and PAMTs.
+ * holes (via @sysinfo_cmr) and PAMTs.
  */
 static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
-					 struct list_head *tmb_list,
+					 struct tdx_sys_info_cmr *sysinfo_cmr,
 					 u16 max_reserved_per_tdmr)
 {
 	int i;
@@ -896,7 +925,7 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 		int ret;
 
 		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
-				tmb_list, tdmr_list, max_reserved_per_tdmr);
+				sysinfo_cmr, tdmr_list, max_reserved_per_tdmr);
 		if (ret)
 			return ret;
 	}
@@ -911,7 +940,8 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
  */
 static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
-			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
+			   struct tdx_sys_info_tdmr *sysinfo_tdmr,
+			   struct tdx_sys_info_cmr *sysinfo_cmr)
 {
 	u16 pamt_entry_size[TDX_PS_NR] = {
 		sysinfo_tdmr->pamt_4k_entry_size,
@@ -928,7 +958,14 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		return ret;
 
-	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+	/*
+	 * On some large systems, the TDX memory blocks (which reflects
+	 * e820) in the first 1GB can be too fine-grained.  Using them
+	 * to populate reserved areas may result in reserved areas being
+	 * exhausted.  CMRs are coarser-grained than e820.  Use CMRs to
+	 * populate reserved areas to reduce their consumption.
+	 */
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
 			sysinfo_tdmr->max_reserved_per_tdmr);
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_list);
@@ -1107,7 +1144,7 @@ static int init_tdx_module(void)
 	 */
 	get_online_mems();
 
-	ret = build_tdx_memlist(&tdx_memlist);
+	ret = build_tdx_memlist(&tdx_memlist, &sysinfo.cmr);
 	if (ret)
 		goto out_put_tdxmem;
 
@@ -1117,7 +1154,8 @@ static int init_tdx_module(void)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
+			&sysinfo.cmr);
 	if (ret)
 		goto err_free_tdmrs;
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.1/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-05 18:10     ` Dave Hansen
  2024-12-06  2:45       ` Huang, Kai
@ 2024-12-09  6:57       ` Huang, Kai
  1 sibling, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-12-09  6:57 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

On Thu, 2024-12-05 at 10:10 -0800, Hansen, Dave wrote:
> On 12/5/24 04:40, Kai Huang wrote:
> > A TDX module initialization failure was reported on a Emerald Rapids
> > platform [*]:
> > 
> >   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> >   virt/tdx: module initialization failed (-28)
> 
> There's a *LOT* of changelog here, but I'm not sure how much of it is
> actually relevant to the problem at hand. I also think it's wrong to to
> present the problem as one of being too fine-grained.
> 
> Could you please rework the changelog and the comments to make this more
> succinct?

Hi Dave,

I sent out the "PATCH v8 8.2/9".  I trimmed down the changelog to keep it
succinct.  I also trimmed the comments around the is_cmr_sub_region() in
build_tdx_memory() and the tdmrs_populate_rsvd_areas_all() in construct_tdmrs()
to explain the code change.

Appreciate your review.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-09  6:50   ` [PATCH v8 8.2/9] " Kai Huang
@ 2024-12-09 22:54     ` Dave Hansen
  2024-12-10  2:26       ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2024-12-09 22:54 UTC (permalink / raw)
  To: Kai Huang, kirill.shutemov, tglx, bp, peterz, mingo, hpa,
	dan.j.williams, seanjc, pbonzini
  Cc: x86, linux-kernel, rick.p.edgecombe, isaku.yamahata,
	adrian.hunter, nik.borisov

On 12/8/24 22:50, Kai Huang wrote:
> A TDX module initialization failure was reported on an Emerald Rapids
> platform [*]:
> 
>   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
>   virt/tdx: module initialization failed (-28)
> 
> The kernel informs the TDX module of "TDX-usable memory regions" via the
> structure "TD Memory Region" (TDMR).  Each TDMR contains a limited
> number of "reserved areas" to inform the TDX module of the regions that
> cannot be used by TDX.
> 
> The kernel builds the list of "TDX-usable memory regions" from memblock
> (which reflects e820) and marks all memory holes as "reserved areas" in
> TDMRs.  It turns out on some large systems the holes in memblock can be
> too fine-grained [1] and exceed the number of reserved areas that the
> module can track per TDMR, resulting in the failure mentioned above.
> 
> The TDX module also reports TDX-capable memory as "Convertible Memory
> Regions" (CMRs).  CMRs tend to be coarser-grained [2] than the e820.
> Use CMRs to find memory holes when populating reserved areas to reduce
> their consumption.
> 
> Note the kernel does not prevent non-CMR memory from being added to
> "TDX-usable memory regions" but depends on the TDX module to catch in
> the TDH.SYS.CONFIG.  After switching to using CMRs to populate reserved
> areas this will no longer work.  To ensure no non-CMR memory is included
> in the TDMRs, verify that the memory region is truly TDX convertible
> before adding it as a TDX-usable memory region at early stage.

Thanks for trimming the changelog down.  But this changelog never
actually says what the fix is. It's also quite heavy on the "what" and
very light on the "why".

I think the "why" boils down to the fact that the kernel is treating RAM
-- as defined by the platform and TDX module -- as non-RAM.

> -	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
> +	/*
> +	 * On some large systems, the TDX memory blocks (which reflects
> +	 * e820) in the first 1GB can be too fine-grained.  Using them
> +	 * to populate reserved areas may result in reserved areas being
> +	 * exhausted.  CMRs are coarser-grained than e820.  Use CMRs to
> +	 * populate reserved areas to reduce their consumption.
> +	 */

I think there are still too many details here for a comment. This
comment is describing *highly* implementation and platform-specific
details particular to this bug you are fixing today. They will be
irrelevant to anyone reading this code tomorrow.

So in the end, I buy that the CMR's have something to offer here. But I
think that "why" I mentioned above casts doubt on whether
for_each_mem_pfn_range() is the right primitive on which to build the
TDX memblocks in the first place.

I suspect there's a much simpler solution that will emerge when
considering a deeper fix as opposed to adding CMRs as a band-aid.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-09 22:54     ` Dave Hansen
@ 2024-12-10  2:26       ` Huang, Kai
  2024-12-10  2:46         ` Dan Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2024-12-10  2:26 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

On Mon, 2024-12-09 at 14:54 -0800, Dave Hansen wrote:
> On 12/8/24 22:50, Kai Huang wrote:
> > A TDX module initialization failure was reported on an Emerald Rapids
> > platform [*]:
> > 
> >   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> >   virt/tdx: module initialization failed (-28)
> > 
> > The kernel informs the TDX module of "TDX-usable memory regions" via the
> > structure "TD Memory Region" (TDMR).  Each TDMR contains a limited
> > number of "reserved areas" to inform the TDX module of the regions that
> > cannot be used by TDX.
> > 
> > The kernel builds the list of "TDX-usable memory regions" from memblock
> > (which reflects e820) and marks all memory holes as "reserved areas" in
> > TDMRs.  It turns out on some large systems the holes in memblock can be
> > too fine-grained [1] and exceed the number of reserved areas that the
> > module can track per TDMR, resulting in the failure mentioned above.
> > 
> > The TDX module also reports TDX-capable memory as "Convertible Memory
> > Regions" (CMRs).  CMRs tend to be coarser-grained [2] than the e820.
> > Use CMRs to find memory holes when populating reserved areas to reduce
> > their consumption.
> > 
> > Note the kernel does not prevent non-CMR memory from being added to
> > "TDX-usable memory regions" but depends on the TDX module to catch in
> > the TDH.SYS.CONFIG.  After switching to using CMRs to populate reserved
> > areas this will no longer work.  To ensure no non-CMR memory is included
> > in the TDMRs, verify that the memory region is truly TDX convertible
> > before adding it as a TDX-usable memory region at early stage.
> 
> Thanks for trimming the changelog down.  But this changelog never
> actually says what the fix is. It's also quite heavy on the "what" and
> very light on the "why".
> 
> I think the "why" boils down to the fact that the kernel is treating RAM
> -- as defined by the platform and TDX module -- as non-RAM.

Yes.

> 
> > -	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
> > +	/*
> > +	 * On some large systems, the TDX memory blocks (which reflects
> > +	 * e820) in the first 1GB can be too fine-grained.  Using them
> > +	 * to populate reserved areas may result in reserved areas being
> > +	 * exhausted.  CMRs are coarser-grained than e820.  Use CMRs to
> > +	 * populate reserved areas to reduce their consumption.
> > +	 */
> 
> I think there are still too many details here for a comment. This
> comment is describing *highly* implementation and platform-specific
> details particular to this bug you are fixing today. They will be
> irrelevant to anyone reading this code tomorrow.
> 
> So in the end, I buy that the CMR's have something to offer here. But I
> think that "why" I mentioned above casts doubt on whether
> for_each_mem_pfn_range() is the right primitive on which to build the
> TDX memblocks in the first place.

We can change to just use CMRs as TDX memory blocks, i.e., always cover all CMRs
in TDMRs, but this will have much wider impact.

The main concern is the PAMT allocation: PAMT is allocated from page allocator,
but the CMRs -- the RAM as defined by the platform and the TDX module - - can
cover more, and sometimes much more, regions than the regions end up to the page
allocator.

E.g., today we can use 'memmap=' to reserve part of memory for other purpose. 
And in the future CMRs may cover CXL memory regions which could be much larger
IIUC.

If we change to cover CMRs in TDMRs, we could end up with a much larger TDMR
ranges.  In this case we may end up with wasting PAMTs (e.g., if the admin wants
to use CXL for other non-TDX purpose), increasing the failure rate, or a
complete failure of PAMT allocation.

> I suspect there's a much simpler solution that will emerge when
> considering a deeper fix as opposed to adding CMRs as a band-aid.

I don't have an immediate solution other than using CMRs to fill up reserved
areas.  I will think more.

Perhaps we can try to split the TDMR to make it cover less reserved areas.  But
this won't work when the TDMR is already 1GB.  And this will result in new cases
where "one TDX memory block can end up to multiple TDMRs" etc.  To me it's over-
complicated and is not as good as using the CMR.  (This is listed as an
alternative in the initial changelog but was removed to trim down the log).

Btw, Rick is concerning with the overall KVM TDX upstream because this series is
a dependency to the rest KVM TDX patches.  Technically this bugfix is not
related to the KVM TDX support.  We are going to look at dropping the CMR staff
for now (the code which reads CMRs in patch 3, and patch 7-8), as the TDX KVM
patches can live without it for initial support.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-10  2:26       ` Huang, Kai
@ 2024-12-10  2:46         ` Dan Williams
  2024-12-10  4:24           ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Dan Williams @ 2024-12-10  2:46 UTC (permalink / raw)
  To: Huang, Kai, Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

Huang, Kai wrote:
[..]
> > So in the end, I buy that the CMR's have something to offer here. But I
> > think that "why" I mentioned above casts doubt on whether
> > for_each_mem_pfn_range() is the right primitive on which to build the
> > TDX memblocks in the first place.
> 
> We can change to just use CMRs as TDX memory blocks, i.e., always cover all CMRs
> in TDMRs, but this will have much wider impact.
> 
> The main concern is the PAMT allocation: PAMT is allocated from page allocator,
> but the CMRs -- the RAM as defined by the platform and the TDX module - - can
> cover more, and sometimes much more, regions than the regions end up to the page
> allocator.
> 
> E.g., today we can use 'memmap=' to reserve part of memory for other purpose. 
> And in the future CMRs may cover CXL memory regions which could be much larger
> IIUC.

Please do not point to memmap= as a reason to complicate the TDX
initialization. memmap= is a debug / expert feature where the user gets
to keep all the pieces if they get it wrong.

Please do not point to theoretical CXL futures as a reason not to do the
right thing in TDX initialization. The CXL TEE Security Protocol makes
CXL memory indistinguishable from DDR. It is layering violation for TDX
module initialization to add complexity and policy assumptions as if it
knows better than the published CMRs what memory resources are available
in the platform.

Please drop my reviewed-by on this patch until we have a solution and a
simple story for the recently discovered problems that CMR enumeration
solves. This includes reserve-area population and disambiguating
reserve-area enumeration from late-to-online memory resources.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-10  2:46         ` Dan Williams
@ 2024-12-10  4:24           ` Huang, Kai
  2024-12-10 16:58             ` Dave Hansen
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2024-12-10  4:24 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	Williams, Dan J, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, tglx@linutronix.de
  Cc: Yamahata, Isaku, nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org

On Mon, 2024-12-09 at 18:46 -0800, Dan Williams wrote:
> Huang, Kai wrote:
> [..]
> > > So in the end, I buy that the CMR's have something to offer here. But I
> > > think that "why" I mentioned above casts doubt on whether
> > > for_each_mem_pfn_range() is the right primitive on which to build the
> > > TDX memblocks in the first place.
> > 
> > We can change to just use CMRs as TDX memory blocks, i.e., always cover all CMRs
> > in TDMRs, but this will have much wider impact.
> > 
> > The main concern is the PAMT allocation: PAMT is allocated from page allocator,
> > but the CMRs -- the RAM as defined by the platform and the TDX module - - can
> > cover more, and sometimes much more, regions than the regions end up to the page
> > allocator.
> > 
> > E.g., today we can use 'memmap=' to reserve part of memory for other purpose. 
> > And in the future CMRs may cover CXL memory regions which could be much larger
> > IIUC.
> 
> Please do not point to memmap= as a reason to complicate the TDX
> initialization. memmap= is a debug / expert feature where the user gets
> to keep all the pieces if they get it wrong.
> 
> Please do not point to theoretical CXL futures as a reason not to do the
> right thing in TDX initialization. The CXL TEE Security Protocol makes
> CXL memory indistinguishable from DDR. It is layering violation for TDX
> module initialization to add complexity and policy assumptions as if it
> knows better than the published CMRs what memory resources are available
> in the platform.

OK.  Thanks for all the feedback.

> 
> Please drop my reviewed-by on this patch until we have a solution and a
> simple story for the recently discovered problems that CMR enumeration
> solves. This includes reserve-area population and disambiguating
> reserve-area enumeration from late-to-online memory resources.

OK.  I would like to get the solution agreed.  IIUC you prefer to just using
CMRs to build TDX-usable memory regions when constructing TDMRs.  Please let me
know if I have misunderstanding.

Hi Dave,

Please let me know if you are OK with this solution?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-10  4:24           ` Huang, Kai
@ 2024-12-10 16:58             ` Dave Hansen
  2024-12-11  4:34               ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2024-12-10 16:58 UTC (permalink / raw)
  To: Huang, Kai, seanjc@google.com, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, Williams, Dan J,
	kirill.shutemov@linux.intel.com, pbonzini@redhat.com,
	tglx@linutronix.de
  Cc: Yamahata, Isaku, nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org

On 12/9/24 20:24, Huang, Kai wrote:
> OK.  I would like to get the solution agreed.  IIUC you prefer to just using
> CMRs to build TDX-usable memory regions when constructing TDMRs.  Please let me
> know if I have misunderstanding.
> 
> Please let me know if you are OK with this solution?

I honestly don't know.

What I would prefer is that you take another hard look at the code and
deeply consider whether there's a simpler solution. My gut says that the
8.2/9 patch is too much of a band-aid and adds too much complexity. I'm
yearning for something simpler.

I'm not going to know whether I'm OK with the solution until the patch
lands in my inbox. What I do know is that I'd really appreciate if you
could _explore_ a simpler solution.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 8.2/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes
  2024-12-10 16:58             ` Dave Hansen
@ 2024-12-11  4:34               ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-12-11  4:34 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	pbonzini@redhat.com, tglx@linutronix.de
  Cc: Edgecombe, Rick P, nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Yamahata, Isaku, x86@kernel.org

On Tue, 2024-12-10 at 08:58 -0800, Dave Hansen wrote:
> On 12/9/24 20:24, Huang, Kai wrote:
> > OK.  I would like to get the solution agreed.  IIUC you prefer to just using
> > CMRs to build TDX-usable memory regions when constructing TDMRs.  Please let me
> > know if I have misunderstanding.
> > 
> > Please let me know if you are OK with this solution?
> 
> I honestly don't know.
> 
> What I would prefer is that you take another hard look at the code and
> deeply consider whether there's a simpler solution. My gut says that the
> 8.2/9 patch is too much of a band-aid and adds too much complexity. I'm
> yearning for something simpler.
> 
> I'm not going to know whether I'm OK with the solution until the patch
> lands in my inbox. What I do know is that I'd really appreciate if you
> could _explore_ a simpler solution.

Yeah that's the plan.  I don't have any simpler solution that doesn't use CMRs
in my mind now but will explore more.  Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata
  2024-11-13 11:57 ` [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata Kai Huang
@ 2024-12-13 11:17   ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2024-12-13 11:17 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	pbonzini@redhat.com, Williams, Dan J
  Cc: nik.borisov@suse.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Edgecombe, Rick P, x86@kernel.org,
	Yamahata, Isaku

[-- Attachment #1: Type: text/plain, Size: 900 bytes --]

On Thu, 2024-11-14 at 00:57 +1300, Kai Huang wrote:
> The tdx_global_metadata.{hc} can be generated by running below:
> 
>  #python tdx_global_metadata.py global_metadata.json \
> 	tdx_global_metadata.h tdx_global_metadata.c
> 
> .. where the 'tdx_global_metadata.py' can be found in [4] and the
> 'global_metadata.json' can be fetched from [3].
> 

[...]

> Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [3]
> Link: https://lore.kernel.org/d5aed06ae4b46df5db97fdbac9c01843920a2f96.camel@intel.com/ [4]

Hi Dave,

I'll remove the bugfix (patch 8) and the CMR reading code but only keep what is
required for KVM TDX in the next version.

I updated the script and attached here so that I can have a lore link of the
script which can be used to reproduce the generated code in the next version.

It only has CMR part removed thus is not interesting to read.  Thanks.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: tdx_global_metadata.py --]
[-- Type: text/x-python3; name="tdx_global_metadata.py", Size: 6106 bytes --]

#! /usr/bin/env python3
import json
import sys

# Note: this script does not run as part of the build process.
# It is used to generate structs from the TDX global_metadata.json
# file, and functions to fill in said structs.  Rerun it if
# you need more fields.

TDX_STRUCTS = {
    "features": [
        "TDX_FEATURES0"
    ],
    "tdmr": [
        "MAX_TDMRS",
        "MAX_RESERVED_PER_TDMR",
        "PAMT_4K_ENTRY_SIZE",
        "PAMT_2M_ENTRY_SIZE",
        "PAMT_1G_ENTRY_SIZE",
    ],
}

STRUCT_PREFIX = "tdx_sys_info"
FUNC_PREFIX = "get_tdx_sys_info"
STRVAR_PREFIX = "sysinfo"

def print_class_struct_field(field_name, element_bytes, num_fields, num_elements, file):
    element_type = "u%s" % (element_bytes * 8)
    element_array = ""
    if num_fields > 1:
        element_array += "[%d]" % (num_fields)
    if num_elements > 1:
        element_array += "[%d]" % (num_elements)
    print("\t%s %s%s;" % (element_type, field_name, element_array), file=file)

def print_class_struct(class_name, fields, file):
    struct_name = "%s_%s" % (STRUCT_PREFIX, class_name)
    print("struct %s {" % (struct_name), file=file)
    for f in fields:
        print_class_struct_field(
            f["Field Name"].lower(),
            int(f["Element Size (Bytes)"]),
            int(f["Num Fields"]),
            int(f["Num Elements"]),
            file=file)
    print("};", file=file)

def print_read_field(field_id, struct_var, struct_member, indent, file):
    print(
        "%sif (!ret && !(ret = read_sys_metadata_field(%s, &val)))\n%s\t%s->%s = val;"
        % (indent, field_id, indent, struct_var, struct_member),
        file=file,
    )

def print_class_function(class_name, fields, file):
    func_name = "%s_%s" % (FUNC_PREFIX, class_name)
    struct_name = "%s_%s" % (STRUCT_PREFIX, class_name)
    struct_var = "%s_%s" % (STRVAR_PREFIX, class_name)

    print("static int %s(struct %s *%s)" % (func_name, struct_name, struct_var), file=file)
    print("{", file=file)
    print("\tint ret = 0;", file=file)
    print("\tu64 val;", file=file)

    has_i = 0
    has_j = 0
    for f in fields:
        num_fields = int(f["Num Fields"])
        num_elements = int(f["Num Elements"])
        if num_fields > 1:
            has_i = 1
        if num_elements > 1:
            has_j = 1

    if has_i == 1 and has_j == 1:
        print("\tint i, j;", file=file)
    elif has_i == 1:
        print("\tint i;", file=file)

    print(file=file)
    for f in fields:
        fname = f["Field Name"]
        field_id = f["Base FIELD_ID (Hex)"]
        num_fields = int(f["Num Fields"])
        num_elements = int(f["Num Elements"])
        struct_member = fname.lower()
        indent = "\t"
        if num_fields > 1:
            if fname == "CMR_BASE" or fname == "CMR_SIZE":
                limit = "%s_%s->num_cmrs" %(STRVAR_PREFIX, "cmr")
            elif fname == "CPUID_CONFIG_LEAVES" or fname == "CPUID_CONFIG_VALUES":
                limit = "%s_%s->num_cpuid_config" %(STRVAR_PREFIX, "td_conf")
            else:
                limit = "%d" %(num_fields)
            print("%sfor (i = 0; i < %s; i++)" % (indent, limit), file=file)
            indent += "\t"
            field_id += " + i"
            struct_member += "[i]"
        if num_elements > 1:
            print("%sfor (j = 0; j < %d; j++)" % (indent, num_elements), file=file)
            indent += "\t"
            field_id += " * %d + j" % (num_elements)
            struct_member += "[j]"

        print_read_field(
            field_id,
            struct_var,
            struct_member,
            indent,
            file=file,
        )

    print(file=file)
    print("\treturn ret;", file=file)
    print("}", file=file)

def print_main_struct(file):
    print("struct tdx_sys_info {", file=file)
    for class_name, field_names in TDX_STRUCTS.items():
        struct_name = "%s_%s" % (STRUCT_PREFIX, class_name)
        struct_var = class_name
        print("\tstruct %s %s;" % (struct_name, struct_var), file=file)
    print("};", file=file)

def print_main_function(file):
    print("static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)", file=file)
    print("{", file=file)
    print("\tint ret = 0;", file=file)
    print(file=file)
    for class_name, field_names in TDX_STRUCTS.items():
        func_name = "%s_%s" % (FUNC_PREFIX, class_name)
        struct_var = class_name
        print("\tret = ret ?: %s(&sysinfo->%s);" % (func_name, struct_var), file=file)
    print(file=file)
    print("\treturn ret;", file=file)
    print("}", file=file)

jsonfile = sys.argv[1]
hfile = sys.argv[2]
cfile = sys.argv[3]
hfileifdef = hfile.replace(".", "_")

with open(jsonfile, "r") as f:
    json_in = json.load(f)
    fields = {x["Field Name"]: x for x in json_in["Fields"]}

with open(hfile, "w") as f:
    print("/* SPDX-License-Identifier: GPL-2.0 */", file=f)
    print("/* Automatically generated TDX global metadata structures. */", file=f)
    print("#ifndef _X86_VIRT_TDX_AUTO_GENERATED_" + hfileifdef.upper(), file=f)
    print("#define _X86_VIRT_TDX_AUTO_GENERATED_" + hfileifdef.upper(), file=f)
    print(file=f)
    print("#include <linux/types.h>", file=f)
    print(file=f)
    for class_name, field_names in TDX_STRUCTS.items():
        print_class_struct(class_name, [fields[x] for x in field_names], file=f)
        print(file=f)
    print_main_struct(file=f)
    print(file=f)
    print("#endif", file=f)

with open(cfile, "w") as f:
    print("// SPDX-License-Identifier: GPL-2.0", file=f)
    print("/*", file=f)
    print(" * Automatically generated functions to read TDX global metadata.", file=f)
    print(" *", file=f)
    print(" * This file doesn't compile on its own as it lacks of inclusion", file=f)
    print(" * of SEAMCALL wrapper primitive which reads global metadata.", file=f)
    print(" * Include this file to other C file instead.", file=f)
    print(" */", file=f)
    for class_name, field_names in TDX_STRUCTS.items():
        print(file=f)
        print_class_function(class_name, [fields[x] for x in field_names], file=f)
    print(file=f)
    print_main_function(file=f)

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-12-13 11:18 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13 11:57 [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Kai Huang
2024-11-13 11:57 ` [PATCH v8 1/9] x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better Kai Huang
2024-11-13 11:57 ` [PATCH v8 2/9] x86/virt/tdx: Start to track all global metadata in one structure Kai Huang
2024-11-13 11:57 ` [PATCH v8 3/9] x86/virt/tdx: Use auto-generated code to read global metadata Kai Huang
2024-12-13 11:17   ` Huang, Kai
2024-11-13 11:57 ` [PATCH v8 4/9] x86/virt/tdx: Use dedicated struct members for PAMT entry sizes Kai Huang
2024-11-13 11:57 ` [PATCH v8 5/9] x86/virt/tdx: Add missing header file inclusion to local tdx.h Kai Huang
2024-11-13 11:57 ` [PATCH v8 6/9] x86/virt/tdx: Switch to use auto-generated global metadata reading code Kai Huang
2024-11-13 11:57 ` [PATCH v8 7/9] x86/virt/tdx: Trim away tail null CMRs Kai Huang
2024-11-13 11:57 ` [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes Kai Huang
2024-12-04 14:22   ` Huang, Kai
2024-12-05 12:45     ` Huang, Kai
2024-12-05 12:40   ` [PATCH v8 8.1/9] " Kai Huang
2024-12-05 18:10     ` Dave Hansen
2024-12-06  2:45       ` Huang, Kai
2024-12-09  6:57       ` Huang, Kai
2024-12-09  6:50   ` [PATCH v8 8.2/9] " Kai Huang
2024-12-09 22:54     ` Dave Hansen
2024-12-10  2:26       ` Huang, Kai
2024-12-10  2:46         ` Dan Williams
2024-12-10  4:24           ` Huang, Kai
2024-12-10 16:58             ` Dave Hansen
2024-12-11  4:34               ` Huang, Kai
2024-11-13 11:57 ` [PATCH v8 9/9] x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation Kai Huang
2024-11-13 22:25 ` [PATCH v8 0/9] TDX host: metadata reading tweaks and bug fixes Edgecombe, Rick P
2024-11-13 22:40   ` Huang, Kai
2024-11-13 22:53     ` Edgecombe, Rick P
2024-11-13 23:35       ` Huang, Kai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox