[PATCH v2 00/25] TDX vCPU/VM creation

public inbox for linux-kernel@vger.kernel.org

* [PATCH v2 00/25] TDX vCPU/VM creation
@ 2024-10-30 19:00 Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 01/25] x86/virt/tdx: Share the global metadata structure for KVM to use Rick Edgecombe
                   ` (27 more replies)
  0 siblings, 28 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

Hi,

Here is v2 of the TDX VM/vCPU creation series. As discussed earlier, non-nits
from v1[0] have been applied and it’s ready to hand off to Paolo. A few
items remain that may be worth further discussion:
 - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t
   been tested.
 - The retry loop around tdh_phymem_page_reclaim() in “KVM: TDX:
   create/destroy VM structure” likely can be dropped.
 - Drop support for TDX modules that don’t support
   MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM. [1]
 - Type-safety in to_vmx()/to_tdx(). [2]

This series has 9 commits intended to collect acks from x86 maintainers. 
The first group of commits is for reading TDX static metadata for KVM to 
use:
	x86/virt/tdx: Share the global metadata structure for KVM to use
	x86/virt/tdx: Read essential global metadata for KVM

The second group is for exporting a TDX keyid allocator for KVM to use:
	x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free
	  TDX guest KeyID

The third group is for exporting various SEAMCALLs needed by KVM for this
series. SEAMCALL patches for the later TDX sections will come with those
series:
	x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management
	x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
	x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation
	x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
	x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
	x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations

This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. d659088d46df "KVM: x86/mmu: Prevent aliased memslot GFNs" (in
   kvm-coco-queue).
2. v6 of “TDX host: metadata reading tweaks, bug fix and info dump”
3. “KVM: VMX: Initialize TDX when loading KVM module” re-ordered from 
   the commits in kvm-coco-queue

It requires TDX module 1.5.06.00.0744[3], or later. This is due to removal
of the workarounds for the lack of NO_RBP_MOD. Now NO_RBP_MOD is enabled,
and this particular version of the TDX module has a NO_RBP_MOD related bug
fix.

The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-10-30

Matching QEMU:
https://github.com/intel-staging/qemu-tdx/tree/tdx-qemu-wip-2024-10-11

[0] https://lore.kernel.org/kvm/20240812224820.34826-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/d71540ab13e728d1326baae92e8ea82d00c08abe.camel@intel.com/
[2] https://lore.kernel.org/kvm/89657f96-0ed1-4543-9074-f13f62cc4694@redhat.com/
[3] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06

Isaku Yamahata (19):
  x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX
    guest KeyID
  x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management
  x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
  x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation
  x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
  x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  KVM: TDX: Add placeholders for TDX VM/vCPU structures
  KVM: TDX: Define TDX architectural definitions
  KVM: TDX: Add helper functions to print TDX SEAMCALL error
  KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  KVM: TDX: Get system-wide info about TDX module on initialization
  KVM: TDX: create/destroy VM structure
  KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check
  KVM: TDX: initialize VM with TDX specific parameters
  KVM: TDX: Make pmu_intel.c ignore guest TD case
  KVM: TDX: Don't offline the last cpu of one package when there's TDX
    guest
  KVM: TDX: create/free TDX vcpu structure
  KVM: TDX: Do TDX specific vcpu initialization

Kai Huang (3):
  x86/virt/tdx: Share the global metadata structure for KVM to use
  KVM: TDX: Get TDX global information
  x86/virt/tdx: Read essential global metadata for KVM

Sean Christopherson (1):
  KVM: TDX: Add TDX "architectural" error codes

Xiaoyao Li (2):
  KVM: x86: Introduce KVM_TDX_GET_CPUID
  KVM: x86/mmu: Taking guest pa into consideration when calculate tdp
    level

 arch/x86/include/asm/kvm-x86-ops.h            |    4 +-
 arch/x86/include/asm/kvm_host.h               |    2 +
 arch/x86/include/asm/shared/tdx.h             |    7 +-
 arch/x86/include/asm/tdx.h                    |   25 +
 .../tdx => include/asm}/tdx_global_metadata.h |   19 +
 arch/x86/include/uapi/asm/kvm.h               |   59 +
 arch/x86/kvm/Kconfig                          |    2 +
 arch/x86/kvm/cpuid.c                          |   21 +
 arch/x86/kvm/cpuid.h                          |    3 +
 arch/x86/kvm/mmu/mmu.c                        |    9 +-
 arch/x86/kvm/vmx/main.c                       |  143 +-
 arch/x86/kvm/vmx/pmu_intel.c                  |   50 +-
 arch/x86/kvm/vmx/pmu_intel.h                  |   28 +
 arch/x86/kvm/vmx/tdx.c                        | 1369 ++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h                        |   92 ++
 arch/x86/kvm/vmx/tdx_arch.h                   |  165 ++
 arch/x86/kvm/vmx/tdx_errno.h                  |   37 +
 arch/x86/kvm/vmx/vmx.h                        |   34 +-
 arch/x86/kvm/vmx/x86_ops.h                    |   24 +
 arch/x86/kvm/x86.c                            |   15 +-
 arch/x86/virt/vmx/tdx/tdx.c                   |  264 +++-
 arch/x86/virt/vmx/tdx/tdx.h                   |   39 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c   |   46 +
 23 files changed, 2391 insertions(+), 66 deletions(-)
 rename arch/x86/{virt/vmx/tdx => include/asm}/tdx_global_metadata.h (68%)
 create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

-- 
2.47.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 01/25] x86/virt/tdx: Share the global metadata structure for KVM to use
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 02/25] KVM: TDX: Get TDX global information Rick Edgecombe
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

From: Kai Huang <kai.huang@intel.com>

The TDX host tracks all global metadata fields in 'struct tdx_sys_info'.
For now they are only used by module initialization and are not shared
to other kernel components.

Future changes to support KVM TDX will need to read more global metadata
fields, e.g., those in "TD Control Structures" and "TD Configurability".
In the longer term, other TDX features like TDX Connect (which supports
assigning trusted devices to TDX guests) will also require other kernel
components such as pci/vt-d to access global metadata.

To meet all those requirements, the idea is for the TDX host core-kernel
to provide a centralized, canonical, and read-only structure for the
global metadata that comes out of the TDX module for all kernel
components to use.

To achieve "read-only", the ideal way is to annotate the whole structure
with __ro_after_init.  However, currently all global metadata fields are
read by tdx_enable(), which could be called at any time at runtime and
thus isn't annotated with __init.

The __ro_after_init annotation can be added eventually, but only
after moving VMXON out of KVM to the core-kernel: after that we can
read all metadata during kernel boot (thus __ro_after_init), and
don't necessarily have to do it in tdx_enable().

For now, add a helper function to return a 'const struct tdx_sys_info *'
and export it for KVM to use.

Note, KVM doesn't need to access all global metadata for TDX, thus
exporting the entire 'struct tdx_sys_info' is overkill.  Another option
is to export sub-structures on demand, but this would result in more
exports.  Given that the export is done via a const pointer, so other
in-kernel TDX users won't be able to write to global metadata, simply
export all global metadata fields in one function.

The auto-generated 'tdx_global_metadata.h' contains declarations of
'struct tdx_sys_info' and its sub-structures.  Move it to
arch/x86/include/asm/ and include it in <asm/tdx.h> to expose those
structures.

Include 'tdx_global_metadata.h' inside the '#ifndef __ASSEMBLY__' block,
since otherwise there will be a build warning because <asm/tdx.h> is
also included by assembly.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - New patch
---
 arch/x86/include/asm/tdx.h                    |  3 ++
 .../tdx => include/asm}/tdx_global_metadata.h |  0
 arch/x86/virt/vmx/tdx/tdx.c                   | 28 +++++++++++++++----
 arch/x86/virt/vmx/tdx/tdx.h                   |  1 -
 4 files changed, 25 insertions(+), 7 deletions(-)
 rename arch/x86/{virt/vmx/tdx => include/asm}/tdx_global_metadata.h (100%)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..b9758369d82c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -33,6 +33,7 @@
 #ifndef __ASSEMBLY__

 #include <uapi/asm/mce.h>
+#include "tdx_global_metadata.h"

 /*
  * Used by the #VE exception handler to gather the #VE exception
@@ -116,11 +117,13 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
+const struct tdx_sys_info *tdx_get_sysinfo(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */

 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
similarity index 100%
rename from arch/x86/virt/vmx/tdx/tdx_global_metadata.h
rename to arch/x86/include/asm/tdx_global_metadata.h
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6982e100536d..7589c75eaa6c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -52,6 +52,8 @@ static DEFINE_MUTEX(tdx_module_lock);
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);

+static struct tdx_sys_info tdx_sysinfo;
+
 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

 static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -1132,15 +1134,14 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)

 static int init_tdx_module(void)
 {
-	struct tdx_sys_info sysinfo;
 	int ret;

-	ret = init_tdx_sys_info(&sysinfo);
+	ret = init_tdx_sys_info(&tdx_sysinfo);
 	if (ret)
 		return ret;

 	/* Check whether the kernel can support this module */
-	ret = check_features(&sysinfo);
+	ret = check_features(&tdx_sysinfo);
 	if (ret)
 		return ret;

@@ -1161,13 +1162,14 @@ static int init_tdx_module(void)
 		goto out_put_tdxmem;

 	/* Allocate enough space for constructing TDMRs */
-	ret = alloc_tdmr_list(&tdx_tdmr_list, &sysinfo.tdmr);
+	ret = alloc_tdmr_list(&tdx_tdmr_list, &tdx_sysinfo.tdmr);
 	if (ret)
 		goto err_free_tdxmem;

 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
-			&sysinfo.cmr);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list,
+			      &tdx_sysinfo.tdmr, &tdx_sysinfo.cmr);
+
 	if (ret)
 		goto err_free_tdmrs;

@@ -1529,3 +1531,17 @@ void __init tdx_init(void)

 	check_tdx_erratum();
 }
+
+const struct tdx_sys_info *tdx_get_sysinfo(void)
+{
+	const struct tdx_sys_info *p = NULL;
+
+	/* Make sure all fields in @tdx_sysinfo have been populated */
+	mutex_lock(&tdx_module_lock);
+	if (tdx_module_status == TDX_MODULE_INITIALIZED)
+		p = (const struct tdx_sys_info *)&tdx_sysinfo;
+	mutex_unlock(&tdx_module_lock);
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index c8be00f6b15a..9b708a8fb568 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -6,7 +6,6 @@
 #include <linux/compiler_attributes.h>
 #include <linux/stddef.h>
 #include <linux/bits.h>
-#include "tdx_global_metadata.h"

 /*
  * This file contains both macros and data structures defined by the TDX
-- 
2.47.0
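
The read-only export described in the commit message above can be
sketched in plain userspace C.  This is only an illustration under
stated assumptions: 'struct sys_info', init_module_sketch() and
get_sysinfo_sketch() are hypothetical stand-ins, not the kernel's
definitions, and the sketch omits the tdx_module_lock that the real
tdx_get_sysinfo() takes around the status check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct tdx_sys_info'; the real layout differs. */
struct sys_info {
	unsigned short max_vcpus_per_td;
};

enum module_status { MODULE_UNINITIALIZED, MODULE_INITIALIZED };

/* File-local, so only this file can write to the structure. */
static struct sys_info sysinfo;
static enum module_status status = MODULE_UNINITIALIZED;

/* Stand-in for module initialization: populate first, then publish. */
void init_module_sketch(unsigned short max_vcpus)
{
	sysinfo.max_vcpus_per_td = max_vcpus;
	status = MODULE_INITIALIZED;
}

/* Read-only view for other components; NULL until the data is populated. */
const struct sys_info *get_sysinfo_sketch(void)
{
	if (status != MODULE_INITIALIZED)
		return NULL;
	return &sysinfo;
}
```

Callers only ever hold a 'const struct sys_info *', so an accidental
write through the pointer fails to compile, while the owning file keeps
the single writable copy.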


* [PATCH v2 02/25] KVM: TDX: Get TDX global information
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 01/25] x86/virt/tdx: Share the global metadata structure for KVM to use Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM Rick Edgecombe
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

From: Kai Huang <kai.huang@intel.com>

KVM will need to consult some essential TDX global information to create
and run TDX guests.  Get the global information after initializing TDX.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - New patch
---
 arch/x86/kvm/vmx/tdx.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8651599822d5..f95a4dbcaf4a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -12,6 +12,8 @@ module_param_named(tdx, enable_tdx, bool, 0444);
 
 static enum cpuhp_state tdx_cpuhp_state;
 
+static const struct tdx_sys_info *tdx_sysinfo;
+
 static int tdx_online_cpu(unsigned int cpu)
 {
 	unsigned long flags;
@@ -91,11 +93,20 @@ static int __init __tdx_bringup(void)
 	if (r)
 		goto tdx_bringup_err;
 
+	/* Get TDX global information for later use */
+	tdx_sysinfo = tdx_get_sysinfo();
+	if (WARN_ON_ONCE(!tdx_sysinfo)) {
+		r = -EINVAL;
+		goto get_sysinfo_err;
+	}
+
 	/*
 	 * Leave hardware virtualization enabled after TDX is enabled
 	 * successfully.  TDX CPU hotplug depends on this.
 	 */
 	return 0;
+get_sysinfo_err:
+	__do_tdx_cleanup();
 tdx_bringup_err:
 	kvm_disable_virtualization();
 	return r;
-- 
2.47.0
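
The error path added by this patch follows the usual goto-unwind
ordering: a failure after TDX initialization undoes TDX first and then
falls through to disabling virtualization.  A minimal userspace sketch
of that ordering follows; every function and variable in it is a
hypothetical stand-in, not KVM code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical state flags so the unwind order is observable. */
static int virt_enabled, tdx_ready;

static int enable_virtualization(void) { virt_enabled = 1; return 0; }
static void disable_virtualization(void) { virt_enabled = 0; }
static int do_tdx_init(void) { tdx_ready = 1; return 0; }
static void do_tdx_cleanup(void) { tdx_ready = 0; }

/* 'sysinfo' models the pointer returned by the metadata getter. */
int bringup_sketch(const void *sysinfo)
{
	int r;

	r = enable_virtualization();
	if (r)
		return r;

	r = do_tdx_init();
	if (r)
		goto tdx_bringup_err;

	if (!sysinfo) {
		r = -EINVAL;
		goto get_sysinfo_err;
	}

	/* Success: leave both steps in place. */
	return 0;

get_sysinfo_err:
	do_tdx_cleanup();		/* undo the later step first */
tdx_bringup_err:
	disable_virtualization();	/* then the earlier one */
	return r;
}
```

Each label undoes exactly one step and falls through to the labels for
earlier steps, so adding a new failure point only needs one new label
above the existing ones.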



* [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 01/25] x86/virt/tdx: Share the global metadata structure for KVM to use Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 02/25] KVM: TDX: Get TDX global information Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-12-06  8:37   ` Xiaoyao Li
  2024-12-21  1:07   ` [PATCH v2.1 " Kai Huang
  2024-10-30 19:00 ` [PATCH v2 04/25] x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID Rick Edgecombe
                   ` (24 subsequent siblings)
  27 siblings, 2 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

From: Kai Huang <kai.huang@intel.com>

KVM needs two classes of global metadata to create and run TDX guests:

 - "TD Control Structures"
 - "TD Configurability"

The first class contains the sizes of TDX guest per-VM and per-vCPU
control structures.  KVM will need to use them to allocate enough space
for those control structures.

The second class reports things such as which features are configurable
for TDX guests.  KVM will need to use them to properly configure TDX
guests.

Read them for KVM TDX to use.

The code change is auto-generated by re-running the script in [1] after
uncommenting the "td_conf" and "td_ctrl" parts to regenerate
tdx_global_metadata.{hc}, and then updating the existing files in the
kernel with the result.

  #python tdx.py global_metadata.json tdx_global_metadata.h \
	tdx_global_metadata.c

The 'global_metadata.json' can be fetched from [2].

Link: https://lore.kernel.org/kvm/0853b155ec9aac09c594caa60914ed6ea4dc0a71.camel@intel.com/ [1]
Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - New patch
---
 arch/x86/include/asm/tdx_global_metadata.h  | 19 +++++++++
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 46 +++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
index fde370b855f1..206090c9952f 100644
--- a/arch/x86/include/asm/tdx_global_metadata.h
+++ b/arch/x86/include/asm/tdx_global_metadata.h
@@ -32,11 +32,30 @@ struct tdx_sys_info_cmr {
 	u64 cmr_size[32];
 };
 
+struct tdx_sys_info_td_ctrl {
+	u16 tdr_base_size;
+	u16 tdcs_base_size;
+	u16 tdvps_base_size;
+};
+
+struct tdx_sys_info_td_conf {
+	u64 attributes_fixed0;
+	u64 attributes_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+	u16 num_cpuid_config;
+	u16 max_vcpus_per_td;
+	u64 cpuid_config_leaves[32];
+	u64 cpuid_config_values[32][2];
+};
+
 struct tdx_sys_info {
 	struct tdx_sys_info_version version;
 	struct tdx_sys_info_features features;
 	struct tdx_sys_info_tdmr tdmr;
 	struct tdx_sys_info_cmr cmr;
+	struct tdx_sys_info_td_ctrl td_ctrl;
+	struct tdx_sys_info_td_conf td_conf;
 };
 
 #endif
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index 2fe57e084453..44c2b3e079de 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -76,6 +76,50 @@ static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
 	return ret;
 }
 
+static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl)
+{
+	int ret = 0;
+	u64 val;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000000, &val)))
+		sysinfo_td_ctrl->tdr_base_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000100, &val)))
+		sysinfo_td_ctrl->tdcs_base_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000200, &val)))
+		sysinfo_td_ctrl->tdvps_base_size = val;
+
+	return ret;
+}
+
+static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf)
+{
+	int ret = 0;
+	u64 val;
+	int i, j;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val)))
+		sysinfo_td_conf->attributes_fixed0 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val)))
+		sysinfo_td_conf->attributes_fixed1 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000002, &val)))
+		sysinfo_td_conf->xfam_fixed0 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000003, &val)))
+		sysinfo_td_conf->xfam_fixed1 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000004, &val)))
+		sysinfo_td_conf->num_cpuid_config = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000008, &val)))
+		sysinfo_td_conf->max_vcpus_per_td = val;
+	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
+		if (!ret && !(ret = read_sys_metadata_field(0x9900000300000400 + i, &val)))
+			sysinfo_td_conf->cpuid_config_leaves[i] = val;
+	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
+		for (j = 0; j < 2; j++)
+			if (!ret && !(ret = read_sys_metadata_field(0x9900000300000500 + i * 2 + j, &val)))
+				sysinfo_td_conf->cpuid_config_values[i][j] = val;
+
+	return ret;
+}
+
 static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
 {
 	int ret = 0;
@@ -84,6 +128,8 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
 	ret = ret ?: get_tdx_sys_info_features(&sysinfo->features);
 	ret = ret ?: get_tdx_sys_info_tdmr(&sysinfo->tdmr);
 	ret = ret ?: get_tdx_sys_info_cmr(&sysinfo->cmr);
+	ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
+	ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
 
 	return ret;
 }
-- 
2.47.0
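
The generated readers above rely on two "first error wins" idioms:
'if (!ret && !(ret = read(...)))' inside each getter, and
'ret = ret ?: getter()' when chaining getters.  The userspace sketch
below shows how they behave.  The field IDs, the table and all names
are made up for illustration, and 'a ?: b' is a GCC/Clang extension
rather than standard C.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata store standing in for the TDX module. */
struct field { uint64_t id; uint64_t val; };
static const struct field fields[] = {
	{ 0x10, 4096 },
	{ 0x11, 8192 },
};

/* Stand-in for read_sys_metadata_field(): 0 on success, -EIO if unknown. */
static int read_field(uint64_t id, uint64_t *val)
{
	for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
		if (fields[i].id == id) {
			*val = fields[i].val;
			return 0;
		}
	}
	return -EIO;
}

struct ctrl_info { uint16_t a_size; uint16_t b_size; };
struct conf_info { uint64_t fixed0; };
struct info { struct ctrl_info ctrl; struct conf_info conf; };

static int get_ctrl(struct ctrl_info *c)
{
	int ret = 0;
	uint64_t val;

	/* Each read runs only if every earlier read succeeded. */
	if (!ret && !(ret = read_field(0x10, &val)))
		c->a_size = val;
	if (!ret && !(ret = read_field(0x11, &val)))
		c->b_size = val;
	return ret;
}

static int get_conf(struct conf_info *c)
{
	int ret = 0;
	uint64_t val;

	/* Field 0x99 is not in the store, so this read fails. */
	if (!ret && !(ret = read_field(0x99, &val)))
		c->fixed0 = val;
	return ret;
}

int get_info(struct info *info)
{
	int ret = 0;

	/* "ret ?: f()" keeps the first nonzero error and skips later getters. */
	ret = ret ?: get_ctrl(&info->ctrl);
	ret = ret ?: get_conf(&info->conf);
	return ret;
}
```

Because every statement has the same shape, a generator script can emit
one line per field without tracking control flow, which is presumably
why the auto-generated code looks this way.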



* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-10-30 19:00 ` [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM Rick Edgecombe
@ 2024-12-06  8:37   ` Xiaoyao Li
  2024-12-06 16:13     ` Huang, Kai
  2024-12-21  1:07   ` [PATCH v2.1 " Kai Huang
  1 sibling, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2024-12-06  8:37 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, reinette.chatre

On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> KVM needs two classes of global metadata to create and run TDX guests:
> 
>   - "TD Control Structures"
>   - "TD Configurability"
> 
> The first class contains the sizes of TDX guest per-VM and per-vCPU
> control structures.  KVM will need to use them to allocate enough space
> for those control structures.
> 
> The second class contains info which reports things like which features
> are configurable to TDX guest etc.  KVM will need to use them to
> properly configure TDX guests.
> 
> Read them for KVM TDX to use.
> 
> The code change is auto-generated by re-running the script in [1] after
> uncommenting the "td_conf" and "td_ctrl" part to regenerate the
> tdx_global_metadata.{hc} and update them to the existing ones in the
> kernel.
> 
>    #python tdx.py global_metadata.json tdx_global_metadata.h \
> 	tdx_global_metadata.c
> 
> The 'global_metadata.json' can be fetched from [2].
> 
> Link: https://lore.kernel.org/kvm/0853b155ec9aac09c594caa60914ed6ea4dc0a71.camel@intel.com/ [1]
> Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2]
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> uAPI breakout v2:
>   - New patch
> ---
>   arch/x86/include/asm/tdx_global_metadata.h  | 19 +++++++++
>   arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 46 +++++++++++++++++++++
>   2 files changed, 65 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
> index fde370b855f1..206090c9952f 100644
> --- a/arch/x86/include/asm/tdx_global_metadata.h
> +++ b/arch/x86/include/asm/tdx_global_metadata.h
> @@ -32,11 +32,30 @@ struct tdx_sys_info_cmr {
>   	u64 cmr_size[32];
>   };
>   
> +struct tdx_sys_info_td_ctrl {
> +	u16 tdr_base_size;
> +	u16 tdcs_base_size;
> +	u16 tdvps_base_size;
> +};
> +
> +struct tdx_sys_info_td_conf {
> +	u64 attributes_fixed0;
> +	u64 attributes_fixed1;
> +	u64 xfam_fixed0;
> +	u64 xfam_fixed1;
> +	u16 num_cpuid_config;
> +	u16 max_vcpus_per_td;
> +	u64 cpuid_config_leaves[32];
> +	u64 cpuid_config_values[32][2];
> +};
> +
>   struct tdx_sys_info {
>   	struct tdx_sys_info_version version;
>   	struct tdx_sys_info_features features;
>   	struct tdx_sys_info_tdmr tdmr;
>   	struct tdx_sys_info_cmr cmr;
> +	struct tdx_sys_info_td_ctrl td_ctrl;
> +	struct tdx_sys_info_td_conf td_conf;
>   };
>   
>   #endif
> diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> index 2fe57e084453..44c2b3e079de 100644
> --- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> +++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> @@ -76,6 +76,50 @@ static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
>   	return ret;
>   }
>   
> +static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl)
> +{
> +	int ret = 0;
> +	u64 val;
> +
> +	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000000, &val)))
> +		sysinfo_td_ctrl->tdr_base_size = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000100, &val)))
> +		sysinfo_td_ctrl->tdcs_base_size = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000200, &val)))
> +		sysinfo_td_ctrl->tdvps_base_size = val;
> +
> +	return ret;
> +}
> +
> +static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf)
> +{
> +	int ret = 0;
> +	u64 val;
> +	int i, j;
> +
> +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val)))
> +		sysinfo_td_conf->attributes_fixed0 = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val)))
> +		sysinfo_td_conf->attributes_fixed1 = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000002, &val)))
> +		sysinfo_td_conf->xfam_fixed0 = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000003, &val)))
> +		sysinfo_td_conf->xfam_fixed1 = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000004, &val)))
> +		sysinfo_td_conf->num_cpuid_config = val;
> +	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000008, &val)))
> +		sysinfo_td_conf->max_vcpus_per_td = val;
> +	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)

This is not safe. We need to check

	sysinfo_td_conf->num_cpuid_config <= 32.

If the TDX module version does not match the JSON file that was
used to generate tdx_global_metadata.h, the num_cpuid_config
reported by the actual TDX module might exceed 32, which would cause
an out-of-bounds array access.

> +		if (!ret && !(ret = read_sys_metadata_field(0x9900000300000400 + i, &val)))
> +			sysinfo_td_conf->cpuid_config_leaves[i] = val;
> +	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
> +		for (j = 0; j < 2; j++)
> +			if (!ret && !(ret = read_sys_metadata_field(0x9900000300000500 + i * 2 + j, &val)))
> +				sysinfo_td_conf->cpuid_config_values[i][j] = val;
> +
> +	return ret;
> +}
> +
>   static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
>   {
>   	int ret = 0;
> @@ -84,6 +128,8 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
>   	ret = ret ?: get_tdx_sys_info_features(&sysinfo->features);
>   	ret = ret ?: get_tdx_sys_info_tdmr(&sysinfo->tdmr);
>   	ret = ret ?: get_tdx_sys_info_cmr(&sysinfo->cmr);
> +	ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
> +	ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
>   
>   	return ret;
>   }



* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-06  8:37   ` Xiaoyao Li
@ 2024-12-06 16:13     ` Huang, Kai
  2024-12-06 16:18       ` Huang, Kai
  2024-12-06 16:24       ` Dave Hansen
  0 siblings, 2 replies; 103+ messages in thread
From: Huang, Kai @ 2024-12-06 16:13 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com,
	Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Chatre, Reinette, Hansen, Dave,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On Fri, 2024-12-06 at 16:37 +0800, Xiaoyao Li wrote:
>  +static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf)
> > +{
> > +	int ret = 0;
> > +	u64 val;
> > +	int i, j;
> > +
> > +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val)))
> > +		sysinfo_td_conf->attributes_fixed0 = val;
> > +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val)))
> > +		sysinfo_td_conf->attributes_fixed1 = val;
> > +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000002, &val)))
> > +		sysinfo_td_conf->xfam_fixed0 = val;
> > +	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000003, &val)))
> > +		sysinfo_td_conf->xfam_fixed1 = val;
> > +	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000004, &val)))
> > +		sysinfo_td_conf->num_cpuid_config = val;
> > +	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000008, &val)))
> > +		sysinfo_td_conf->max_vcpus_per_td = val;
> > +	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
> 
> It is not safe. We need to check
> 
> 	sysinfo_td_conf->num_cpuid_config <= 32.
> 
> If the TDX module version is not matched with the json file that was 
> used to generate the tdx_global_metadata.h, the num_cpuid_config 
> reported by the actual TDX module might exceed 32 which causes 
> out-of-bound array access.

+Dave.

I thought 32 (which is also auto-generated from the "Num Fields" in the JSON
file) was architectural, but looking at the TDX 1.5 spec, it doesn't seem to
be mentioned anywhere.

I think we can add:

	if (sysinfo_td_conf->num_cpuid_config <= 32)
		return -EINVAL;

... which will make reading the global metadata fail, and result in module
initialization failure.  Basically it means that if one day some TDX module
comes with >32 entries, some old kernel versions won't be able to support it.
But I think that should be fine.

This also reminds me that reading the CMRs is similar, but I don't think we
need a similar check there, because "maximum number of CMRs is 32" is
architectural behaviour, as mentioned in the TDX 1.5 base spec:

4.1.3.1 Intel TDX ISA Background: Convertible Memory Ranges (CMRs)

...

* The maximum number of CMRs is implementation specific. It is not explicitly
enumerated; it is deduced from Family/Model/Stepping information provided by
CPUID.
   * The maximum number of CMRs is 32.

Hi Dave,

Do you have any comments?


* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-06 16:13     ` Huang, Kai
@ 2024-12-06 16:18       ` Huang, Kai
  2024-12-06 16:24       ` Dave Hansen
  1 sibling, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2024-12-06 16:18 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com,
	Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Chatre, Reinette, Hansen, Dave,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On Fri, 2024-12-06 at 16:13 +0000, Huang, Kai wrote:
> I think we can add:
> 
> 	if (sysinfo_td_conf->num_cpuid_config <= 32)
> 		return -EINVAL;

Sorry it should be:

	if (sysinfo_td_conf->num_cpuid_config > 32)
		return -EINVAL;
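
For illustration only, the corrected check could sit in front of the
read loop roughly as in the userspace sketch below.  'struct td_conf',
read_leaf() and fill_leaves() are hypothetical stand-ins, and
ARRAY_SIZE is redefined locally; whether such a check is needed at all
is a separate question discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct td_conf {
	uint16_t num_cpuid_config;
	uint64_t cpuid_config_leaves[32];
};

/* Hypothetical reader: pretend entry 'i' holds the value 0x100 + i. */
static int read_leaf(unsigned int i, uint64_t *val)
{
	*val = 0x100 + i;
	return 0;
}

/* 'reported' models the entry count reported by the module. */
int fill_leaves(struct td_conf *conf, uint16_t reported)
{
	uint64_t val;
	int ret;

	/* Reject counts that would index past the fixed-size array. */
	if (reported > ARRAY_SIZE(conf->cpuid_config_leaves))
		return -EINVAL;

	conf->num_cpuid_config = reported;
	for (unsigned int i = 0; i < reported; i++) {
		ret = read_leaf(i, &val);
		if (ret)
			return ret;
		conf->cpuid_config_leaves[i] = val;
	}
	return 0;
}
```

Comparing against ARRAY_SIZE() rather than a literal 32 keeps the check
tied to the array declaration if the generated header ever changes.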


* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-06 16:13     ` Huang, Kai
  2024-12-06 16:18       ` Huang, Kai
@ 2024-12-06 16:24       ` Dave Hansen
  2024-12-07  0:00         ` Huang, Kai
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-12-06 16:24 UTC (permalink / raw)
  To: Huang, Kai, Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com,
	Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On 12/6/24 08:13, Huang, Kai wrote:
> It is not safe. We need to check
> 
>       sysinfo_td_conf->num_cpuid_config <= 32.
> 
> If the TDX module version is not matched with the json file that was
> used to generate the tdx_global_metadata.h, the num_cpuid_config
> reported by the actual TDX module might exceed 32 which causes
> out-of-bound array access.

The JSON *IS* the ABI description. It can't change between versions of
the TDX module. It can only be extended. The "32" is not in the spec
because the spec refers to the JSON!

* RE: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-06 16:24       ` Dave Hansen
@ 2024-12-07  0:00         ` Huang, Kai
  2024-12-12  0:31           ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Huang, Kai @ 2024-12-07  0:00 UTC (permalink / raw)
  To: Hansen, Dave, Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com,
	Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

> On 12/6/24 08:13, Huang, Kai wrote:
> > It is not safe. We need to check
> >
> >       sysinfo_td_conf->num_cpuid_config <= 32.
> >
> > If the TDX module version is not matched with the json file that was
> > used to generate the tdx_global_metadata.h, the num_cpuid_config
> > reported by the actual TDX module might exceed 32 which causes
> > out-of-bound array access.
> 
> The JSON *IS* the ABI description. It can't change between versions of the
> TDX module. It can only be extended. The "32" is not in the spec because the
> spec refers to the JSON!

Ah, yeah, agreed, the "spec refers to the JSON".  :-)

* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-07  0:00         ` Huang, Kai
@ 2024-12-12  0:31           ` Edgecombe, Rick P
  2024-12-21  1:17             ` Huang, Kai
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-12-12  0:31 UTC (permalink / raw)
  To: Li, Xiaoyao, Williams, Dan J, pbonzini@redhat.com, Hansen, Dave,
	seanjc@google.com, Huang, Kai
  Cc: kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On Sat, 2024-12-07 at 00:00 +0000, Huang, Kai wrote:
> > On 12/6/24 08:13, Huang, Kai wrote:
> > > It is not safe. We need to check
> > > 
> > >        sysinfo_td_conf->num_cpuid_config <= 32.
> > > 
> > > If the TDX module version is not matched with the json file that was
> > > used to generate the tdx_global_metadata.h, the num_cpuid_config
> > > reported by the actual TDX module might exceed 32 which causes
> > > out-of-bound array access.
> > 
> > The JSON *IS* the ABI description. It can't change between versions of the
> > TDX module. It can only be extended. The "32" is not in the spec because the
> > spec refers to the JSON!
> 
> Ah, yeah, agreed, the "spec refers to the JSON".  :-)

So we heard back from TDX module folks that they were thinking the 32 could
change to be larger (thanks Kai for checking). We need to continue education
with them around what KVM is depending on as TDX Module ABI. And we should get
something clearer than these JSONs.

But in the meantime, we could tell TDX module team they need an opt-in to change
this field. We could also add an actual check to fail cleanly:

diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index 44c2b3e079de..744549bdf1dd 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -97,6 +97,10 @@ static int get_tdx_sys_info_td_conf(struct
tdx_sys_info_td_conf *sysinfo_td_conf
        u64 val;
        int i, j;
 
+       if (sysinfo_td_conf->num_cpuid_config >
+           ARRAY_SIZE(sysinfo_td_conf->cpuid_config_leaves))
+               return 1;
+
        if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val)))
                sysinfo_td_conf->attributes_fixed0 = val;
        if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val)))

Or we could dynamically allocate these arrays based on num_cpuid_config.

I'd lean towards switching to the dynamic allocation, because it will be cleaner
and less churn for this array expanding in the future.
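
A rough userspace sketch of the dynamic-allocation option, for illustration
only. The names td_conf_sketch, td_conf_alloc() and td_conf_free() are
hypothetical and not part of any posted patch; kernel code would presumably
use kcalloc()/kfree() instead of calloc()/free().

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tdx_sys_info_td_conf with heap-backed arrays. */
struct td_conf_sketch {
	uint16_t num_cpuid_config;
	uint64_t *cpuid_config_leaves;		/* num_cpuid_config entries */
	uint64_t (*cpuid_config_values)[2];	/* num_cpuid_config pairs */
};

/* Release both arrays and reset the count. Safe to call on a failed alloc. */
static void td_conf_free(struct td_conf_sketch *c)
{
	free(c->cpuid_config_leaves);
	free(c->cpuid_config_values);
	c->cpuid_config_leaves = NULL;
	c->cpuid_config_values = NULL;
	c->num_cpuid_config = 0;
}

/*
 * Size both arrays from the count reported by the module, so no
 * compile-time limit (32, 128, ...) is baked in. Returns 0 or -1.
 */
static int td_conf_alloc(struct td_conf_sketch *c, uint16_t num)
{
	c->cpuid_config_leaves = calloc(num, sizeof(*c->cpuid_config_leaves));
	c->cpuid_config_values = calloc(num, sizeof(*c->cpuid_config_values));

	/* calloc(0, ...) may legitimately return NULL, so only fail for num > 0. */
	if (num && (!c->cpuid_config_leaves || !c->cpuid_config_values)) {
		td_conf_free(c);
		return -1;
	}

	c->num_cpuid_config = num;
	return 0;
}
```

With this shape the reader loops would index up to num_cpuid_config without
needing a separate bounds check, at the cost of an allocation that can fail
and must be freed.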



* Re: [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-12-12  0:31           ` Edgecombe, Rick P
@ 2024-12-21  1:17             ` Huang, Kai
  0 siblings, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2024-12-21  1:17 UTC (permalink / raw)
  To: Edgecombe, Rick P, Li, Xiaoyao, Williams, Dan J,
	pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com



On 12/12/2024 8:31 am, Edgecombe, Rick P wrote:
> On Sat, 2024-12-07 at 00:00 +0000, Huang, Kai wrote:
>>> On 12/6/24 08:13, Huang, Kai wrote:
>>>> It is not safe. We need to check
>>>>
>>>>         sysinfo_td_conf->num_cpuid_config <= 32.
>>>>
>>>> If the TDX module version is not matched with the json file that was
>>>> used to generate the tdx_global_metadata.h, the num_cpuid_config
>>>> reported by the actual TDX module might exceed 32 which causes
>>>> out-of-bound array access.
>>>
>>> The JSON *IS* the ABI description. It can't change between versions of the
>>> TDX module. It can only be extended. The "32" is not in the spec because the
>>> spec refers to the JSON!
>>
>> Ah, yeah, agreed, the "spec refers to the JSON".  :-)
> 
> So we heard back from TDX module folks that they were thinking the 32 could
> change to be larger (thanks Kai for checking). We need to continue education
> with them around what KVM is depending on as TDX Module ABI. And we should get
> something clearer than these JSONs.
> 
> But in the meantime, we could tell TDX module team they need an opt-in to change
> this field. We could also add an actual check to fail cleanly:
> 

Hi Paolo/Sean/Dave,

The TDX module team has acknowledged that changing 32 to a higher value
in future modules would break the ABI.  They also promised that 128 is
the maximum value they have reserved for CPUID_CONFIGs, so it won't
change for any module.  They will update the JSON to reflect this.

I just sent out an updated v2.1 of this patch that bumps the array size
for CPUID_CONFIGs to 128 and adds paranoid checks to protect the kernel
from potential TDX module breakage on this.

I'd appreciate it if you could help review, but for now, I wish you a
wonderful Christmas :-)

* [PATCH v2.1 03/25] x86/virt/tdx: Read essential global metadata for KVM
  2024-10-30 19:00 ` [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM Rick Edgecombe
  2024-12-06  8:37   ` Xiaoyao Li
@ 2024-12-21  1:07   ` Kai Huang
  1 sibling, 0 replies; 103+ messages in thread
From: Kai Huang @ 2024-12-21  1:07 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, dave.hansen, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, tony.lindgren, xiaoyao.li,
	reinette.chatre

KVM needs two classes of global metadata to create and run TDX guests:

 - "TD Control Structures"
 - "TD Configurability"

The first class contains the sizes of TDX guest per-VM and per-vCPU
control structures.  KVM will need to use them to allocate enough space
for those control structures.

The second class contains info which reports things like which features
are configurable to TDX guests.  KVM will need to use them to properly
configure TDX guests.

Read them for KVM TDX to use.

Basically, the code change is auto-generated by adding below to the
script in [1]:

    "td_ctrl": [
        "TDR_BASE_SIZE",
        "TDCS_BASE_SIZE",
        "TDVPS_BASE_SIZE",
    ],
    "td_conf": [
        "ATTRIBUTES_FIXED0",
        "ATTRIBUTES_FIXED1",
        "XFAM_FIXED0",
        "XFAM_FIXED1",
        "NUM_CPUID_CONFIG",
        "MAX_VCPUS_PER_TD",
        "CPUID_CONFIG_LEAVES",
        "CPUID_CONFIG_VALUES",
    ],

.. and re-running the script:

  #python tdx_global_metadata.py global_metadata.json \
  	tdx_global_metadata.h tdx_global_metadata.c

.. but unfortunately with some tweaks:

The "Intel TDX Module v1.5.09 ABI Definitions" JSON files[2], which
describe the TDX module ABI to the kernel, were expected to maintain
backward compatibility.  However, it turns out there are plans to change
the JSON per module release.  Specifically, the maximum number of
CPUID_CONFIGs, i.e., CPUID_CONFIG_{LEAVES|VALUES} is one of the fields
expected to change.

This is obviously problematic for the kernel, and needs to be addressed
by the TDX Module team.  Negotiations on clarifying the ABI boundary in
the spec for future models are ongoing.  In the meantime, the TDX module
team has agreed not to increase this specific field beyond 128 entries
without an opt-in.

So for now just tweak the JSON to change "Num Fields" from 32 to 128 and
generate a fixed-size (128) array for CPUID_CONFIG_{LEAVES|VALUES}.

Also, due to all those ABI breakages (and module bugs), be paranoid by
generating additional checks to make sure NUM_CPUID_CONFIG will never
exceed the array size of CPUID_CONFIG_{LEAVES|VALUES} to protect the
kernel from the module breakages.  With those checks, detecting a
breakage will just result in module initialization failure.
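
The paranoid check described above can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the generated code:
td_conf_sketch, fake_read_field() and read_cpuid_leaves() are hypothetical,
and the fake reader merely stands in for read_sys_metadata_field(). Only the
two field IDs are taken from the diff below.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

#define NUM_CPUID_CONFIG_ID	0x9900000100000004ULL
#define CPUID_CONFIG_LEAVES_ID	0x9900000300000400ULL

/* Hypothetical cut-down version of struct tdx_sys_info_td_conf. */
struct td_conf_sketch {
	uint16_t num_cpuid_config;
	uint64_t cpuid_config_leaves[128];
};

/* Value the fake "module" reports for NUM_CPUID_CONFIG. */
static uint64_t fake_num_cpuid_config = 32;

/* Fake metadata reader: always succeeds, returns made-up payloads. */
static int fake_read_field(uint64_t field_id, uint64_t *val)
{
	if (field_id == NUM_CPUID_CONFIG_ID)
		*val = fake_num_cpuid_config;
	else
		*val = field_id & 0xff;	/* arbitrary, derived from the ID */
	return 0;
}

static int read_cpuid_leaves(struct td_conf_sketch *c)
{
	uint64_t val;
	int ret, i;

	ret = fake_read_field(NUM_CPUID_CONFIG_ID, &val);
	if (ret)
		return ret;

	/* Reject the count before it is ever used as a loop bound. */
	if (val > ARRAY_SIZE(c->cpuid_config_leaves))
		return -EINVAL;
	c->num_cpuid_config = (uint16_t)val;

	for (i = 0; i < c->num_cpuid_config; i++) {
		ret = fake_read_field(CPUID_CONFIG_LEAVES_ID + i, &val);
		if (ret)
			return ret;
		c->cpuid_config_leaves[i] = val;
	}

	return 0;
}
```

The point of the ordering is that a module reporting more entries than the
array can hold is rejected with -EINVAL before any element is written.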

Link: https://lore.kernel.org/762a50133300710771337398284567b299a86f67.camel@intel.com/ [1]
Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2]
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v2 -> v2.1
 - Bump array size for CPUID_CONFIGs to 128
 - Add paranoid checks to protect against incorrect NUM_CPUID_CONFIG.
 - Update changelog accordingly.

 Note: this is based on kvm-coco-queue which has v7 of TDX host metadata
 series which has patches to read TDX module version and CMRs.  It will
 have conflicts to resolve when rebasing to the v9 patches currently
 queued in tip/x86/tdx.

uAPI breakout v2:
 - New patch

---
 arch/x86/include/asm/tdx_global_metadata.h  | 19 ++++++++
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 50 +++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
index fde370b855f1..cfef9e5e4d93 100644
--- a/arch/x86/include/asm/tdx_global_metadata.h
+++ b/arch/x86/include/asm/tdx_global_metadata.h
@@ -32,11 +32,30 @@ struct tdx_sys_info_cmr {
 	u64 cmr_size[32];
 };
 
+struct tdx_sys_info_td_ctrl {
+	u16 tdr_base_size;
+	u16 tdcs_base_size;
+	u16 tdvps_base_size;
+};
+
+struct tdx_sys_info_td_conf {
+	u64 attributes_fixed0;
+	u64 attributes_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+	u16 num_cpuid_config;
+	u16 max_vcpus_per_td;
+	u64 cpuid_config_leaves[128];
+	u64 cpuid_config_values[128][2];
+};
+
 struct tdx_sys_info {
 	struct tdx_sys_info_version version;
 	struct tdx_sys_info_features features;
 	struct tdx_sys_info_tdmr tdmr;
 	struct tdx_sys_info_cmr cmr;
+	struct tdx_sys_info_td_ctrl td_ctrl;
+	struct tdx_sys_info_td_conf td_conf;
 };
 
 #endif
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index 2fe57e084453..d96dbfb43574 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -76,6 +76,54 @@ static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
 	return ret;
 }
 
+static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl)
+{
+	int ret = 0;
+	u64 val;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000000, &val)))
+		sysinfo_td_ctrl->tdr_base_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000100, &val)))
+		sysinfo_td_ctrl->tdcs_base_size = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9800000100000200, &val)))
+		sysinfo_td_ctrl->tdvps_base_size = val;
+
+	return ret;
+}
+
+static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf)
+{
+	int ret = 0;
+	u64 val;
+	int i, j;
+
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val)))
+		sysinfo_td_conf->attributes_fixed0 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val)))
+		sysinfo_td_conf->attributes_fixed1 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000002, &val)))
+		sysinfo_td_conf->xfam_fixed0 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x1900000300000003, &val)))
+		sysinfo_td_conf->xfam_fixed1 = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000004, &val)))
+		sysinfo_td_conf->num_cpuid_config = val;
+	if (!ret && !(ret = read_sys_metadata_field(0x9900000100000008, &val)))
+		sysinfo_td_conf->max_vcpus_per_td = val;
+	if (sysinfo_td_conf->num_cpuid_config > ARRAY_SIZE(sysinfo_td_conf->cpuid_config_leaves))
+		return -EINVAL;
+	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
+		if (!ret && !(ret = read_sys_metadata_field(0x9900000300000400 + i, &val)))
+			sysinfo_td_conf->cpuid_config_leaves[i] = val;
+	if (sysinfo_td_conf->num_cpuid_config > ARRAY_SIZE(sysinfo_td_conf->cpuid_config_values))
+		return -EINVAL;
+	for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++)
+		for (j = 0; j < 2; j++)
+			if (!ret && !(ret = read_sys_metadata_field(0x9900000300000500 + i * 2 + j, &val)))
+				sysinfo_td_conf->cpuid_config_values[i][j] = val;
+
+	return ret;
+}
+
 static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
 {
 	int ret = 0;
@@ -84,6 +132,8 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
 	ret = ret ?: get_tdx_sys_info_features(&sysinfo->features);
 	ret = ret ?: get_tdx_sys_info_tdmr(&sysinfo->tdmr);
 	ret = ret ?: get_tdx_sys_info_cmr(&sysinfo->cmr);
+	ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
+	ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
 
 	return ret;
 }
-- 
2.43.0


* [PATCH v2 04/25] x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (2 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management Rick Edgecombe
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from malicious hosts and certain physical
attacks. Pre-TDX Intel hardware has support for a memory encryption
architecture called MK-TME, which repurposes several high bits of
physical address as "KeyID". The BIOS reserves a sub-range of MK-TME
KeyIDs as "TDX private KeyIDs".

Each TDX guest must be assigned a unique TDX KeyID when it is
created. The kernel reserves the first TDX private KeyID for
crypto-protection of specific TDX module data which has a lifecycle that
exceeds the KeyID reserved for the TD's use. The rest of the KeyIDs are
left for TDX guests to use.

Create a small KeyID allocator. Export
tdx_guest_keyid_alloc()/tdx_guest_keyid_free() to allocate and free TDX
guest KeyID for KVM to use.

Don't provide the stub functions when CONFIG_INTEL_TDX_HOST=n since they
are not supposed to be called in this case.
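
For readers unfamiliar with the IDA interface, the behaviour this allocator
relies on can be approximated in userspace as below. This is only a sketch:
keyid_alloc_sketch() and keyid_free_sketch() are hypothetical, the start and
count values are made up, and handing out the lowest free ID is a
simplification rather than a statement about how the IDA picks IDs.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_KEYIDS 64

/* Made-up stand-ins for tdx_guest_keyid_start / tdx_nr_guest_keyids. */
static unsigned int keyid_start = 33;
static unsigned int nr_keyids = 4;

/* One "in use" flag per guest KeyID, indexed from keyid_start. */
static bool keyid_used[SKETCH_MAX_KEYIDS];

/* Return a free KeyID in [keyid_start, keyid_start + nr_keyids - 1], or -ENOSPC. */
static int keyid_alloc_sketch(void)
{
	unsigned int i;

	for (i = 0; i < nr_keyids && i < SKETCH_MAX_KEYIDS; i++) {
		if (!keyid_used[i]) {
			keyid_used[i] = true;
			return (int)(keyid_start + i);
		}
	}

	return -ENOSPC;
}

/* Mark a KeyID as free again; out-of-range values are ignored. */
static void keyid_free_sketch(unsigned int keyid)
{
	if (keyid < keyid_start)
		return;
	if (keyid - keyid_start >= nr_keyids ||
	    keyid - keyid_start >= SKETCH_MAX_KEYIDS)
		return;

	keyid_used[keyid - keyid_start] = false;
}
```

As with tdx_guest_keyid_alloc(), a negative return value means no KeyID was
available, so callers must check for it before using the result.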

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Move code from KVM to x86 core, and export them.
 - Update log.

uAPI breakout v1:
 - Update the commit message
 - Delete stale comment on global hkid
 - Deleted WARN_ON_ONCE() as it doesn't seem very useful

v19:
 - Removed stale comment in tdx_guest_keyid_alloc() by Binbin
 - Update sanity check in tdx_guest_keyid_free() by Binbin

v18:
 - Moved the functions to kvm tdx from arch/x86/virt/vmx/tdx/
 - Drop exporting symbols as the host tdx does.
---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/virt/vmx/tdx/tdx.c | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b9758369d82c..d33e46d53d59 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -118,6 +118,9 @@ int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
 const struct tdx_sys_info *tdx_get_sysinfo(void);
+
+int tdx_guest_keyid_alloc(void);
+void tdx_guest_keyid_free(unsigned int keyid);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7589c75eaa6c..b883c1a4b002 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -27,6 +27,7 @@
 #include <linux/log2.h>
 #include <linux/acpi.h>
 #include <linux/suspend.h>
+#include <linux/idr.h>
 #include <asm/page.h>
 #include <asm/special_insns.h>
 #include <asm/msr-index.h>
@@ -42,6 +43,8 @@ static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+static DEFINE_IDA(tdx_guest_keyid_pool);
+
 static DEFINE_PER_CPU(bool, tdx_lp_initialized);
 
 static struct tdmr_info_list tdx_tdmr_list;
@@ -1545,3 +1548,17 @@ const struct tdx_sys_info *tdx_get_sysinfo(void)
 	return p;
 }
 EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
+
+int tdx_guest_keyid_alloc(void)
+{
+	return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
+			       tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
+			       GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(tdx_guest_keyid_alloc);
+
+void tdx_guest_keyid_free(unsigned int keyid)
+{
+	ida_free(&tdx_guest_keyid_pool, keyid);
+}
+EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);
-- 
2.47.0


* [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (3 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 04/25] x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-11-12 20:09   ` Dave Hansen
  2024-10-30 19:00 ` [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation Rick Edgecombe
                   ` (22 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from malicious hosts and certain physical
attacks. Pre-TDX Intel hardware has support for a memory encryption
architecture called MK-TME, which repurposes several high bits of
physical address as "KeyID". TDX ends up reserving a sub-range of
MK-TME KeyIDs as "TDX private KeyIDs".

Like MK-TME, these KeyIDs can be associated with an ephemeral key. For TDX
this association is done by the TDX module. It also has its own tracking
for which KeyIDs are in use. To do this ephemeral key setup and manipulate
the TDX module's internal tracking, KVM will use the following SEAMCALLs:
 TDH.MNG.KEY.CONFIG: Mark the KeyID as in use, and initialize its
                     ephemeral key.
 TDH.MNG.KEY.FREEID: Mark the KeyID as not in use.

These SEAMCALLs both operate on TDR structures, which are set up using the
TDH.MNG.CREATE SEAMCALL (added in a later patch). KVM's use of these
operations will go like:
 - tdx_guest_keyid_alloc()
 - Initialize TD and TDR page with TDH.MNG.CREATE (not yet added), passing
   KeyID
 - TDH.MNG.KEY.CONFIG to initialize the key
 - TD runs, teardown is started
 - TDH.MNG.KEY.FREEID
 - tdx_guest_keyid_free()

Don't try to combine the tdx_guest_keyid_alloc() and TDH.MNG.KEY.CONFIG
operations because TDH.MNG.CREATE and some locking need to be done in the
middle. Don't combine TDH.MNG.KEY.FREEID and tdx_guest_keyid_free() so they
are symmetrical with the creation path.

So implement tdh_mng_key_config() and tdh_mng_key_freeid() as functions
separate from tdx_guest_keyid_alloc() and tdx_guest_keyid_free().
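
The intended call ordering can be sketched with stubs as below. Everything
here is hypothetical scaffolding: the stub_*() functions only record that
they were called, the KeyID value is arbitrary, and error handling and
locking are left out on purpose.

```c
#include <stdint.h>

#define MAX_STEPS 8

/* Order in which the stubbed operations were invoked. */
static const char *steps[MAX_STEPS];
static int nr_steps;

static void record(const char *name)
{
	if (nr_steps < MAX_STEPS)
		steps[nr_steps++] = name;
}

/* Stubs standing in for the real helpers; they only record the call. */
static int stub_keyid_alloc(void)
{
	record("tdx_guest_keyid_alloc");
	return 33;	/* arbitrary KeyID for illustration */
}

static uint64_t stub_tdh_mng_create(uint64_t tdr, int keyid)
{
	(void)tdr;
	(void)keyid;
	record("TDH.MNG.CREATE");
	return 0;
}

static uint64_t stub_tdh_mng_key_config(uint64_t tdr)
{
	(void)tdr;
	record("TDH.MNG.KEY.CONFIG");
	return 0;
}

static uint64_t stub_tdh_mng_key_freeid(uint64_t tdr)
{
	(void)tdr;
	record("TDH.MNG.KEY.FREEID");
	return 0;
}

static void stub_keyid_free(int keyid)
{
	(void)keyid;
	record("tdx_guest_keyid_free");
}

/* Happy path only; real code needs error handling and locking. */
static int td_keyid_lifecycle_sketch(uint64_t tdr)
{
	int keyid = stub_keyid_alloc();

	if (keyid < 0)
		return keyid;

	stub_tdh_mng_create(tdr, keyid);
	stub_tdh_mng_key_config(tdr);
	/* ... the TD runs here, then teardown starts ... */
	stub_tdh_mng_key_freeid(tdr);
	stub_keyid_free(keyid);

	return 0;
}
```

Keeping the allocator calls and the SEAMCALL wrappers separate is what lets
TDH.MNG.CREATE sit between allocation and key configuration in this sequence.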

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core;
 - Move TDH_xx macros from KVM to x86 core;
 - Re-write log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
  each inputs.

v16:
 - use struct tdx_module_args instead of struct tdx_module_output
 - Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
 arch/x86/include/asm/tdx.h  |  4 ++++
 arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 16 +++++++++-------
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d33e46d53d59..9897335a8e2f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -121,6 +121,10 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
 
 int tdx_guest_keyid_alloc(void);
 void tdx_guest_keyid_free(unsigned int keyid);
+
+/* SEAMCALL wrappers for creating/destroying/running TDX guests */
+u64 tdh_mng_key_config(u64 tdr);
+u64 tdh_mng_key_freeid(u64 tdr);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b883c1a4b002..c42eab8cc069 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1562,3 +1562,23 @@ void tdx_guest_keyid_free(unsigned int keyid)
 	ida_free(&tdx_guest_keyid_pool, keyid);
 }
 EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);
+
+u64 tdh_mng_key_config(u64 tdr)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+	};
+
+	return seamcall(TDH_MNG_KEY_CONFIG, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mng_key_config);
+
+u64 tdh_mng_key_freeid(u64 tdr)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+	};
+
+	return seamcall(TDH_MNG_KEY_FREEID, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mng_key_freeid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9b708a8fb568..95002e7ff4c5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -17,13 +17,15 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
-#define TDH_PHYMEM_PAGE_RDMD	24
-#define TDH_SYS_KEY_CONFIG	31
-#define TDH_SYS_INIT		33
-#define TDH_SYS_RD		34
-#define TDH_SYS_LP_INIT		35
-#define TDH_SYS_TDMR_INIT	36
-#define TDH_SYS_CONFIG		45
+#define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_KEY_FREEID		20
+#define TDH_PHYMEM_PAGE_RDMD		24
+#define TDH_SYS_KEY_CONFIG		31
+#define TDH_SYS_INIT			33
+#define TDH_SYS_RD			34
+#define TDH_SYS_LP_INIT			35
+#define TDH_SYS_TDMR_INIT		36
+#define TDH_SYS_CONFIG			45
 
 /* TDX page types */
 #define	PT_NDA		0x0
-- 
2.47.0


* Re: [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management
  2024-10-30 19:00 ` [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management Rick Edgecombe
@ 2024-11-12 20:09   ` Dave Hansen
  2024-11-14  0:01     ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-12 20:09 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata,
	Sean Christopherson, Binbin Wu, Yuan Yao

On 10/30/24 12:00, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Intel TDX protects guest VMs from malicious host and certain physical
> attacks. Pre-TDX Intel hardware has support for a memory encryption
> architecture called MK-TME, which repurposes several high bits of
> physical address as "KeyID". TDX ends up with reserving a sub-range of
> MK-TME KeyIDs as "TDX private KeyIDs".

The changelog there was great.  It read my mind because I was wondering
why some of the operations didn't get combined in helper functions which
could be exported.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management
  2024-11-12 20:09   ` Dave Hansen
@ 2024-11-14  0:01     ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-14  0:01 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On Tue, 2024-11-12 at 12:09 -0800, Dave Hansen wrote:
> On 10/30/24 12:00, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Intel TDX protects guest VMs from malicious host and certain physical
> > attacks. Pre-TDX Intel hardware has support for a memory encryption
> > architecture called MK-TME, which repurposes several high bits of
> > physical address as "KeyID". TDX ends up with reserving a sub-range of
> > MK-TME KeyIDs as "TDX private KeyIDs".
> 
> The changelog there was great.  It read my mind because I was wondering
> why some of the operations didn't get combined in helper functions which
> could be exported.
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks, I'll make the u64 removal changes to this one and leave the ack.

* [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (4 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-11-12 20:17   ` Dave Hansen
  2024-10-30 19:00 ` [PATCH v2 07/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation Rick Edgecombe
                   ` (21 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from malicious hosts and certain physical
attacks. It defines various control structures that hold state for things
like TDs or vCPUs. These control structures are stored in pages given to
the TDX module and encrypted with either the global KeyID or the guest
KeyIDs.

To manipulate these control structures the TDX module defines a few
SEAMCALLs. KVM will use these during the process of creating a TD as
follows:

1) Allocate a unique TDX KeyID for a new guest.

2) Call TDH.MNG.CREATE to create a "TD Root" (TDR) page, together with
   the newly allocated KeyID. Unlike the rest of the TDX guest, the TDR
   page is crypto-protected by the 'global KeyID'.

3) Call the previously added TDH.MNG.KEY.CONFIG on each package to
   configure the KeyID for the guest. After this step, the KeyID to
   protect the guest is ready and the rest of the guest will be protected
   by this KeyID.

4) Call TDH.MNG.ADDCX to add TD Control Structure (TDCS) pages.

5) Call TDH.MNG.INIT to initialize the TDCS.

To reclaim these pages for use by the kernel other SEAMCALLs are needed,
which will be added in future patches.

Add tdh_mng_addcx(), tdh_mng_create() and tdh_mng_init() to export these
SEAMCALLs so that KVM can use them to create TDs.

For SEAMCALLs that give a page to the TDX module to be encrypted, clflush
the page mapped with KeyID 0, such that any dirty cache lines don't write
back later and clobber TD memory or control structures. Don't worry about
the other MK-TME KeyIDs because the kernel doesn't use them. The TDX docs
specify that this flush is not needed unless the TDX module exposes the
CLFLUSH_BEFORE_ALLOC feature bit. Be conservative and always flush.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core;
 - Move TDH_xx macros from KVM to x86 core;
 - Re-write log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
  each input.

v16:
 - use struct tdx_module_args instead of struct tdx_module_output
 - Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/virt/vmx/tdx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  3 +++
 3 files changed, 45 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9897335a8e2f..9d19ca33e884 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -123,8 +123,11 @@ int tdx_guest_keyid_alloc(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
 /* SEAMCALL wrappers for creating/destroying/running TDX guests */
+u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
 u64 tdh_mng_key_config(u64 tdr);
+u64 tdh_mng_create(u64 tdr, u64 hkid);
 u64 tdh_mng_key_freeid(u64 tdr);
+u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c42eab8cc069..16122fd552ff 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1563,6 +1563,18 @@ void tdx_guest_keyid_free(unsigned int keyid)
 }
 EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);
 
+u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
+{
+	struct tdx_module_args args = {
+		.rcx = tdcs,
+		.rdx = tdr,
+	};
+
+	clflush_cache_range(__va(tdcs), PAGE_SIZE);
+	return seamcall(TDH_MNG_ADDCX, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mng_addcx);
+
 u64 tdh_mng_key_config(u64 tdr)
 {
 	struct tdx_module_args args = {
@@ -1573,6 +1585,17 @@ u64 tdh_mng_key_config(u64 tdr)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_key_config);
 
+u64 tdh_mng_create(u64 tdr, u64 hkid)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+		.rdx = hkid,
+	};
+	clflush_cache_range(__va(tdr), PAGE_SIZE);
+	return seamcall(TDH_MNG_CREATE, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mng_create);
+
 u64 tdh_mng_key_freeid(u64 tdr)
 {
 	struct tdx_module_args args = {
@@ -1582,3 +1605,19 @@ u64 tdh_mng_key_freeid(u64 tdr)
 	return seamcall(TDH_MNG_KEY_FREEID, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_mng_key_freeid);
+
+u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+		.rdx = td_params,
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_MNG_INIT, &args);
+
+	*rcx = args.rcx;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mng_init);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 95002e7ff4c5..b9287304f372 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -17,8 +17,11 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_MNG_ADDCX			1
 #define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_CREATE			9
 #define TDH_MNG_KEY_FREEID		20
+#define TDH_MNG_INIT			21
 #define TDH_PHYMEM_PAGE_RDMD		24
 #define TDH_SYS_KEY_CONFIG		31
 #define TDH_SYS_INIT			33
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
  2024-10-30 19:00 ` [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation Rick Edgecombe
@ 2024-11-12 20:17   ` Dave Hansen
  2024-11-12 21:21     ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-12 20:17 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata,
	Sean Christopherson, Binbin Wu, Yuan Yao

On 10/30/24 12:00, Rick Edgecombe wrote:
> +u64 tdh_mng_create(u64 tdr, u64 hkid)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = tdr,
> +		.rdx = hkid,
> +	};
> +	clflush_cache_range(__va(tdr), PAGE_SIZE);
> +	return seamcall(TDH_MNG_CREATE, &args);
> +}
> +EXPORT_SYMBOL_GPL(tdh_mng_create);

I'd _prefer_ that this explain why the clflush is there.

The other goofy thing here is why it's getting a physical address passed
in.  It's my old 32-bit paranoia kicking in, but everything that has a
valid virtual address _also_ has a valid physical address.  The inverse
is not true, though.  So I like to keep things as pointers as long as
possible.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
  2024-11-12 20:17   ` Dave Hansen
@ 2024-11-12 21:21     ` Edgecombe, Rick P
  2024-11-12 21:40       ` Dave Hansen
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-12 21:21 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On Tue, 2024-11-12 at 12:17 -0800, Dave Hansen wrote:
> On 10/30/24 12:00, Rick Edgecombe wrote:
> > +u64 tdh_mng_create(u64 tdr, u64 hkid)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = tdr,
> > +		.rdx = hkid,
> > +	};
> > +	clflush_cache_range(__va(tdr), PAGE_SIZE);
> > +	return seamcall(TDH_MNG_CREATE, &args);
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_mng_create);
> 
> I'd _prefer_ that this explain why the clflush is there.

How about:
/*
 * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
 * a CLFLUSH of pages is required before handing them to the TDX module.
 * Be conservative and make the code simpler by doing the CLFLUSH 
 * unconditionally.
 */

> 
> The other goofy thing here is why it's getting a physical address passed
> in.  It's my old 32-bit paranoia kicking in, but everything that has a
> valid virtual address _also_ has a valid physical address.  The inverse
> is not true, though.  So I like to keep things as pointers as long as
> possible.

Ok, seems reasonable.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation
  2024-11-12 21:21     ` Edgecombe, Rick P
@ 2024-11-12 21:40       ` Dave Hansen
  0 siblings, 0 replies; 103+ messages in thread
From: Dave Hansen @ 2024-11-12 21:40 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On 11/12/24 13:21, Edgecombe, Rick P wrote:
> On Tue, 2024-11-12 at 12:17 -0800, Dave Hansen wrote:
>> On 10/30/24 12:00, Rick Edgecombe wrote:
>>> +u64 tdh_mng_create(u64 tdr, u64 hkid)
>>> +{
>>> +   struct tdx_module_args args = {
>>> +           .rcx = tdr,
>>> +           .rdx = hkid,
>>> +   };
>>> +   clflush_cache_range(__va(tdr), PAGE_SIZE);
>>> +   return seamcall(TDH_MNG_CREATE, &args);
>>> +}
>>> +EXPORT_SYMBOL_GPL(tdh_mng_create);
>> I'd _prefer_ that this explain why the clflush is there.
> How about:
> /*
>  * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
>  * a CLFLUSH of pages is required before handing them to the TDX module.
>  * Be conservative and make the code simpler by doing the CLFLUSH
>  * unconditionally.
>  */

Is there a chance we could put this in a helper so the "be conservative"
policy is centralized in one location?  The comment could also go there.
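
A minimal userspace sketch of what such a helper could look like.
tdx_clflush_page() is a hypothetical name, and clflush_cache_range()
below is only a stub that records its arguments so the example builds
outside the kernel. It is not the in-tree implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Stub for the kernel's clflush_cache_range(); it only records the request. */
static void *flushed_addr;
static unsigned int flushed_len;

static void clflush_cache_range(void *vaddr, unsigned int size)
{
	flushed_addr = vaddr;
	flushed_len = size;
}

/*
 * Hypothetical helper (name is an assumption).
 *
 * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
 * a CLFLUSH of pages is required before handing them to the TDX module.
 * Be conservative and make the code simpler by doing the CLFLUSH
 * unconditionally.
 */
static void tdx_clflush_page(void *page)
{
	clflush_cache_range(page, PAGE_SIZE);
}
```

With something like this, each wrapper that hands a page to the TDX
module would call the helper once, and the policy comment would live in
one place. Taking a pointer instead of a physical address also lines up
with the earlier preference for keeping things as pointers.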

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 07/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (5 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management Rick Edgecombe
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from a malicious host and certain physical
attacks. It defines various control structures that hold state for
virtualized components of the TD (i.e. VMs or vCPUs). These control
structures are stored in pages given to the TDX module and encrypted
with either the global KeyID or the guest KeyIDs.

To manipulate these control structures the TDX module defines a few
SEAMCALLs. KVM will use these during the process of creating a vCPU as
follows:

1) Call TDH.VP.CREATE to create a TD vCPU Root (TDVPR) page for each
   vCPU.

2) Call TDH.VP.ADDCX to add per-vCPU control pages (TDCX) for each vCPU.

3) Call TDH.VP.INIT to initialize the TDCX for each vCPU.

To reclaim these pages for use by the kernel other SEAMCALLs are needed,
which will be added in future patches.
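
The three steps above could be sketched roughly as follows. This is a
userspace illustration only: the tdh_vp_*() functions are stubs that
record the call order and always succeed, and example_create_vcpu() is
a hypothetical caller, not KVM's actual code. A real caller would also
have to deal with pages already handed to the TDX module when a later
step fails.

```c
#include <stdint.h>

typedef uint64_t u64;

enum { CALL_VP_CREATE = 1, CALL_VP_ADDCX, CALL_VP_INIT };

/* Call log filled in by the stubs below. */
static int call_log[16];
static int nr_calls;

static void log_call(int id)
{
	if (nr_calls < 16)
		call_log[nr_calls++] = id;
}

/* Stubs with the same prototypes as the wrappers added by this patch. */
static u64 tdh_vp_create(u64 tdr, u64 tdvpr)
{
	(void)tdr; (void)tdvpr;
	log_call(CALL_VP_CREATE);
	return 0;
}

static u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx)
{
	(void)tdvpr; (void)tdcx;
	log_call(CALL_VP_ADDCX);
	return 0;
}

static u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx)
{
	(void)tdvpr; (void)initial_rcx;
	log_call(CALL_VP_INIT);
	return 0;
}

/* Hypothetical caller: create the TDVPR, add each TDCX page, then init. */
static u64 example_create_vcpu(u64 tdr, u64 tdvpr, const u64 *tdcx,
			       int nr_tdcx, u64 initial_rcx)
{
	u64 err;
	int i;

	err = tdh_vp_create(tdr, tdvpr);
	if (err)
		return err;

	for (i = 0; i < nr_tdcx; i++) {
		err = tdh_vp_addcx(tdvpr, tdcx[i]);
		if (err)
			return err;
	}

	return tdh_vp_init(tdvpr, initial_rcx);
}
```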

Export functions to allow KVM to make these SEAMCALLs. Export two
variants for TDH.VP.INIT, in order to support the planned logic of KVM
to support TDX modules with and without the ENUM_TOPOLOGY feature. If
KVM can drop support for the !ENUM_TOPOLOGY case, this could go down to a
single version. Leave that for later discussion.
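
As an illustration of how the two variants differ at the SEAMCALL
level: tdh_vp_init_apicid() selects version 1 of the leaf by placing
the version number above the leaf number (leaf in bits 15:0, version in
bits 23:16). The constants below are the ones from this patch;
seamcall_leaf() is a hypothetical helper used only for this sketch.

```c
#include <stdint.h>

#define TDH_VP_INIT		22
#define TDX_VERSION_SHIFT	16

/* Hypothetical helper: combine a leaf number with a version number. */
static inline uint64_t seamcall_leaf(uint64_t leaf, uint64_t version)
{
	return leaf | (version << TDX_VERSION_SHIFT);
}
```

Version 0 leaves the leaf number unchanged, so tdh_vp_init() passes
plain TDH_VP_INIT, while version 1 yields 0x10016.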

For SEAMCALLs that give a page to the TDX module to be encrypted, clflush
the page mapped with KeyID 0, such that any dirty cache lines don't write
back later and clobber TD memory or control structures. Don't worry about
the other MK-TME KeyIDs because the kernel doesn't use them. The TDX docs
specify that this flush is not needed unless the TDX module exposes the
CLFLUSH_BEFORE_ALLOC feature bit. Be conservative and always flush.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core;
 - Move TDH_xx macros from KVM to x86 core;
 - Re-write log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
   each input.

v16:
 - use struct tdx_module_args instead of struct tdx_module_output
 - Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
 arch/x86/include/asm/tdx.h  |  4 +++
 arch/x86/virt/vmx/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 12 +++++++++
 3 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9d19ca33e884..6951faa37031 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -124,10 +124,14 @@ void tdx_guest_keyid_free(unsigned int keyid);
 
 /* SEAMCALL wrappers for creating/destroying/running TDX guests */
 u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
+u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
 u64 tdh_mng_key_config(u64 tdr);
 u64 tdh_mng_create(u64 tdr, u64 hkid);
+u64 tdh_vp_create(u64 tdr, u64 tdvpr);
 u64 tdh_mng_key_freeid(u64 tdr);
 u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
+u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
+u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 16122fd552ff..b3003031e0fe 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1575,6 +1575,18 @@ u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_addcx);
 
+u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx)
+{
+	struct tdx_module_args args = {
+		.rcx = tdcx,
+		.rdx = tdvpr,
+	};
+
+	clflush_cache_range(__va(tdcx), PAGE_SIZE);
+	return seamcall(TDH_VP_ADDCX, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_addcx);
+
 u64 tdh_mng_key_config(u64 tdr)
 {
 	struct tdx_module_args args = {
@@ -1591,11 +1603,24 @@ u64 tdh_mng_create(u64 tdr, u64 hkid)
 		.rcx = tdr,
 		.rdx = hkid,
 	};
+
 	clflush_cache_range(__va(tdr), PAGE_SIZE);
 	return seamcall(TDH_MNG_CREATE, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_mng_create);
 
+u64 tdh_vp_create(u64 tdr, u64 tdvpr)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.rdx = tdr,
+	};
+
+	clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+	return seamcall(TDH_VP_CREATE, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_create);
+
 u64 tdh_mng_key_freeid(u64 tdr)
 {
 	struct tdx_module_args args = {
@@ -1621,3 +1646,27 @@ u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_mng_init);
+
+u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.rdx = initial_rcx,
+	};
+
+	return seamcall(TDH_VP_INIT, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_init);
+
+u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.rdx = initial_rcx,
+		.r8 = x2apicid,
+	};
+
+	/* apicid requires version == 1. */
+	return seamcall(TDH_VP_INIT | (1ULL << TDX_VERSION_SHIFT), &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_init_apicid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b9287304f372..64b6504791e1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -18,10 +18,13 @@
  * TDX module SEAMCALL leaf functions
  */
 #define TDH_MNG_ADDCX			1
+#define TDH_VP_ADDCX			4
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
+#define TDH_VP_CREATE			10
 #define TDH_MNG_KEY_FREEID		20
 #define TDH_MNG_INIT			21
+#define TDH_VP_INIT			22
 #define TDH_PHYMEM_PAGE_RDMD		24
 #define TDH_SYS_KEY_CONFIG		31
 #define TDH_SYS_INIT			33
@@ -30,6 +33,15 @@
 #define TDH_SYS_TDMR_INIT		36
 #define TDH_SYS_CONFIG			45
 
+
+/*
+ * SEAMCALL leaf:
+ *
+ * Bit 15:0	Leaf number
+ * Bit 23:16	Version number
+ */
+#define TDX_VERSION_SHIFT		16
+
 /* TDX page types */
 #define	PT_NDA		0x0
 #define	PT_RSVD		0x1
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (6 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 07/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-31  3:57   ` Yan Zhao
  2024-11-13  0:20   ` Dave Hansen
  2024-10-30 19:00 ` [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access Rick Edgecombe
                   ` (19 subsequent siblings)
  27 siblings, 2 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from a malicious host and certain physical
attacks. The TDX module uses pages provided by the host for both control
structures and for TD guest pages. These pages are encrypted using the
MK-TME encryption engine, with its special requirements around cache
invalidation. For its own security, the TDX module ensures pages are
flushed properly and tracks which usage they are currently assigned. For
creating and tearing down TD VMs and vCPUs KVM will need to use the
TDH.PHYMEM.PAGE.RECLAIM, TDH.PHYMEM.CACHE.WB, and TDH.PHYMEM.PAGE.WBINVD
SEAMCALLs.

Add tdh_phymem_page_reclaim() to enable KVM to call
TDH.PHYMEM.PAGE.RECLAIM to reclaim the page for use by the host kernel.
This effectively resets its state in the TDX module's page tracking
(PAMT), if the page is available to be reclaimed. This will be used by KVM
to reclaim the various types of pages owned by the TDX module. It will
have a small wrapper in KVM that retries in the case of a relevant error
code. Don't implement this wrapper in arch/x86 because KVM's solution
around retrying SEAMCALLs will be better located in a single place.

Add tdh_phymem_cache_wb() to enable KVM to call TDH.PHYMEM.CACHE.WB to do
a cache write back in a way that the TDX module can verify, before it
allows a KeyID to be freed. The KVM code will use this to have a small
wrapper that handles retries. Since the TDH.PHYMEM.CACHE.WB operation is
interruptible, have tdh_phymem_cache_wb() take a resume argument to pass
this info to the TDX module for restarts. It is worth noting that this
SEAMCALL uses a SEAM specific MSR to do the write back in sections. In
this way it does export some new functionality that affects CPU state.
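
A rough sketch of how a caller might drive the resume argument. This is
a userspace illustration under assumptions: tdh_phymem_cache_wb() is a
stub that pretends to be interrupted twice, example_cache_wb() is a
hypothetical caller, and the TDX_INTERRUPTED_RESUMABLE name and value
are assumed here rather than taken from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Assumed status for "interrupted, call again with resume set". */
#define TDX_INTERRUPTED_RESUMABLE	0x8000000300000000ULL

/* Stub state: number of interruptions to simulate, resumed calls seen. */
static int interruptions_left = 2;
static int resume_calls;

/* Stub with the same prototype as the wrapper added by this patch. */
static u64 tdh_phymem_cache_wb(bool resume)
{
	if (resume)
		resume_calls++;

	if (interruptions_left > 0) {
		interruptions_left--;
		return TDX_INTERRUPTED_RESUMABLE;
	}
	return 0;
}

/* Hypothetical caller: start fresh, then resume until no longer interrupted. */
static u64 example_cache_wb(void)
{
	bool resume = false;
	u64 err;

	do {
		err = tdh_phymem_cache_wb(resume);
		resume = true;
	} while (err == TDX_INTERRUPTED_RESUMABLE);

	return err;
}
```

The first call passes resume=false and every restart passes
resume=true. Any status other than the resumable one ends the loop and
is returned to the caller.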

Add tdh_phymem_page_wbinvd_tdr() to enable KVM to call
TDH.PHYMEM.PAGE.WBINVD to do a cache write back and invalidate of a TDR,
using the global KeyID. The underlying TDH.PHYMEM.PAGE.WBINVD SEAMCALL
requires the related KeyID to be encoded into the SEAMCALL args. Since the
global KeyID is not exposed to KVM, a dedicated wrapper is needed for TDR
focused TDH.PHYMEM.PAGE.WBINVD operations.
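
For illustration, the KeyID encoding amounts to placing the KeyID in
the physical address bits above the usable address width. The sketch
below mirrors the expression used by tdh_phymem_page_wbinvd_tdr() in
this patch; example_hpa_with_keyid() is a hypothetical helper and
phys_bits stands in for boot_cpu_data.x86_phys_bits.

```c
#include <stdint.h>

/* Hypothetical helper: fold a KeyID into the upper bits of a host PA. */
static inline uint64_t example_hpa_with_keyid(uint64_t hpa, uint32_t keyid,
					      unsigned int phys_bits)
{
	return hpa | ((uint64_t)keyid << phys_bits);
}
```

With an arbitrary example width of 46 bits, KeyID 1 and page 0x1000
give 0x400000001000, and KeyID 0 leaves the address unchanged.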

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core;
 - Move TDH_xx macros from KVM to x86 core;
 - Re-write log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
   each input.

v16:
 - use struct tdx_module_args instead of struct tdx_module_output
 - Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/virt/vmx/tdx/tdx.c | 44 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  4 +++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6951faa37031..0cf8975759de 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -132,6 +132,9 @@ u64 tdh_mng_key_freeid(u64 tdr);
 u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
 u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
 u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
+u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8);
+u64 tdh_phymem_cache_wb(bool resume);
+u64 tdh_phymem_page_wbinvd_tdr(u64 tdr);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b3003031e0fe..7e7c2e2360af 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1670,3 +1670,47 @@ u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid)
 	return seamcall(TDH_VP_INIT | (1ULL << TDX_VERSION_SHIFT), &args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_init_apicid);
+
+u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
+{
+	struct tdx_module_args args = {
+		.rcx = page,
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
+
+	/*
+	 * Additional error information:
+	 *
+	 *  - RCX: page type
+	 *  - RDX: owner
+	 *  - R8:  page size (4K, 2M or 1G)
+	 */
+	*rcx = args.rcx;
+	*rdx = args.rdx;
+	*r8 = args.r8;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
+
+u64 tdh_phymem_cache_wb(bool resume)
+{
+	struct tdx_module_args args = {
+		.rcx = resume ? 1 : 0,
+	};
+
+	return seamcall(TDH_PHYMEM_CACHE_WB, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
+
+u64 tdh_phymem_page_wbinvd_tdr(u64 tdr)
+{
+	struct tdx_module_args args = {};
+
+	args.rcx = tdr | ((u64)tdx_global_keyid << boot_cpu_data.x86_phys_bits);
+
+	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 64b6504791e1..191bdd1e571d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -26,14 +26,16 @@
 #define TDH_MNG_INIT			21
 #define TDH_VP_INIT			22
 #define TDH_PHYMEM_PAGE_RDMD		24
+#define TDH_PHYMEM_PAGE_RECLAIM		28
 #define TDH_SYS_KEY_CONFIG		31
 #define TDH_SYS_INIT			33
 #define TDH_SYS_RD			34
 #define TDH_SYS_LP_INIT			35
 #define TDH_SYS_TDMR_INIT		36
+#define TDH_PHYMEM_CACHE_WB		40
+#define TDH_PHYMEM_PAGE_WBINVD		41
 #define TDH_SYS_CONFIG			45
 
-
 /*
  * SEAMCALL leaf:
  *
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-10-30 19:00 ` [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management Rick Edgecombe
@ 2024-10-31  3:57   ` Yan Zhao
  2024-10-31 18:57     ` Edgecombe, Rick P
  2024-11-13  0:20   ` Dave Hansen
  1 sibling, 1 reply; 103+ messages in thread
From: Yan Zhao @ 2024-10-31  3:57 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: pbonzini, seanjc, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata,
	Sean Christopherson, Binbin Wu, Yuan Yao

On Wed, Oct 30, 2024 at 12:00:21PM -0700, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>

> Add tdh_phymem_page_reclaim() to enable KVM to call
> TDH.PHYMEM.PAGE.RECLAIM to reclaim the page for use by the host kernel.
> This effectively resets its state in the TDX module's page tracking
> (PAMT), if the page is available to be reclaimed. This will be used by KVM
> to reclaim the various types of pages owned by the TDX module. It will
> have a small wrapper in KVM that retries in the case of a relevant error
> code. Don't implement this wrapper in arch/x86 because KVM's solution
> around retrying SEAMCALLs will be better located in a single place.
With the current KVM code, it looks like KVM may not need the wrapper to retry
tdh_phymem_page_reclaim().

The logic of SEAMCALL TDH_PHYMEM_PAGE_RECLAIM is like this:

SEAMCALL TDH_PHYMEM_PAGE_RECLAIM:
1.pamt_walk
  case (a):if to reclaim TDR:
           get shared lock of 1gb and 2mb pamt entries of TDR page,
           get exclusive lock of 4k pamt entry of TDR page.
  case (b):if to reclaim non-TDR & non-TD pages,
           get shared lock of 1gb and 2mb pamt entries of the page to reclaim,
           get exclusive lock of 4k pamt entry of the page to reclaim.
  case (c):if to reclaim TD pages,
           get exclusive lock of 1gb or 2mb or 4k pamt entry of the page to
           reclaim, depending on the page size of page to reclaim,
           get shared lock of pamt entries above the page size.
2.check the exclusively locked pamt entry of page to reclaim (e.g. page type,
  alignment)
3.case (a):if to reclaim TDR, map and check TDR page
  case (b)(c):if to reclaim non-TDR pages or TD pages,
              get shared lock of 4k pamt entry of TDR page,
              map, check of TDR page, atomically update TDR child cnt.
4.set page type to NDA to the exclusively locked pamt entry of the page to
  reclaim.

In summary,

------------------------------------------------------------------------------
page to reclaim     |        locks
--------------------|---------------------------------------------------------
     TDR            | exclusive lock of 4k pamt entry of TDR page
--------------------|---------------------------------------------------------
non-TDR and non-TD  | shared lock of 4k pamt entry of TDR page
                    | exclusive lock of 4k pamt entry of page to reclaim
--------------------|---------------------------------------------------------
   TD page          | shared lock of 4k pamt entry of TDR page
                    | exclusive lock of pamt entry of size of page to reclaim
------------------------------------------------------------------------------

When TD is tearing down,
- TD pages are removed and freed when hkid is assigned, so
  tdh_phymem_page_reclaim() will not be called for them.
- after vt_vm_destroy() releasing the hkid, kvm_arch_destroy_vm() calls
  kvm_destroy_vcpus(), kvm_mmu_uninit_tdp_mmu() and tdx_vm_free() to reclaim
  TDCX/TDVPR/EPT/TDR pages sequentially in a single thread.

So, there should be no contentions expected for current KVM to call
tdh_phymem_page_reclaim().
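
If that analysis holds, the KVM side could shrink to a single call that
treats any error as unexpected. A userspace sketch under that
assumption: tdh_phymem_page_reclaim() is a stub whose return value is
set through stub_status, and example_reclaim_page() is a hypothetical
name, not the actual KVM function.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

/* Status the stub returns; 0 means success. */
static u64 stub_status;

/* Stub with the same prototype as the wrapper added by this patch. */
static u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
{
	(void)page;
	*rcx = *rdx = *r8 = 0;
	return stub_status;
}

/* Hypothetical caller: no retry loop, any error is reported and fails. */
static int example_reclaim_page(u64 pa)
{
	u64 err, rcx, rdx, r8;

	err = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
	if (err) {
		fprintf(stderr, "reclaim failed: %#llx (%#llx %#llx %#llx)\n",
			(unsigned long long)err, (unsigned long long)rcx,
			(unsigned long long)rdx, (unsigned long long)r8);
		return -EIO;
	}
	return 0;
}
```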

> +u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = page,
> +	};
> +	u64 ret;
> +
> +	ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
> +
> +	/*
> +	 * Additional error information:
> +	 *
> +	 *  - RCX: page type
> +	 *  - RDX: owner
> +	 *  - R8:  page size (4K, 2M or 1G)
> +	 */
> +	*rcx = args.rcx;
> +	*rdx = args.rdx;
> +	*r8 = args.r8;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-10-31  3:57   ` Yan Zhao
@ 2024-10-31 18:57     ` Edgecombe, Rick P
  2024-10-31 23:33       ` Huang, Kai
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-10-31 18:57 UTC (permalink / raw)
  To: Hansen, Dave, Zhao, Yan Y
  Cc: sean.j.christopherson@intel.com, seanjc@google.com, Huang, Kai,
	binbin.wu@linux.intel.com, Yao, Yuan, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku

On Thu, 2024-10-31 at 11:57 +0800, Yan Zhao wrote:
> When TD is tearing down,
> - TD pages are removed and freed when hkid is assigned, so
>   tdh_phymem_page_reclaim() will not be called for them.
> - after vt_vm_destroy() releasing the hkid, kvm_arch_destroy_vm() calls
>   kvm_destroy_vcpus(), kvm_mmu_uninit_tdp_mmu() and tdx_vm_free() to reclaim
>   TDCX/TDVPR/EPT/TDR pages sequentially in a single thread.
> 
> So, there should be no contentions expected for current KVM to call
> tdh_phymem_page_reclaim().

This links into the question of how much of the wrappers should be in KVM code
vs arch/x86. I got the impression Dave would like to not see SEAMCALLs just
getting wrapped on KVM's side with what it really needs. Towards that, it could
be tempting to move tdx_reclaim_page() (see "[PATCH v2 17/25] KVM: TDX:
create/destroy VM structure") into arch/x86 and have arch/x86 handle the
tdx_clear_page() part too. That would also be more symmetric with what arch/x86
already does for clflush on the calls that hand pages to the TDX module.

But the analysis of why we don't need to worry about TDX_OPERAND_BUSY is based
on KVM's current use of tdh_phymem_page_reclaim(). So KVM still has to be the
one to reason about TDX_OPERAND_BUSY, and the more we wrap the low level
SEAMCALLs, the more brittle and spread out the solution to dance around the TDX
module locks becomes.

I took a look at dropping the retry loop and moving tdx_reclaim_page() into
arch/x86 anyway:

 arch/x86/include/asm/tdx.h  |  3 +--
 arch/x86/kvm/vmx/tdx.c      | 74 ++++----------------------------------------------------------------------------------
 arch/x86/virt/vmx/tdx/tdx.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 55 insertions(+), 85 deletions(-)


diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 051465261155..790d6d99d895 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -145,13 +145,12 @@ u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
 u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data);
 u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);
 u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
-u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8);
+u64 tdx_reclaim_page(u64 pa, bool wbind);
 u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
 u64 tdh_mem_sept_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
 u64 tdh_mem_track(u64 tdr);
 u64 tdh_mem_range_unblock(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
 u64 tdh_phymem_cache_wb(bool resume);
-u64 tdh_phymem_page_wbinvd_tdr(u64 tdr);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid);
 #else
 static inline void tdx_init(void) { }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0ee8ec86d02a..aca73d942344 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -291,67 +291,6 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
        vcpu->cpu = -1;
 }
 
-static void tdx_clear_page(unsigned long page_pa)
-{
-       const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
-       void *page = __va(page_pa);
-       unsigned long i;
-
-       /*
-        * The page could have been poisoned.  MOVDIR64B also clears
-        * the poison bit so the kernel can safely use the page again.
-        */
-       for (i = 0; i < PAGE_SIZE; i += 64)
-               movdir64b(page + i, zero_page);
-       /*
-        * MOVDIR64B store uses WC buffer.  Prevent following memory reads
-        * from seeing potentially poisoned cache.
-        */
-       __mb();
-}
-
-/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(hpa_t pa)
-{
-       u64 err, rcx, rdx, r8;
-       int i;
-
-       for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
-               err = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
-
-               /*
-                * TDH.PHYMEM.PAGE.RECLAIM is allowed only when TD is shutdown.
-                * state.  i.e. destructing TD.
-                * TDH.PHYMEM.PAGE.RECLAIM requires TDR and target page.
-                * Because we're destructing TD, it's rare to contend with TDR.
-                */
-               switch (err) {
-               case TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX:
-               case TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR:
-                       cond_resched();
-                       continue;
-               default:
-                       goto out;
-               }
-       }
-
-out:
-       if (WARN_ON_ONCE(err)) {
-               pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8);
-               return -EIO;
-       }
-       return 0;
-}
-
-static int tdx_reclaim_page(hpa_t pa)
-{
-       int r;
-
-       r = __tdx_reclaim_page(pa);
-       if (!r)
-               tdx_clear_page(pa);
-       return r;
-}
 
 
 /*
@@ -365,7 +304,7 @@ static void tdx_reclaim_control_page(unsigned long ctrl_page_pa)
         * Leak the page if the kernel failed to reclaim the page.
         * The kernel cannot use it safely anymore.
         */
-       if (tdx_reclaim_page(ctrl_page_pa))
+       if (tdx_reclaim_page(ctrl_page_pa, false))
                return;
 
        free_page((unsigned long)__va(ctrl_page_pa));
@@ -581,20 +520,16 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
        if (!kvm_tdx->tdr_pa)
                return;
 
-       if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
-               return;
-
        /*
         * Use a SEAMCALL to ask the TDX module to flush the cache based on the
         * KeyID. TDX module may access TDR while operating on TD (Especially
         * when it is reclaiming TDCS).
         */
-       err = tdh_phymem_page_wbinvd_tdr(kvm_tdx->tdr_pa);
+       err = tdx_reclaim_page(kvm_tdx->tdr_pa, true);
        if (KVM_BUG_ON(err, kvm)) {
-               pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+               pr_tdx_error(tdx_reclaim_page, err);
                return;
        }
-       tdx_clear_page(kvm_tdx->tdr_pa);
 
        free_page((unsigned long)__va(kvm_tdx->tdr_pa));
        kvm_tdx->tdr_pa = 0;
@@ -1694,7 +1629,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
                pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
                return -EIO;
        }
-       tdx_clear_page(hpa);
        tdx_unpin(kvm, pfn);
        return 0;
 }
@@ -1805,7 +1739,7 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
         * The HKID assigned to this TD was already freed and cache was
         * already flushed. We don't have to flush again.
         */
-       return tdx_reclaim_page(__pa(private_spt));
+       return tdx_reclaim_page(__pa(private_spt), false);
 }
 
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bad83f6a3b0c..bb7cdb867581 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1892,7 +1892,7 @@ u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_init_apicid);
 
-u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
+static u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
 {
        struct tdx_module_args args = {
                .rcx = page,
@@ -1914,7 +1914,49 @@ u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
 
        return ret;
 }
-EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
+
+static void tdx_clear_page(unsigned long page_pa)
+{
+       const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+       void *page = __va(page_pa);
+       unsigned long i;
+
+       /*
+        * The page could have been poisoned.  MOVDIR64B also clears
+        * the poison bit so the kernel can safely use the page again.
+        */
+       for (i = 0; i < PAGE_SIZE; i += 64)
+               movdir64b(page + i, zero_page);
+       /*
+        * MOVDIR64B store uses WC buffer.  Prevent following memory reads
+        * from seeing potentially poisoned cache.
+        */
+       __mb();
+}
+
+/*
+ * tdx_reclaim_page() calls tdh_phymem_page_reclaim() internally. Callers should
+ * be prepared to handle TDX_OPERAND_BUSY.
+ * If return code is not an error, page has been cleared with MOVDIR64B.
+ */
+u64 tdx_reclaim_page(u64 pa, bool wbind_global_key)
+{
+       u64 rcx, rdx, r8;
+       u64 r;
+
+       r = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
+       if (r)
+               return r;
+
+       /* tdh_phymem_page_wbinvd_hkid() will do tdx_clear_page() */
+       if (wbind_global_key)
+               return tdh_phymem_page_wbinvd_hkid(pa, tdx_global_keyid);
+
+       tdx_clear_page(pa);
+
+       return r;
+}
+EXPORT_SYMBOL_GPL(tdx_reclaim_page);
 
 u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
 {
@@ -1987,22 +2029,17 @@ u64 tdh_phymem_cache_wb(bool resume)
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
 
-u64 tdh_phymem_page_wbinvd_tdr(u64 tdr)
-{
-       struct tdx_module_args args = {};
-
-       args.rcx = tdr | ((u64)tdx_global_keyid << boot_cpu_data.x86_phys_bits);
-
-       return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
-}
-EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
-
 u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid)
 {
        struct tdx_module_args args = {};
+       u64 err;
 
        args.rcx = hpa | (hkid << boot_cpu_data.x86_phys_bits);
 
-       return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+       err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+       if (!err)
+               tdx_clear_page(hpa);
+
+       return err;
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-10-31 18:57     ` Edgecombe, Rick P
@ 2024-10-31 23:33       ` Huang, Kai
  0 siblings, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2024-10-31 23:33 UTC (permalink / raw)
  To: Edgecombe, Rick P, Hansen, Dave, Zhao, Yan Y
  Cc: sean.j.christopherson@intel.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Yao, Yuan, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku



On 1/11/2024 7:57 am, Edgecombe, Rick P wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index bad83f6a3b0c..bb7cdb867581 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c

...

> +static void tdx_clear_page(unsigned long page_pa)
> +{
> +       const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> +       void *page = __va(page_pa);
> +       unsigned long i;
> +
> +       /*
> +        * The page could have been poisoned.  MOVDIR64B also clears
> +        * the poison bit so the kernel can safely use the page again.
> +        */
> +       for (i = 0; i < PAGE_SIZE; i += 64)
> +               movdir64b(page + i, zero_page);
> +       /*
> +        * MOVDIR64B store uses WC buffer.  Prevent following memory reads
> +        * from seeing potentially poisoned cache.
> +        */
> +       __mb();
> +}

Just FYI there's already one reset_tdx_pages() doing the same thing in 
x86 tdx.c:

/*
  * Convert TDX private pages back to normal by using MOVDIR64B to
  * clear these pages.  Note this function doesn't flush cache of
  * these TDX private pages.  The caller should make sure of that.
  */
static void reset_tdx_pages(unsigned long base, unsigned long size)
{
         const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
         unsigned long phys, end;

         end = base + size;
         for (phys = base; phys < end; phys += 64)
                 movdir64b(__va(phys), zero_page);

         /*
          * MOVDIR64B uses WC protocol.  Use memory barrier to
          * make sure any later user of these pages sees the
          * updated data.
          */
         mb();
}
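
To make the overlap concrete, here is a minimal user-space sketch of how a
single-page clear could be expressed through a range helper shaped like
reset_tdx_pages(). Everything here is a stand-in: movdir64b() is mocked with a
plain 64-byte memcpy(), zero_page replaces ZERO_PAGE(0), the helpers take
virtual addresses rather than physical ones, and the names reset_pages() and
clear_page_via_reset() are hypothetical. It does not model poison clearing or
the memory barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE	4096UL
#define CHUNK		64UL	/* MOVDIR64B moves 64 bytes at a time */

/* Stand-in for ZERO_PAGE(0); static storage is zero-initialized. */
static const unsigned char zero_page[PAGE_SIZE];

/* User-space stand-in for movdir64b(): a plain 64-byte copy. */
static void movdir64b(void *dst, const void *src)
{
	memcpy(dst, src, CHUNK);
}

/* Same loop shape as reset_tdx_pages(), but over a virtual range. */
static void reset_pages(void *base, size_t size)
{
	unsigned char *p = base;
	size_t off;

	assert(size % CHUNK == 0);
	for (off = 0; off < size; off += CHUNK)
		movdir64b(p + off, zero_page);
	/* The kernel version issues mb() here; omitted in this mock. */
}

/* Hypothetical: the single-page case as a thin wrapper over the range helper. */
static void clear_page_via_reset(void *page)
{
	reset_pages(page, PAGE_SIZE);
}
```

In this shape, the per-page helper in the draft above would reduce to one call
with a size of PAGE_SIZE, assuming the barrier and cache-flush expectations of
both callers really do match.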



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-10-30 19:00 ` [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management Rick Edgecombe
  2024-10-31  3:57   ` Yan Zhao
@ 2024-11-13  0:20   ` Dave Hansen
  2024-11-13 20:51     ` Edgecombe, Rick P
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-13  0:20 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata,
	Sean Christopherson, Binbin Wu, Yuan Yao

On 10/30/24 12:00, Rick Edgecombe wrote:
> +u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = page,
> +	};
> +	u64 ret;

This isn't quite what I'm looking for in these wrappers.

For instance:

> +	/*
> +	 * Additional error information:
> +	 *
> +	 *  - RCX: page type
> +	 *  - RDX: owner
> +	 *  - R8:  page size (4K, 2M or 1G)
> +	 */
> +	*rcx = args.rcx;
> +	*rdx = args.rdx;
> +	*r8 = args.r8;

If this were, instead:

u64 tdh_phymem_page_reclaim(u64 page, u64 *type, u64 *owner, u64 *size)
{
	...
	*type = args.rcx;
	*owner = args.rdx;
	*size = args.r8;

Then you wouldn't need the comment in the first place.  Then you could
also be thinking about adding _some_ kind of type safety to the
arguments.  The 'size' or the 'type' could totally be enums.

There's really zero value in having wrappers like these.  They don't
have any type safety or add any readability or make the seamcall easier
to use.  There's almost no value in having these versus just exporting
seamcall_ret() itself.
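
A minimal user-space sketch of the shape suggested above, so the naming can be
looked at in isolation. It is not the proposed kernel code: seamcall_ret() is
mocked, struct tdx_module_args is cut down to three fields, the enum
tdx_page_size and its encoding (0 = 4K, 1 = 2M, 2 = 1G) are assumptions, and
the values the mock returns are made up.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Stand-in for the kernel's struct tdx_module_args (only the fields used here). */
struct tdx_module_args {
	u64 rcx, rdx, r8;
};

#define TDH_PHYMEM_PAGE_RECLAIM	28

/* Hypothetical enum; the encoding is an assumption, not taken from the patch. */
enum tdx_page_size {
	TDX_PAGE_SIZE_4K = 0,
	TDX_PAGE_SIZE_2M = 1,
	TDX_PAGE_SIZE_1G = 2,
};

/*
 * Mock of seamcall_ret() so the sketch runs in user space: it always
 * succeeds and reports made-up values in the output registers.
 */
static u64 seamcall_ret(u64 fn, struct tdx_module_args *args)
{
	assert(fn == TDH_PHYMEM_PAGE_RECLAIM);
	args->rcx = 3;			/* made-up page type */
	args->rdx = 0x1000;		/* made-up owner */
	args->r8  = TDX_PAGE_SIZE_2M;	/* page size */
	return 0;
}

/* Out parameters are named after their meaning, not their register. */
static u64 tdh_phymem_page_reclaim(u64 page_pa, u64 *type, u64 *owner,
				   enum tdx_page_size *size)
{
	struct tdx_module_args args = { .rcx = page_pa };
	u64 ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);

	*type = args.rcx;
	*owner = args.rdx;
	*size = (enum tdx_page_size)args.r8;

	return ret;
}
```

With this signature the register-mapping comment is no longer needed at the
call site, at the cost of one cast inside the wrapper.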

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13  0:20   ` Dave Hansen
@ 2024-11-13 20:51     ` Edgecombe, Rick P
  2024-11-13 21:08       ` Dave Hansen
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 20:51 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On Tue, 2024-11-12 at 16:20 -0800, Dave Hansen wrote:
> If this were, instead:
> 
> u64 tdh_phymem_page_reclaim(u64 page, u64 *type, u64 *owner, u64 *size)
> {
> 	...
> 	*type = args.rcx;
> 	*owner = args.rdx;
> 	*size = args.r8;
> 
> Then you wouldn't need the comment in the first place.  Then you could
> also be thinking about adding _some_ kind of type safety to the
> arguments.  The 'size' or the 'type' could totally be enums.

Yes, *rcx and *rdx stand out.

> 
> There's really zero value in having wrappers like these.  They don't
> have any type safety or add any readability or make the seamcall easier
> to use.  There's almost no value in having these versus just exporting
> seamcall_ret() itself.

Hoping to solicit some more thoughts on the value question...

I thought the main thing was to not export *all* SEAMCALLs. Future TDX modules
could add new leafs that do who-knows-what.

For this SEAMCALL wrapper, the only use of the out args is printing them in an
error message (based on other logic). So turning them into enums would just add
a layer of translation to be decoded. A developer would have to translate them
back into the registers they came from to try to extract meaning from the TDX
docs.

However, some future user of TDH.PHYMEM.PAGE.RECLAIM might want to do something
else where the enums could add code clarity. But this goes down the road of
building things that are not needed today.

Is there value in maintaining a sensible looking API to be exported, even if it
is not needed today?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 20:51     ` Edgecombe, Rick P
@ 2024-11-13 21:08       ` Dave Hansen
  2024-11-13 21:25         ` Huang, Kai
  2024-11-13 21:44         ` Edgecombe, Rick P
  0 siblings, 2 replies; 103+ messages in thread
From: Dave Hansen @ 2024-11-13 21:08 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On 11/13/24 12:51, Edgecombe, Rick P wrote:
> However, some future user of TDH.PHYMEM.PAGE.RECLAIM might want to do something
> else where the enums could add code clarity. But this goes down the road of
> building things that are not needed today.

Here's why the current code is a bit suboptimal:

> +/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
> +static int __tdx_reclaim_page(hpa_t pa)
> +{
...
> +	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
> +		err = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
...
> +out:
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8);
> +		return -EIO;
> +	}
> +	return 0;
> +}

Let's say I see the error get spit out on the console.  I can't make any
sense out of it from this spot.  I need to go over to the TDX docs or
tdh_phymem_page_reclaim() to look at the *comment* to figure out what
these registers are named.

The code as proposed has zero self-documenting properties.  It's
actually completely non-self-documenting.  It isn't _any_ better for
readability than just doing:

	struct tdx_module_args args = {};

	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
		args.rcx = pa;
		err = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
		...
	}

	pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err,
			args.rcx, args.rdx, args.r8);

This also shows a lack of naming discipline.  The first argument is
'pa' in here but 'page' on the other side:

> +u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = page,

I can't tell you how many recompiles it's cost me when I got lazy about
physical addr vs. virtual addr vs. struct page vs. pfn.

So, yeah, I'd rather not export seamcall_ret(), but I'd rather do that
than have a layer of abstraction that's adding little value while it
also brings obfuscation.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 21:08       ` Dave Hansen
@ 2024-11-13 21:25         ` Huang, Kai
  2024-11-13 22:01           ` Edgecombe, Rick P
  2024-11-13 21:44         ` Edgecombe, Rick P
  1 sibling, 1 reply; 103+ messages in thread
From: Huang, Kai @ 2024-11-13 21:25 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, pbonzini@redhat.com,
	seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku


> 
> So, yeah, I'd rather not export seamcall_ret(), but I'd rather do that
> than have a layer of abstraction that's adding little value while it
> also brings obfuscation.

Just want to provide one more piece of information:

Peter posted a series to allow us to export one symbol _only_ for a 
particular module:

https://lore.kernel.org/lkml/20241111105430.575636482@infradead.org/

IIUC we can use that to only export __seamcall*() for KVM.

I am not sure whether this addresses the concern of "the exported symbol 
could be potentially abused by other modules like out-of-tree ones"?


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 21:25         ` Huang, Kai
@ 2024-11-13 22:01           ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 22:01 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com, Huang, Kai
  Cc: Yao, Yuan, binbin.wu@linux.intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	Chatre, Reinette, linux-kernel@vger.kernel.org, Yamahata, Isaku

On Thu, 2024-11-14 at 10:25 +1300, Huang, Kai wrote:
> > 
> > So, yeah, I'd rather not export seamcall_ret(), but I'd rather do that
> > than have a layer of abstraction that's adding little value while it
> > also brings obfuscation.
> 
> Just want to provide one more information:
> 
> Peter posted a series to allow us to export one symbol _only_ for a 
> particular module:
> 
> https://lore.kernel.org/lkml/20241111105430.575636482@infradead.org/
> 
> IIUC we can use that to only export __seamcall*() for KVM.
> 
> I am not sure whether this addresses the concern of "the exported symbol 
> could be potentially abused by other modules like out-of-tree ones"?

I think so. It's too bad it's an RFC v1. But maybe we could point to it for the
future, if we move the wrappers back into KVM.

The other small thing the export does is move the code generation that KVM
dislikes into arch/x86. This is a silly non-technical reason though.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 21:08       ` Dave Hansen
  2024-11-13 21:25         ` Huang, Kai
@ 2024-11-13 21:44         ` Edgecombe, Rick P
  2024-11-13 21:50           ` Dave Hansen
  1 sibling, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 21:44 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: Yao, Yuan, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, sean.j.christopherson@intel.com,
	kvm@vger.kernel.org, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y

On Wed, 2024-11-13 at 13:08 -0800, Dave Hansen wrote:
> Let's say I see the error get spit out on the console.  I can't make any
> sense out of it from this spot.  I need to go over to the TDX docs or
> tdh_phymem_page_reclaim() to look at the *comment* to figure out what
> these the registers are named.
> 
> The code as proposed has zero self-documenting properties.  It's
> actually completely non-self-documenting.  It isn't _any_ better for
> readability than just doing:
> 
> 	struct tdx_module_args args = {};
> 
> 	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
> 		args.rcx = pa;
> 		err = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
> 		...
> 	}
> 
> 	pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err,
> 			args.rcx, args.rdx, args.r8);

If we extracted meaning from the registers and printed those, then we would not
have any new bits that popped up in there. For example, currently r8 has bits
63:3 described as reserved. While expectations around TDX module behavior
changes are still settling, I'd rather have the full register for debugging than
an easy to read error message. But we have actually gone down this road a little
bit already when we adjusted the KVM calling code to stop manually loading the
struct tdx_module_args.
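
One middle ground between the two positions can be sketched as follows: keep
the full register in the message and put a decoded hint next to it. This is a
user-space illustration only. format_reclaim_error() and size_name() are
hypothetical names, and the assumption that the page size sits in R8 bits 2:0
with 0 = 4K, 1 = 2M, 2 = 1G follows from the "bits 63:3 reserved" remark above
rather than from anything in the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: the page size lives in R8 bits 2:0, bits 63:3 are reserved. */
#define R8_SIZE_MASK	0x7ULL

static const char *size_name(uint64_t r8)
{
	switch (r8 & R8_SIZE_MASK) {
	case 0: return "4K";
	case 1: return "2M";
	case 2: return "1G";
	default: return "unknown";
	}
}

/* Format the raw registers first, then the decoded size next to R8. */
static int format_reclaim_error(char *buf, size_t len, uint64_t err,
				uint64_t rcx, uint64_t rdx, uint64_t r8)
{
	assert(buf && len);
	return snprintf(buf, len,
			"err 0x%" PRIx64 " rcx 0x%" PRIx64 " rdx 0x%" PRIx64
			" r8 0x%" PRIx64 " (size %s)",
			err, rcx, rdx, r8, size_name(r8));
}
```

If a future module sets a currently reserved bit, for example r8 = 0x11, the
raw value still shows it while the decoded part keeps reading "2M".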

> 
> Also, this is also showing a lack of naming discipline where things are
> named.  The first argument is 'pa' in here but 'page' on the other side:
> 
> > +u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = page,
> 
> I can't tell you how many recompiles it's cost me when I got lazy about
> physical addr vs. virtual addr vs. struct page vs. pfn.

Standardizing on VAs for the SEAMCALL wrappers seems like a good idea. I haven't
checked them all, but it seems promising so far.

> 
> So, yeah, I'd rather not export seamcall_ret(), but I'd rather do that
> than have a layer of abstraction that's adding little value while it
> also brings obfuscation.

In KVM these types can get even more confusing. There are guest physical address
and virtual addresses as well as the host physical and virtual. So in KVM there
is a typedef for host physical addresses: hpa_t. Previously these wrappers used
it because they are in KVM code. It was:
+static inline u64 tdh_phymem_page_reclaim(hpa_t page, u64 *rcx, u64 *rdx,
+					  u64 *r8)
+{
+	struct tdx_module_args in = {
+		.rcx = page,
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &in);
+
+	*rcx = in.rcx;
+	*rdx = in.rdx;
+	*r8 = in.r8;
+
+	return ret;
+}

Moving them to arch/x86 means we need to translate some things between KVM's
parlance and the rest of the kernel's. This is extra wrapping. Another example
that was used in the old SEAMCALL wrappers was gpa_t, which KVM uses to refer
to a guest physical address. void * to the host direct map doesn't fit, so we
are back to u64 or a new gpa struct (like in the other thread) to speak to the
arch/x86 layers.

So I think we will need some light layers of abstraction if we keep the wrappers
in arch/x86.
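
As a rough illustration of what such a light layer might look like, here is a
user-space sketch of struct-wrapped address types. The type names tdx_gpa_t
and tdx_hpa_t and both helpers are hypothetical and exist nowhere in the
kernel; the point is only that a wrapper taking a struct rejects a bare u64 or
the wrong kind of address at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper types; neither name exists in the kernel today. */
typedef struct { uint64_t val; } tdx_gpa_t;	/* guest physical address */
typedef struct { uint64_t val; } tdx_hpa_t;	/* host physical address */

#define PAGE_SHIFT	12

/* Build a host physical address from a page frame number. */
static inline tdx_hpa_t pfn_to_tdx_hpa(uint64_t pfn)
{
	return (tdx_hpa_t){ .val = pfn << PAGE_SHIFT };
}

/* Mock wrapper: only returns the value that would be loaded into RCX. */
static inline uint64_t mock_reclaim_rcx(tdx_hpa_t page)
{
	/* The address is expected to be page aligned. */
	assert((page.val & ((1ULL << PAGE_SHIFT) - 1)) == 0);
	return page.val;
}
```

Passing a tdx_gpa_t, a pfn, or a plain integer to mock_reclaim_rcx() fails to
compile, which is the kind of mix-up described earlier in the thread.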



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 21:44         ` Edgecombe, Rick P
@ 2024-11-13 21:50           ` Dave Hansen
  2024-11-13 22:00             ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-13 21:50 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: Yao, Yuan, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, sean.j.christopherson@intel.com,
	kvm@vger.kernel.org, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y

On 11/13/24 13:44, Edgecombe, Rick P wrote:
> Moving them to arch/x86 means we need to translate some things between KVM's
> parlance and the rest of the kernels. This is extra wrapping. Another example
> that was used in the old SEAMCALL wrappers was gpa_t, which KVM uses to refers
> to a guest physical address. void * to the host direct map doesn't fit, so we
> are back to u64 or a new gpa struct (like in the other thread) to speak to the
> arch/x86 layers.

I have zero issues with non-core x86 code doing a #include
<linux/kvm_types.h>.  Why not just use the KVM types?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 21:50           ` Dave Hansen
@ 2024-11-13 22:00             ` Edgecombe, Rick P
  2024-11-14  0:21               ` Huang, Kai
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 22:00 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: Yao, Yuan, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	tony.lindgren@linux.intel.com, sean.j.christopherson@intel.com,
	kvm@vger.kernel.org, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y

On Wed, 2024-11-13 at 13:50 -0800, Dave Hansen wrote:
> On 11/13/24 13:44, Edgecombe, Rick P wrote:
> > Moving them to arch/x86 means we need to translate some things between KVM's
> > parlance and the rest of the kernels. This is extra wrapping. Another example
> > that was used in the old SEAMCALL wrappers was gpa_t, which KVM uses to refers
> > to a guest physical address. void * to the host direct map doesn't fit, so we
> > are back to u64 or a new gpa struct (like in the other thread) to speak to the
> > arch/x86 layers.
> 
> I have zero issues with non-core x86 code doing a #include
> <linux/kvm_types.h>.  Why not just use the KVM types?

You know...I assumed it wouldn't work because of some internal headers. But yea.
Nevermind, we can just do that. Probably because the old code also referred to
struct kvm_tdx, it just got fully separated. Kai did you attempt this path at
all?

I think, hand-waving in a general way, having the SEAMCALL wrappers in KVM code
will result in at least more marshaling of structs members into function args.
But I can't point to any specific problem in our current SEAMCALLs.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-13 22:00             ` Edgecombe, Rick P
@ 2024-11-14  0:21               ` Huang, Kai
  2024-11-14  0:32                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Huang, Kai @ 2024-11-14  0:21 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, Hansen, Dave,
	seanjc@google.com
  Cc: Yao, Yuan, binbin.wu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	tony.lindgren@linux.intel.com, sean.j.christopherson@intel.com,
	kvm@vger.kernel.org, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y



On 14/11/2024 11:00 am, Edgecombe, Rick P wrote:
> On Wed, 2024-11-13 at 13:50 -0800, Dave Hansen wrote:
>> On 11/13/24 13:44, Edgecombe, Rick P wrote:
>>> Moving them to arch/x86 means we need to translate some things between KVM's
>>> parlance and the rest of the kernels. This is extra wrapping. Another example
>>> that was used in the old SEAMCALL wrappers was gpa_t, which KVM uses to refers
>>> to a guest physical address. void * to the host direct map doesn't fit, so we
>>> are back to u64 or a new gpa struct (like in the other thread) to speak to the
>>> arch/x86 layers.
>>
>> I have zero issues with non-core x86 code doing a #include
>> <linux/kvm_types.h>.  Why not just use the KVM types?
> 
> You know...I assumed it wouldn't work because of some internal headers. But yea.
> Nevermind, we can just do that. Probably because the old code also referred to
> struct kvm_tdx, it just got fully separated. Kai did you attempt this path at
> all?

'struct kvm_tdx' is a KVM internal structure so we cannot use that in
SEAMCALL wrappers in the x86 core.  If you are talking about just using
KVM types like 'gfn_t/hpa_t' etc (by including <linux/kvm_types.h>),
perhaps this is fine.

But I didn't try doing it this way.  We can try if that's better, but I
suppose we should get Sean/Paolo's feedback here?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management
  2024-11-14  0:21               ` Huang, Kai
@ 2024-11-14  0:32                 ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-14  0:32 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com, Huang, Kai
  Cc: Yao, Yuan, binbin.wu@linux.intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, sean.j.christopherson@intel.com,
	kvm@vger.kernel.org, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y

On Thu, 2024-11-14 at 13:21 +1300, Huang, Kai wrote:
> 
> On 14/11/2024 11:00 am, Edgecombe, Rick P wrote:
> > On Wed, 2024-11-13 at 13:50 -0800, Dave Hansen wrote:
> > > On 11/13/24 13:44, Edgecombe, Rick P wrote:
> > > > Moving them to arch/x86 means we need to translate some things between KVM's
> > > > parlance and the rest of the kernels. This is extra wrapping. Another example
> > > > that was used in the old SEAMCALL wrappers was gpa_t, which KVM uses to refers
> > > > to a guest physical address. void * to the host direct map doesn't fit, so we
> > > > are back to u64 or a new gpa struct (like in the other thread) to speak to the
> > > > arch/x86 layers.
> > > 
> > > I have zero issues with non-core x86 code doing a #include
> > > <linux/kvm_types.h>.  Why not just use the KVM types?
> > 
> > You know...I assumed it wouldn't work because of some internal headers. But yea.
> > Nevermind, we can just do that. Probably because the old code also referred to
> > struct kvm_tdx, it just got fully separated. Kai did you attempt this path at
> > all?
> 
> 'struct kvm_tdx' is a KVM internal structure so we cannot use that in 
> SEAMCALL wrappers in the x86 core.
> 
Yea, makes sense.

>   If you are talking about just use 
> KVM types like 'gfn_t/hpa_t' etc (by including <linux/kvm_types.h>) 
> perhaps this is fine.
> 
> But I didn't try to do in this way.  We can try if that's better, but I 
> suppose we should get Sean/Paolo's feedback here?

There are certainly a lot of style considerations here. I'm thinking of posting
something like an RFC, as a fork to look at Dave's suggestions.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (7 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2025-01-05  9:45   ` Francesco Lavra
  2024-10-30 19:00 ` [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations Rick Edgecombe
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from a malicious host and certain physical
attacks. The TDX module has TD scoped and vCPU scoped "metadata fields".
These fields are a bit like VMCS fields, and are stored in data structures
maintained by the TDX module. Export 3 SEAMCALLs for use in reading and
writing these fields:

Make tdh_mng_rd() use MNG.VP.RD to read the TD scoped metadata.

Make tdh_vp_rd()/tdh_vp_wr() use TDH.VP.RD/WR to read/write the vCPU
scoped metadata.

KVM will use these by creating inline helpers that target various metadata
sizes. Export the raw SEAMCALL leaf, to avoid exporting the large number
of various sized helpers.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core;
 - Move TDH_xx macros from KVM to x86 core;
 - Re-write log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
  each input.

---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/virt/vmx/tdx/tdx.c | 47 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  3 +++
 3 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0cf8975759de..a70933ec7808 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -128,9 +128,12 @@ u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
 u64 tdh_mng_key_config(u64 tdr);
 u64 tdh_mng_create(u64 tdr, u64 hkid);
 u64 tdh_vp_create(u64 tdr, u64 tdvpr);
+u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data);
 u64 tdh_mng_key_freeid(u64 tdr);
 u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
 u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
+u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data);
+u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);
 u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
 u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8);
 u64 tdh_phymem_cache_wb(bool resume);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7e7c2e2360af..82820422d698 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1621,6 +1621,23 @@ u64 tdh_vp_create(u64 tdr, u64 tdvpr)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_create);
 
+u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+		.rdx = field,
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_MNG_RD, &args);
+
+	/* R8: Content of the field, or 0 in case of error. */
+	*data = args.r8;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mng_rd);
+
 u64 tdh_mng_key_freeid(u64 tdr)
 {
 	struct tdx_module_args args = {
@@ -1658,6 +1675,36 @@ u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_init);
 
+u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.rdx = field,
+	};
+	u64 ret;
+
+	ret = seamcall_ret(TDH_VP_RD, &args);
+
+	/* R8: Content of the field, or 0 in case of error. */
+	*data = args.r8;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_vp_rd);
+
+u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.rdx = field,
+		.r8 = data,
+		.r9 = mask,
+	};
+
+	return seamcall(TDH_VP_WR, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_wr);
+
 u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 191bdd1e571d..1915a558c126 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -22,10 +22,12 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_VP_CREATE			10
+#define TDH_MNG_RD			11
 #define TDH_MNG_KEY_FREEID		20
 #define TDH_MNG_INIT			21
 #define TDH_VP_INIT			22
 #define TDH_PHYMEM_PAGE_RDMD		24
+#define TDH_VP_RD			26
 #define TDH_PHYMEM_PAGE_RECLAIM		28
 #define TDH_SYS_KEY_CONFIG		31
 #define TDH_SYS_INIT			33
@@ -33,6 +35,7 @@
 #define TDH_SYS_LP_INIT			35
 #define TDH_SYS_TDMR_INIT		36
 #define TDH_PHYMEM_CACHE_WB		40
+#define TDH_VP_WR			43
 #define TDH_PHYMEM_PAGE_WBINVD		41
 #define TDH_SYS_CONFIG			45
 
-- 
2.47.0



* Re: [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
  2024-10-30 19:00 ` [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access Rick Edgecombe
@ 2025-01-05  9:45   ` Francesco Lavra
  2025-01-06 18:59     ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Francesco Lavra @ 2025-01-05  9:45 UTC (permalink / raw)
  To: rick.p.edgecombe
  Cc: binbin.wu, isaku.yamahata, isaku.yamahata, kai.huang, kvm,
	linux-kernel, pbonzini, reinette.chatre, sean.j.christopherson,
	seanjc, tony.lindgren, xiaoyao.li, yan.y.zhao, yuan.yao

On 2024-10-30 at 19:00, Rick Edgecombe wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Intel TDX protects guest VMs from malicious host and certain physical
> attacks. The TDX module has TD scoped and vCPU scoped "metadata fields".
> These fields are a bit like VMCS fields, and stored in data structures
> maintained by the TDX module. Export 3 SEAMCALLs for use in reading and
> writing these fields:
> 
> Make tdh_mng_rd() use MNG.VP.RD to read the TD scoped metadata.

s/MNG.VP.RD/TDH.MNG.RD/


* Re: [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access
  2025-01-05  9:45   ` Francesco Lavra
@ 2025-01-06 18:59     ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-06 18:59 UTC (permalink / raw)
  To: francescolavra.fl@gmail.com
  Cc: Li, Xiaoyao, seanjc@google.com, Huang, Kai,
	binbin.wu@linux.intel.com, yuan.yao@intel.com, Chatre, Reinette,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Yamahata, Isaku,
	sean.j.christopherson@intel.com, Zhao, Yan Y

On Sun, 2025-01-05 at 10:45 +0100, Francesco Lavra wrote:
> > Make tdh_mng_rd() use MNG.VP.RD to read the TD scoped metadata.
> 
> s/MNG.VP.RD/TDH.MNG.RD/

Oops, yep. Thanks.


* [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (8 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-11-13  1:11   ` Dave Hansen
  2024-10-30 19:00 ` [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures Rick Edgecombe
                   ` (17 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Intel TDX protects guest VMs from a malicious host and certain physical
attacks. The TDX module has the concept of flushing vCPUs. These flushes
include both a flush of the translation caches and also any other state
internal to the TDX module. Before freeing a KeyID, this flush operation
needs to be done. KVM will need to perform the flush on each pCPU
associated with the TD, and also perform a TD scoped operation that checks
if the flush has been done on all vCPUs associated with the TD.

Add a tdh_vp_flush() function to be used to call TDH.VP.FLUSH on each pCPU
associated with the TD during TD teardown. It will also be called when
disabling TDX and during vCPU migration between pCPUs.

Add tdh_mng_vpflushdone() to be used by KVM to call TDH.MNG.VPFLUSHDONE.
KVM will use this during TD teardown to verify that TDH.VP.FLUSH has been
called sufficiently, and advance the state machine that will allow for
reclaiming the TD's KeyID.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Change to use 'u64' as function parameter to prepare to move
   SEAMCALL wrappers to arch/x86. (Kai)
 - Split to separate patch
 - Move SEAMCALL wrappers from KVM to x86 core
 - Move TDH_xx macros from KVM to x86 core
 - Rewrite log

uAPI breakout v1:
 - Make argument to C wrapper function struct kvm_tdx * or
   struct vcpu_tdx * .(Sean)
 - Drop unused helpers (Kai)
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - Update the commit message to match the patch by Yuan
 - Use seamcall() and seamcall_ret() by Paolo

v18:
 - removed stub functions for __seamcall{,_ret}()
 - Added Reviewed-by Binbin
 - Make tdx_seamcall() use struct tdx_module_args instead of taking
  each input.

---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a70933ec7808..d093dc4350ac 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -129,6 +129,8 @@ u64 tdh_mng_key_config(u64 tdr);
 u64 tdh_mng_create(u64 tdr, u64 hkid);
 u64 tdh_vp_create(u64 tdr, u64 tdvpr);
 u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data);
+u64 tdh_vp_flush(u64 tdvpr);
+u64 tdh_mng_vpflushdone(u64 tdr);
 u64 tdh_mng_key_freeid(u64 tdr);
 u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
 u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 82820422d698..af121a73de80 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1638,6 +1638,26 @@ u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+u64 tdh_vp_flush(u64 tdvpr)
+{
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+	};
+
+	return seamcall(TDH_VP_FLUSH, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_flush);
+
+u64 tdh_mng_vpflushdone(u64 tdr)
+{
+	struct tdx_module_args args = {
+		.rcx = tdr,
+	};
+
+	return seamcall(TDH_MNG_VPFLUSHDONE, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mng_vpflushdone);
+
 u64 tdh_mng_key_freeid(u64 tdr)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 1915a558c126..a63037036c91 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -23,6 +23,8 @@
 #define TDH_MNG_CREATE			9
 #define TDH_VP_CREATE			10
 #define TDH_MNG_RD			11
+#define TDH_VP_FLUSH			18
+#define TDH_MNG_VPFLUSHDONE		19
 #define TDH_MNG_KEY_FREEID		20
 #define TDH_MNG_INIT			21
 #define TDH_VP_INIT			22
-- 
2.47.0



* Re: [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  2024-10-30 19:00 ` [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations Rick Edgecombe
@ 2024-11-13  1:11   ` Dave Hansen
  2024-11-13 21:18     ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-13  1:11 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata,
	Sean Christopherson, Binbin Wu, Yuan Yao

On 10/30/24 12:00, Rick Edgecombe wrote:
> +u64 tdh_vp_flush(u64 tdvpr)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = tdvpr,
> +	};
> +
> +	return seamcall(TDH_VP_FLUSH, &args);
> +}
> +EXPORT_SYMBOL_GPL(tdh_vp_flush);

This also just isn't looking right.  The 'tdvpr' is a _thing_.  It has a
type and it came back from some _other_ bit of the same type.

So, in the worst case, this could be:

struct tdvpr {
	u64 tdvpr_paddr;
};

u64 tdh_vp_flush(struct tdvpr *tdvpr)
{
	...

But just passing around physical addresses and then having these things
stick them right into seamcall() doesn't seem like the best we can do.
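
For illustration only, here is a minimal user-space sketch of the
wrapper-type idea above. The seamcall() mock, the last_fn/last_rcx
globals and 'struct tdr' are assumptions made up for this example; only
'struct tdvpr' and the tdh_vp_flush() signature come from the suggestion
itself.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical stand-in for the kernel's SEAMCALL argument block. */
struct tdx_module_args {
	u64 rcx;
};

#define TDH_VP_FLUSH		18
#define TDH_MNG_VPFLUSHDONE	19

/* Recorded by the mock so a caller can inspect what was "issued". */
static u64 last_fn, last_rcx;

/* Mock: records the leaf and RCX instead of entering the TDX module. */
static u64 seamcall(u64 fn, struct tdx_module_args *args)
{
	last_fn = fn;
	last_rcx = args->rcx;
	return 0;
}

/* Same layout, but distinct types, so they are not interchangeable. */
struct tdr {
	u64 tdr_paddr;
};

struct tdvpr {
	u64 tdvpr_paddr;
};

static u64 tdh_vp_flush(struct tdvpr *tdvpr)
{
	struct tdx_module_args args = {
		.rcx = tdvpr->tdvpr_paddr,
	};

	return seamcall(TDH_VP_FLUSH, &args);
}

static u64 tdh_mng_vpflushdone(struct tdr *tdr)
{
	struct tdx_module_args args = {
		.rcx = tdr->tdr_paddr,
	};

	return seamcall(TDH_MNG_VPFLUSHDONE, &args);
}
```

With this shape, handing a 'struct tdr *' to tdh_vp_flush() draws an
incompatible-pointer-type diagnostic from the compiler, whereas two bare
u64 values are silently accepted.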


* Re: [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  2024-11-13  1:11   ` Dave Hansen
@ 2024-11-13 21:18     ` Edgecombe, Rick P
  2024-11-13 21:41       ` Dave Hansen
  0 siblings, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 21:18 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On Tue, 2024-11-12 at 17:11 -0800, Dave Hansen wrote:
> On 10/30/24 12:00, Rick Edgecombe wrote:
> > +u64 tdh_vp_flush(u64 tdvpr)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = tdvpr,
> > +	};
> > +
> > +	return seamcall(TDH_VP_FLUSH, &args);
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_vp_flush);
> 
> This also just isn't looking right.  The 'tdvpr' is a _thing_.  It has a
> type and it came back from some _other_ bit of the same type.
> 
> So, in the worst case, this could be:
> 
> struct tdvpr {
> 	u64 tdvpr_paddr;
> };
> 
> u64 tdh_vp_flush(struct tdvpr *tdvpr)
> {
> 	...
> 
> But just passing around physical addresses and then having these things
> stick them right into seamcall() doesn't seem like the best we can do.

Earlier you mentioned passing pointers instead of PAs. Could we have something
like the below? It turns out the KVM code has to go through extra steps to
translate between PA and VA on its side. So if we keep it a VA in KVM and let
the SEAMCALL wrappers translate to PA, it actually simplifies the KVM code.

Or keep the VA in the struct.

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 01409a59224d..1f48813ade33 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -137,7 +137,7 @@ u64 tdh_vp_create(u64 tdr, u64 tdvpr);
 u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data);
 u64 tdh_mr_extend(u64 tdr, u64 gpa, u64 *rcx, u64 *rdx);
 u64 tdh_mr_finalize(u64 tdr);
-u64 tdh_vp_flush(u64 tdvpr);
+u64 tdh_vp_flush(void *tdvpr);
 u64 tdh_mng_vpflushdone(u64 tdr);
 u64 tdh_mng_key_freeid(u64 tdr);
 u64 tdh_mng_init(u64 tdr, u64 td_params, u64 *rcx);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2a8997eb1ef1..d456e0b0b90c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1785,10 +1785,10 @@ u64 tdh_mr_finalize(u64 tdr)
 }
 EXPORT_SYMBOL_GPL(tdh_mr_finalize);
 
-u64 tdh_vp_flush(u64 tdvpr)
+u64 tdh_vp_flush(void *tdvpr)
 {
        struct tdx_module_args args = {
-               .rcx = tdvpr,
+               .rcx = __pa(tdvpr),
        };
 
        return seamcall(TDH_VP_FLUSH, &args);



* Re: [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  2024-11-13 21:18     ` Edgecombe, Rick P
@ 2024-11-13 21:41       ` Dave Hansen
  2024-11-13 21:48         ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Hansen @ 2024-11-13 21:41 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: sean.j.christopherson@intel.com, Yao, Yuan, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On 11/13/24 13:18, Edgecombe, Rick P wrote:
> -u64 tdh_vp_flush(u64 tdvpr)
> +u64 tdh_vp_flush(void *tdvpr)
>  {
>         struct tdx_module_args args = {
> -               .rcx = tdvpr,
> +               .rcx = __pa(tdvpr),
>         };
> 
>         return seamcall(TDH_VP_FLUSH, &args);

I'd much rather these be:

	tdx->tdvpr_page = alloc_page(GFP_KERNEL_ACCOUNT);

and then you pass around the struct page and do:

	.rcx = page_to_phys(tdvpr)

Because it's honestly _not_ an address.  It really and truly is a page
and you never need to dereference it, only pass it around as a handle.
You could get fancy and make a typedef for it or something, or even

struct tdvpr_struct {
	struct page *page;
};

But that's probably overkill.  It would help to, for instance, avoid
mixing up these two pages:

+u64 tdh_vp_create(u64 tdr, u64 tdvpr);

But it wouldn't help as much for these:

+u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
+u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
+u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
+u64 tdh_vp_flush(u64 tdvpr);
+u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data);
+u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);

Except for (for instance) 'tdr' vs. 'tdvpr' confusion.  Spot the bug:

	tdh_vp_flush(kvm_tdx(foo)->tdr_pa);
	tdh_vp_flush(kvm_tdx(foo)->tdrvp_pa);

Do you want the compiler's help for those?
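
A rough user-space sketch of the struct page handle approach, under
stated assumptions: 'struct page', mem_map, page_to_phys() and
seamcall() below are simplified mocks written for this example, not the
kernel's implementations, and last_rcx exists only so the result can be
inspected.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

#define PAGE_SHIFT	12
#define TDH_VP_FLUSH	18

/* Mock page metadata: one entry per page frame, indexed by PFN. */
struct page {
	int unused;
};

static struct page mem_map[16];

/* Mock of page_to_phys(): PFN is the index into mem_map. */
static u64 page_to_phys(struct page *page)
{
	return (u64)(page - mem_map) << PAGE_SHIFT;
}

/* Hypothetical stand-in for the kernel's SEAMCALL argument block. */
struct tdx_module_args {
	u64 rcx;
};

static u64 last_rcx;

/* Mock: records RCX instead of entering the TDX module. */
static u64 seamcall(u64 fn, struct tdx_module_args *args)
{
	(void)fn;
	last_rcx = args->rcx;
	return 0;
}

/*
 * The caller holds the TDVPR page purely as a handle and never
 * dereferences it; the physical address is derived only here.
 */
static u64 tdh_vp_flush(struct page *tdvpr)
{
	struct tdx_module_args args = {
		.rcx = page_to_phys(tdvpr),
	};

	return seamcall(TDH_VP_FLUSH, &args);
}
```

Note that a bare 'struct page *' still cannot tell a TDR page from a
TDVPR page; catching that mix-up needs a dedicated wrapper type such as
the tdvpr_struct shown above.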


* Re: [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations
  2024-11-13 21:41       ` Dave Hansen
@ 2024-11-13 21:48         ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-13 21:48 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com
  Cc: Yao, Yuan, Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org, Zhao, Yan Y,
	Chatre, Reinette, Yamahata, Isaku

On Wed, 2024-11-13 at 13:41 -0800, Dave Hansen wrote:
> I'd much rather these be:
> 
> 	tdx->tdvpr_page = alloc_page(GFP_KERNEL_ACCOUNT);
> 
> and then you pass around the struct page and do:
> 
> 	.rcx = page_to_phys(tdvpr)
> 
> Because it's honestly _not_ an address.  It really and truly is a page
> and you never need to dereference it, only pass it around as a handle.

That is a really good point.

> You could get fancy and make a typedef for it or something, or even
> 
> struct tdvpr_struct {
> 	struct page *page;
> }
> 
> But that's probably overkill.  It would help to, for instance, avoid
> mixing up these two pages:
> 
> +u64 tdh_vp_create(u64 tdr, u64 tdvpr);
> 
> But it wouldn't help as much for these:
> 
> +u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
> +u64 tdh_vp_init(u64 tdvpr, u64 initial_rcx);
> +u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
> +u64 tdh_vp_flush(u64 tdvpr);
> +u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data);
> +u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);
> 
> Except for (for instance) 'tdr' vs. 'tdvpr' confusion.  Spot the bug:
> 
> 	tdh_vp_flush(kvm_tdx(foo)->tdr_pa);
> 	tdh_vp_flush(kvm_tdx(foo)->tdrvp_pa);
> 
> Do you want the compiler's help for those?

Haha, we have already had bugs around these names actually. If we end up with
the current arch/x86 based approach we can see if we can fit it in.


* [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (9 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2025-01-05 10:58   ` Francesco Lavra
  2024-10-30 19:00 ` [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions Rick Edgecombe
                   ` (16 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add TDX's own VM and vCPU structures as placeholders to manage and run
TDX guests.  Also add helper functions to check whether a VM/vCPU is a
TDX or a normal VMX one, and add helpers to convert between TDX VM/vCPU
and KVM VM/vCPU.

TDX protects guest VMs from a malicious host.  Unlike VMX guests, TDX
guests are crypto-protected.  KVM cannot access TDX guests' memory and
vCPU states directly.  Instead, TDX requires KVM to use a set of TDX
architecture-defined firmware APIs (a.k.a. TDX module SEAMCALLs) to
manage and run TDX guests.

In fact, the ways to manage and run TDX guests and normal VMX guests are
quite different.  Because of that, the current structures
('struct kvm_vmx' and 'struct vcpu_vmx') to manage VMX guests are not
quite suitable for TDX guests.  E.g., the majority of the members of
'struct vcpu_vmx' don't apply to TDX guests.

Introduce TDX's own VM and vCPU structures ('struct kvm_tdx' and 'struct
vcpu_tdx' respectively) for KVM to manage and run TDX guests.  And
instead of building TDX's VM and vCPU structures based on VMX's, build
them directly based on 'struct kvm'.

As a result, TDX and VMX guests will have different VM sizes and vCPU
sizes/alignments.

Currently, kvm_arch_alloc_vm() uses 'kvm_x86_ops::vm_size' to allocate
enough space for the VM structure when creating a guest.  With TDX
guests, ideally, KVM should allocate the VM structure based on the VM
type so that the precise size can be allocated for VMX and TDX guests.
But this requires more extensive code changes.  For now, simply choose
the maximum size of 'struct kvm_tdx' and 'struct kvm_vmx' for VM
structure allocation for both VMX and TDX guests.  This wastes a small
amount of memory for each VM whose structure is the smaller of the two,
but this is acceptable.

For simplicity, use the same approach for vCPU allocation too.
Otherwise KVM would need to maintain a separate 'kvm_vcpu_cache' for
each VM type.

Note, updating the 'vt_x86_ops::vm_size' needs to be done before calling
kvm_ops_update(), which copies vt_x86_ops to kvm_x86_ops.  However this
happens before TDX module initialization.  Therefore theoretically it is
possible that 'kvm_x86_ops::vm_size' is set to size of 'struct kvm_tdx'
(when it's larger) but TDX actually fails to initialize at a later time.

Again, the worst case of this is wasting a couple of bytes of memory for
each VM.  KVM could choose to update 'kvm_x86_ops::vm_size' at a later
time depending on TDX's status, but that would require the base KVM
module to export either kvm_x86_ops or kvm_ops_update().

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Correct typo for update (Tony)

uAPI breakout v1:
 - Re-add __always_inline to to_kvm_tdx(), to_tdx(). (Sean)
 - Fix bisectability issues in headers (Kai)
 - Add a comment around updating vt_x86_ops.vm_size.
 - Update the comment around updating vcpu_size/align:
   https://lore.kernel.org/kvm/25d2bf93854ae7410d82119227be3cb2ce47c4f2.camel@intel.com/
 - Refine changelog:
   https://lore.kernel.org/kvm/9c592801471a137c51f583065764fbfc3081c016.camel@intel.com/

v19:
 - correctly update ops.vm_size, vcpu_size, and vcpu_align by Xiaoyao

v14 -> v15:
 - use KVM_X86_TDX_VM
---
 arch/x86/kvm/vmx/main.c | 53 ++++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.c  |  2 +-
 arch/x86/kvm/vmx/tdx.h  | 49 +++++++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 053294939eb1..245f7d1f1bd4 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -8,6 +8,39 @@
 #include "posted_intr.h"
 #include "tdx.h"
 
+static __init int vt_hardware_setup(void)
+{
+	int ret;
+
+	ret = vmx_hardware_setup();
+	if (ret)
+		return ret;
+
+	/*
+	 * Update vt_x86_ops::vm_size here so it is ready before
+	 * kvm_ops_update() is called in kvm_x86_vendor_init().
+	 *
+	 * Note, the actual bringing up of TDX must be done after
+	 * kvm_ops_update() because enabling TDX requires enabling
+	 * hardware virtualization first, i.e., all online CPUs must
+	 * be in post-VMXON state.  This means the @vm_size here
+	 * may be updated to TDX's size but TDX may fail to enable
+	 * at later time.
+	 *
+	 * The VMX/VT code could update kvm_x86_ops::vm_size again
+	 * after bringing up TDX, but this would require exporting
+	 * either kvm_x86_ops or kvm_ops_update() from the base KVM
+	 * module, which looks overkill.  Anyway, the worst case here
+	 * is KVM may allocate couple of more bytes than needed for
+	 * each VM.
+	 */
+	if (enable_tdx)
+		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
+				sizeof(struct kvm_tdx));
+
+	return 0;
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -161,7 +194,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
-	.hardware_setup = vmx_hardware_setup,
+	.hardware_setup = vt_hardware_setup,
 	.handle_intel_pt_intr = NULL,
 
 	.runtime_ops = &vt_x86_ops,
@@ -178,6 +211,7 @@ module_exit(vt_exit);
 
 static int __init vt_init(void)
 {
+	unsigned vcpu_size, vcpu_align;
 	int r;
 
 	r = vmx_init();
@@ -187,12 +221,25 @@ static int __init vt_init(void)
 	/* tdx_init() has been taken */
 	tdx_bringup();
 
+	/*
+	 * TDX and VMX have different vCPU structures.  Calculate the
+	 * maximum size/align so that kvm_init() can use the larger
+	 * values to create the kmem_vcpu_cache.
+	 */
+	vcpu_size = sizeof(struct vcpu_vmx);
+	vcpu_align = __alignof__(struct vcpu_vmx);
+	if (enable_tdx) {
+		vcpu_size = max_t(unsigned, vcpu_size,
+				sizeof(struct vcpu_tdx));
+		vcpu_align = max_t(unsigned, vcpu_align,
+				__alignof__(struct vcpu_tdx));
+	}
+
 	/*
 	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
 	 * exposed to userspace!
 	 */
-	r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
-		     THIS_MODULE);
+	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
 	if (r)
 		goto err_kvm_init;
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f95a4dbcaf4a..f2830ff2af1d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,7 +7,7 @@
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-static bool enable_tdx __ro_after_init;
+bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
 
 static enum cpuhp_state tdx_cpuhp_state;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 766a6121f670..e6a232d58e6a 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,9 +4,58 @@
 #ifdef CONFIG_INTEL_TDX_HOST
 void tdx_bringup(void);
 void tdx_cleanup(void);
+
+extern bool enable_tdx;
+
+struct kvm_tdx {
+	struct kvm kvm;
+	/* TDX specific members follow. */
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+	/* TDX specific members follow. */
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+static __always_inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+	return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static __always_inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+	return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+
 #else
 static inline void tdx_bringup(void) {}
 static inline void tdx_cleanup(void) {}
+
+#define enable_tdx	0
+
+struct kvm_tdx {
+	struct kvm kvm;
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+};
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+
 #endif
 
 #endif
-- 
2.47.0



* Re: [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures
  2024-10-30 19:00 ` [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures Rick Edgecombe
@ 2025-01-05 10:58   ` Francesco Lavra
  2025-01-06 19:00     ` Edgecombe, Rick P
  2025-01-22  7:52     ` Tony Lindgren
  0 siblings, 2 replies; 103+ messages in thread
From: Francesco Lavra @ 2025-01-05 10:58 UTC (permalink / raw)
  To: rick.p.edgecombe
  Cc: isaku.yamahata, isaku.yamahata, kai.huang, kvm, linux-kernel,
	pbonzini, reinette.chatre, seanjc, tony.lindgren, xiaoyao.li,
	yan.y.zhao

On 2024-10-30 at 19:00, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 766a6121f670..e6a232d58e6a 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -4,9 +4,58 @@
>  #ifdef CONFIG_INTEL_TDX_HOST
>  void tdx_bringup(void);
>  void tdx_cleanup(void);
> +
> +extern bool enable_tdx;
> +
> +struct kvm_tdx {
> +	struct kvm kvm;
> +	/* TDX specific members follow. */
> +};
> +
> +struct vcpu_tdx {
> +	struct kvm_vcpu	vcpu;
> +	/* TDX specific members follow. */
> +};
> +
> +static inline bool is_td(struct kvm *kvm)
> +{
> +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +
> +static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	return is_td(vcpu->kvm);
> +}
> +
> +static __always_inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
> +{
> +	return container_of(kvm, struct kvm_tdx, kvm);
> +}
> +
> +static __always_inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
> +{
> +	return container_of(vcpu, struct vcpu_tdx, vcpu);
> +}
> +
>  #else
>  static inline void tdx_bringup(void) {}
>  static inline void tdx_cleanup(void) {}
> +
> +#define enable_tdx	0
> +
> +struct kvm_tdx {
> +	struct kvm kvm;
> +};
> +
> +struct vcpu_tdx {
> +	struct kvm_vcpu	vcpu;
> +};
> +
> +static inline bool is_td(struct kvm *kvm) { return false; }
> +static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
> +static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
> +static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }

IMO the definitions of to_kvm_tdx() and to_tdx() shouldn't be there
when CONFIG_INTEL_TDX_HOST is not defined: they are (and should be)
only used in CONFIG_INTEL_TDX_HOST code.


* Re: [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures
  2025-01-05 10:58   ` Francesco Lavra
@ 2025-01-06 19:00     ` Edgecombe, Rick P
  2025-01-22  7:52     ` Tony Lindgren
  1 sibling, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-06 19:00 UTC (permalink / raw)
  To: francescolavra.fl@gmail.com
  Cc: Li, Xiaoyao, seanjc@google.com, Huang, Kai, Chatre, Reinette,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Yamahata, Isaku, Zhao, Yan Y

On Sun, 2025-01-05 at 11:58 +0100, Francesco Lavra wrote:
> > +static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
> > +static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
> 
> IMO the definitions of to_kvm_tdx() and to_tdx() shouldn't be there
> when CONFIG_INTEL_TDX_HOST is not defined: they are (and should be)
> only used in CONFIG_INTEL_TDX_HOST code.

Seems reasonable. Thanks.


* Re: [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures
  2025-01-05 10:58   ` Francesco Lavra
  2025-01-06 19:00     ` Edgecombe, Rick P
@ 2025-01-22  7:52     ` Tony Lindgren
  1 sibling, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2025-01-22  7:52 UTC (permalink / raw)
  To: Francesco Lavra
  Cc: rick.p.edgecombe, isaku.yamahata, isaku.yamahata, kai.huang, kvm,
	linux-kernel, pbonzini, reinette.chatre, seanjc, xiaoyao.li,
	yan.y.zhao

On Sun, Jan 05, 2025 at 11:58:12AM +0100, Francesco Lavra wrote:
> On 2024-10-30 at 19:00, Rick Edgecombe wrote:
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -4,9 +4,58 @@
> >  #ifdef CONFIG_INTEL_TDX_HOST
...

> >  #else
> >  static inline void tdx_bringup(void) {}
> >  static inline void tdx_cleanup(void) {}
> > +
> > +#define enable_tdx	0
> > +
> > +struct kvm_tdx {
> > +	struct kvm kvm;
> > +};
> > +
> > +struct vcpu_tdx {
> > +	struct kvm_vcpu	vcpu;
> > +};
> > +
> > +static inline bool is_td(struct kvm *kvm) { return false; }
> > +static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
> > +static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
> > +static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
> 
> IMO the definitions of to_kvm_tdx() and to_tdx() shouldn't be there
> when CONFIG_INTEL_TDX_HOST is not defined: they are (and should be)
> only used in CONFIG_INTEL_TDX_HOST code.

Good idea.

How about we just make to_kvm_tdx() and to_tdx() private to tdx.c?
They are not currently used anywhere else.

And we can add "#pragma GCC poison to_vmx" at the top of tdx.c to avoid
accidental use of to_vmx() in tdx.c like Paolo suggested earlier at [0]
below.

We could also get rid of the dummy struct kvm_tdx and vcpu_tdx for the
case where CONFIG_INTEL_TDX_HOST is not defined, with a few helpers to
get the size.

Regards,

Tony

[0] https://lore.kernel.org/kvm/89657f96-0ed1-4543-9074-f13f62cc4694@redhat.com/
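
As a hedged sketch of the poisoning idea (not the actual tdx.c): the
structure layouts, the local container_of() and the tdx_only_of() helper
below are simplified assumptions for a user-space build. The relevant
part is the pragma, which GCC and Clang both accept; once it has been
seen, any later appearance of the identifier to_vmx fails the build.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the KVM structures (hypothetical layouts). */
struct kvm_vcpu {
	int vcpu_id;
};

struct vcpu_vmx {
	struct kvm_vcpu vcpu;
	int vmx_only;
};

struct vcpu_tdx {
	struct kvm_vcpu vcpu;
	int tdx_only;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Would normally arrive through a shared VMX header. */
static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
{
	return container_of(vcpu, struct vcpu_vmx, vcpu);
}

/* Kept private to the TDX source file. */
static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
{
	return container_of(vcpu, struct vcpu_tdx, vcpu);
}

/* From here on, using the poisoned identifier is a compile error. */
#pragma GCC poison to_vmx

/* Hypothetical TDX-side helper: only to_tdx() is usable here. */
static int tdx_only_of(struct kvm_vcpu *vcpu)
{
	return to_tdx(vcpu)->tdx_only;
}
```

The definition placed before the pragma is unaffected, so the shared
header can still be included first; only uses after the poison point are
rejected.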


* [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (10 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 22:38   ` Huang, Kai
  2024-10-30 19:00 ` [PATCH v2 13/25] KVM: TDX: Add TDX "architectural" error codes Rick Edgecombe
                   ` (15 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Define architectural definitions for KVM to issue the TDX SEAMCALLs.

These are the structures and values that are architecturally defined in
the ABI Reference chapter of the TDX module specifications.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
uAPI breakout v2:
 - Use TDX 1.5 naming of config_flags instead of exec_controls (Xiaoyao)

uAPI breakout v1:
 - Remove macros no longer needed due to reading metadata done in TDX
   host code:
   - Metadata field ID macros, bit definitions
   - TDX_MAX_NR_CPUID_CONFIGS
 - Drop unused defines (Kai)
 - Fix bisectability issues in headers (Kai)
 - Remove TDX_MAX_VCPUS define (Kai)
 - Remove unused TD_EXIT_OTHER_SMI_IS_MSMI define.
 - Move TDX vm type to separate patch
 - Move unions in tdx_arch.h to where they are introduced (Sean)

v19:
- drop tdvmcall constants by Xiaoyao

v18:
- Add metadata field id
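
Note for reviewers (illustration only, not part of the patch): the sketch
below shows how the field access codes and the 25MHz TSC units in
tdx_arch.h work out numerically.  It is plain userspace C, not kernel code.
build_tdx_field(), tsc_khz_to_25mhz() and tsc_25mhz_to_khz() are
hypothetical stand-ins for __BUILD_TDX_FIELD() and the TDX_TSC_* macros,
and the constants are copied from the diff in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for BIT_ULL(63) and GENMASK_ULL(31, 0). */
#define TDX_NON_ARCH		(1ULL << 63)
#define TDX_CLASS_SHIFT		56
#define TDX_FIELD_MASK		0xffffffffULL

#define TDVPS_CLASS_MANAGEMENT	32ULL
#define TD_VCPU_PEND_NMI	11

/* Same layout as __BUILD_TDX_FIELD(): non-arch flag | class | field. */
static uint64_t build_tdx_field(int non_arch, uint64_t class_code,
				uint64_t field)
{
	/* Sketch-only check: the class must not spill into bit 63. */
	assert(class_code < (1ULL << (63 - TDX_CLASS_SHIFT)));

	return (non_arch ? TDX_NON_ARCH : 0) |
	       (class_code << TDX_CLASS_SHIFT) |
	       (field & TDX_FIELD_MASK);
}

/* Mirrors TDX_TSC_KHZ_TO_25MHZ(): integer division truncates. */
static uint64_t tsc_khz_to_25mhz(uint64_t tsc_khz)
{
	return tsc_khz / (25 * 1000);
}

/* Mirrors TDX_TSC_25MHZ_TO_KHZ(). */
static uint64_t tsc_25mhz_to_khz(uint64_t tsc_25mhz)
{
	return tsc_25mhz * (25 * 1000);
}
```

With these, the equivalent of TDVPS_MANAGEMENT(TD_VCPU_PEND_NMI) comes out
as 0x200000000000000b, and a TSC of 2412345 kHz truncates to 96 units,
which converts back to 2400000 kHz.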
---
 arch/x86/kvm/vmx/tdx.h      |   2 +
 arch/x86/kvm/vmx/tdx_arch.h | 158 ++++++++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e6a232d58e6a..1d6fa81a072d 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -1,6 +1,8 @@
 #ifndef  __KVM_X86_VMX_TDX_H
 #define __KVM_X86_VMX_TDX_H
 
+#include "tdx_arch.h"
+
 #ifdef CONFIG_INTEL_TDX_HOST
 void tdx_bringup(void);
 void tdx_cleanup(void);
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..84af7666e958
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+#define TDX_VERSION_SHIFT		16
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER			0
+#define TDH_MNG_ADDCX			1
+#define TDH_MEM_PAGE_ADD		2
+#define TDH_MEM_SEPT_ADD		3
+#define TDH_VP_ADDCX			4
+#define TDH_MEM_PAGE_AUG		6
+#define TDH_MEM_RANGE_BLOCK		7
+#define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_CREATE			9
+#define TDH_VP_CREATE			10
+#define TDH_MNG_RD			11
+#define TDH_MR_EXTEND			16
+#define TDH_MR_FINALIZE			17
+#define TDH_VP_FLUSH			18
+#define TDH_MNG_VPFLUSHDONE		19
+#define TDH_MNG_KEY_FREEID		20
+#define TDH_MNG_INIT			21
+#define TDH_VP_INIT			22
+#define TDH_VP_RD			26
+#define TDH_MNG_KEY_RECLAIMID		27
+#define TDH_PHYMEM_PAGE_RECLAIM		28
+#define TDH_MEM_PAGE_REMOVE		29
+#define TDH_MEM_SEPT_REMOVE		30
+#define TDH_SYS_RD			34
+#define TDH_MEM_TRACK			38
+#define TDH_MEM_RANGE_UNBLOCK		39
+#define TDH_PHYMEM_CACHE_WB		40
+#define TDH_PHYMEM_PAGE_WBINVD		41
+#define TDH_VP_WR			43
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH			BIT_ULL(63)
+#define TDX_CLASS_SHIFT			56
+#define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field)	\
+	(((non_arch) ? TDX_NON_ARCH : 0) |		\
+	 ((u64)(class) << TDX_CLASS_SHIFT) |		\
+	 ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field)			\
+	__BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field)		\
+	__BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* Class code for TD */
+#define TD_CLASS_EXECUTION_CONTROLS	17ULL
+
+/* Class code for TDVPS */
+#define TDVPS_CLASS_VMCS		0ULL
+#define TDVPS_CLASS_GUEST_GPR		16ULL
+#define TDVPS_CLASS_OTHER_GUEST		17ULL
+#define TDVPS_CLASS_MANAGEMENT		32ULL
+
+enum tdx_tdcs_execution_control {
+	TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field)		BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field)		BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field)		BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
+#define TDVPS_STATE_NON_ARCH(field)	BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
+
+/* Management class fields */
+enum tdx_vcpu_guest_management {
+	TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_vcpu_guest_management */
+#define TDVPS_MANAGEMENT(field)		BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE		256
+
+struct tdx_cpuid_value {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+	u32 edx;
+} __packed;
+
+#define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
+#define TDX_TD_ATTR_PKS			BIT_ULL(30)
+#define TDX_TD_ATTR_KL			BIT_ULL(31)
+#define TDX_TD_ATTR_PERFMON		BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+	u64 attributes;
+	u64 xfam;
+	u16 max_vcpus;
+	u8 reserved0[6];
+
+	u64 eptp_controls;
+	u64 config_flags;
+	u16 tsc_frequency;
+	u8  reserved1[38];
+
+	u64 mrconfigid[6];
+	u64 mrowner[6];
+	u64 mrownerconfig[6];
+	u64 reserved2[4];
+
+	union {
+		DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
+		u8 reserved3[768];
+	};
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_CONFIG_FLAGS_MAX_GPAW      BIT_ULL(0)
+
+/*
+ * TDH.VP.ENTER, TDG.VP.VMCALL preserves RBP
+ * 0: RBP can be used for TDG.VP.VMCALL input. RBP is clobbered.
+ * 1: RBP can't be used for TDG.VP.VMCALL input. RBP is preserved.
+ */
+#define TDX_CONFIG_FLAGS_NO_RBP_MOD	BIT_ULL(2)
+
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz.  The
+ * frequency must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions
  2024-10-30 19:00 ` [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions Rick Edgecombe
@ 2024-10-30 22:38   ` Huang, Kai
  2024-10-30 22:53     ` Huang, Kai
  0 siblings, 1 reply; 103+ messages in thread
From: Huang, Kai @ 2024-10-30 22:38 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P
  Cc: sean.j.christopherson@intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org, Zhao, Yan Y,
	Chatre, Reinette, Yamahata, Isaku


> +#include <linux/types.h>
> +
> +#define TDX_VERSION_SHIFT		16
> +
> +/*
> + * TDX SEAMCALL API function leaves
> + */
> +#define TDH_VP_ENTER			0
> +#define TDH_MNG_ADDCX			1
> +#define TDH_MEM_PAGE_ADD		2
> +#define TDH_MEM_SEPT_ADD		3
> +#define TDH_VP_ADDCX			4
> +#define TDH_MEM_PAGE_AUG		6
> +#define TDH_MEM_RANGE_BLOCK		7
> +#define TDH_MNG_KEY_CONFIG		8
> +#define TDH_MNG_CREATE			9
> +#define TDH_VP_CREATE			10
> +#define TDH_MNG_RD			11
> +#define TDH_MR_EXTEND			16
> +#define TDH_MR_FINALIZE			17
> +#define TDH_VP_FLUSH			18
> +#define TDH_MNG_VPFLUSHDONE		19
> +#define TDH_MNG_KEY_FREEID		20
> +#define TDH_MNG_INIT			21
> +#define TDH_VP_INIT			22
> +#define TDH_VP_RD			26
> +#define TDH_MNG_KEY_RECLAIMID		27
> +#define TDH_PHYMEM_PAGE_RECLAIM		28
> +#define TDH_MEM_PAGE_REMOVE		29
> +#define TDH_MEM_SEPT_REMOVE		30
> +#define TDH_SYS_RD			34
> +#define TDH_MEM_TRACK			38
> +#define TDH_MEM_RANGE_UNBLOCK		39
> +#define TDH_PHYMEM_CACHE_WB		40
> +#define TDH_PHYMEM_PAGE_WBINVD		41
> +#define TDH_VP_WR			43

Those are not needed anymore given the x86 core is exporting all KVM-needed
SEAMCALL wrappers.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions
  2024-10-30 22:38   ` Huang, Kai
@ 2024-10-30 22:53     ` Huang, Kai
  0 siblings, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2024-10-30 22:53 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P
  Cc: isaku.yamahata@gmail.com, Li, Xiaoyao, Chatre, Reinette,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	sean.j.christopherson@intel.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org

On Wed, 2024-10-30 at 22:38 +0000, Huang, Kai wrote:
> > +#include <linux/types.h>
> > +
> > +#define TDX_VERSION_SHIFT		16
> > +
> > +/*
> > + * TDX SEAMCALL API function leaves
> > + */
> > +#define TDH_VP_ENTER			0
> > +#define TDH_MNG_ADDCX			1
> > +#define TDH_MEM_PAGE_ADD		2
> > +#define TDH_MEM_SEPT_ADD		3
> > +#define TDH_VP_ADDCX			4
> > +#define TDH_MEM_PAGE_AUG		6
> > +#define TDH_MEM_RANGE_BLOCK		7
> > +#define TDH_MNG_KEY_CONFIG		8
> > +#define TDH_MNG_CREATE			9
> > +#define TDH_VP_CREATE			10
> > +#define TDH_MNG_RD			11
> > +#define TDH_MR_EXTEND			16
> > +#define TDH_MR_FINALIZE			17
> > +#define TDH_VP_FLUSH			18
> > +#define TDH_MNG_VPFLUSHDONE		19
> > +#define TDH_MNG_KEY_FREEID		20
> > +#define TDH_MNG_INIT			21
> > +#define TDH_VP_INIT			22
> > +#define TDH_VP_RD			26
> > +#define TDH_MNG_KEY_RECLAIMID		27
> > +#define TDH_PHYMEM_PAGE_RECLAIM		28
> > +#define TDH_MEM_PAGE_REMOVE		29
> > +#define TDH_MEM_SEPT_REMOVE		30
> > +#define TDH_SYS_RD			34
> > +#define TDH_MEM_TRACK			38
> > +#define TDH_MEM_RANGE_UNBLOCK		39
> > +#define TDH_PHYMEM_CACHE_WB		40
> > +#define TDH_PHYMEM_PAGE_WBINVD		41
> > +#define TDH_VP_WR			43
> 
> Those are not needed anymore given the x86 core is exporting all KVM-needed
> SEAMCALL wrappers.
> 

To clarify, I meant that all those macros for SEAMCALL leaves are not needed.
Sorry, in the above reply I mistakenly quoted the "#include <linux/types.h>",
which should obviously still be kept.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 13/25] KVM: TDX: Add TDX "architectural" error codes
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (11 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 14/25] KVM: TDX: Add helper functions to print TDX SEAMCALL error Rick Edgecombe
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Sean Christopherson, Isaku Yamahata, Yuan Yao

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add error codes for the TDX SEAMCALLs, covering both the TDX VMM side (TDH
SEAMCALLs) and the TDX guest side (TDG.VP.VMCALL).  KVM issues the TDX
SEAMCALLs and checks their error codes.  KVM also handles hypercalls from
the TDX guest and may return an error to it, so the guest-side error codes
are needed as well.

A TDX SEAMCALL uses bits 31:0 of RAX to return additional information, so
these error codes only match RAX[63:32] exactly.  Error codes for
TDG.VP.VMCALL are defined by the TDX Guest-Host-Communication Interface
spec.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
uAPI breakout v2:
 - Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)

v19:
 - Drop TDX_EPT_WALK_FAILED, TDX_EPT_ENTRY_NOT_FREE
 - Rename TDG_VP_VMCALL_ => TDVMCALL_ to match the existing code
 - Move TDVMCALL error codes to shared/tdx.h
 - Added TDX_OPERAND_ID_TDR
 - Fix bisectability issues in headers (Kai)
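
Note for reviewers (illustration only, not part of the patch): since the
status lives in RAX[63:32] and the detail in RAX[31:0], a caller has to
mask before comparing.  The userspace sketch below shows one way to do
that.  seamcall_status() and seamcall_detail() are hypothetical helper
names; the constants are copied from tdx_errno.h in this mail.

```c
#include <assert.h>
#include <stdint.h>

#define TDX_SEAMCALL_STATUS_MASK	0xFFFFFFFF00000000ULL
#define TDX_OPERAND_BUSY		0x8000020000000000ULL
#define TDX_OPERAND_ID_SEPT		0x92

/* Hypothetical helper: the architectural status is RAX[63:32]. */
static uint64_t seamcall_status(uint64_t rax)
{
	return rax & TDX_SEAMCALL_STATUS_MASK;
}

/* Hypothetical helper: the detail (e.g. an operand ID) is RAX[31:0]. */
static uint32_t seamcall_detail(uint64_t rax)
{
	return (uint32_t)(rax & ~TDX_SEAMCALL_STATUS_MASK);
}

/* Example: a busy status whose detail names the S-EPT operand. */
void seamcall_status_example(void)
{
	uint64_t rax = TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT;

	/* A direct compare against the status code would not match. */
	assert(rax != TDX_OPERAND_BUSY);
	assert(seamcall_status(rax) == TDX_OPERAND_BUSY);
	assert(seamcall_detail(rax) == TDX_OPERAND_ID_SEPT);
}
```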
---
 arch/x86/include/asm/shared/tdx.h |  7 +++++-
 arch/x86/kvm/vmx/tdx.h            |  1 +
 arch/x86/kvm/vmx/tdx_errno.h      | 36 +++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index fdfd41511b02..620327f0161f 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -26,7 +26,12 @@
 #define TDVMCALL_GET_QUOTE		0x10002
 #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
 
-#define TDVMCALL_STATUS_RETRY		1
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDVMCALL_STATUS_SUCCESS		0x0000000000000000ULL
+#define TDVMCALL_STATUS_RETRY		0x0000000000000001ULL
+#define TDVMCALL_STATUS_INVALID_OPERAND	0x8000000000000000ULL
 
 /*
  * Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1d6fa81a072d..faed454385ca 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -2,6 +2,7 @@
 #define __KVM_X86_VMX_TDX_H
 
 #include "tdx_arch.h"
+#include "tdx_errno.h"
 
 #ifdef CONFIG_INTEL_TDX_HOST
 void tdx_bringup(void);
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..dc3fa2a58c2c
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
+#define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
+#define TDX_OPERAND_INVALID			0xC000010000000000ULL
+#define TDX_OPERAND_BUSY			0x8000020000000000ULL
+#define TDX_PREVIOUS_TLB_EPOCH_BUSY		0x8000020100000000ULL
+#define TDX_PAGE_METADATA_INCORRECT		0xC000030000000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
+#define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
+#define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED			0x0000081500000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
+#define TDX_FLUSHVP_NOT_DONE			0x8000082400000000ULL
+#define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
+#define TDX_EPT_ENTRY_STATE_INCORRECT		0xC0000B0D00000000ULL
+
+/*
+ * TDX module operand ID, appears in 31:0 part of error code as
+ * detail information
+ */
+#define TDX_OPERAND_ID_RCX			0x01
+#define TDX_OPERAND_ID_TDR			0x80
+#define TDX_OPERAND_ID_SEPT			0x92
+#define TDX_OPERAND_ID_TD_EPOCH			0xa9
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 14/25] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (12 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 13/25] KVM: TDX: Add TDX "architectural" error codes Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 15/25] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl Rick Edgecombe
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Binbin Wu, Yuan Yao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add helper functions to print out errors from the TDX module in a uniform
manner.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---
uAPI breakout v2:
 - Stringify the error codes, use ____pr_tdx_error_N() naming (Isaku, Kai)

uAPI breakout v1:
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Reorder header file include to adjust argument change of the C wrapper.
- Fix bisectability issues in headers (Kai)
- Updates from seamcall overhaul (Kai)

v19:
- dropped unnecessary include <asm/tdx.h>

v18:
- Added Reviewed-by Binbin.
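
Note for reviewers (illustration only, not part of the patch): the
userspace sketch below shows what the stringifying macros produce.  It
assumes a snprintf()-based log_err() as a stand-in for
pr_err_ratelimited(), and the macro parameters are renamed without the
leading double underscores, which are reserved in userspace C.  tdx_log
and pr_tdx_error_example() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for pr_err_ratelimited(): format into a buffer for inspection. */
static char tdx_log[160];
#define log_err(...)	snprintf(tdx_log, sizeof(tdx_log), __VA_ARGS__)

/* #fn turns the SEAMCALL name token into a string at compile time. */
#define pr_tdx_error(fn, err)	\
	log_err("SEAMCALL %s failed: 0x%llx\n", #fn, err)

/* fn_str and fmt must be string literals so that they concatenate. */
#define pr_tdx_error_N(fn_str, err, fmt, ...)	\
	log_err("SEAMCALL " fn_str " failed: 0x%llx, " fmt, err, __VA_ARGS__)

#define pr_tdx_error_1(fn, err, rcx)	\
	pr_tdx_error_N(#fn, err, "rcx 0x%llx\n", rcx)

#define pr_tdx_error_2(fn, err, rcx, rdx)	\
	pr_tdx_error_N(#fn, err, "rcx 0x%llx, rdx 0x%llx\n", rcx, rdx)

/* Example: one error value plus the RCX output register. */
void pr_tdx_error_example(void)
{
	unsigned long long err = 0x8000020000000000ULL;

	pr_tdx_error_1(TDH_MNG_CREATE, err, 0x80ULL);
	assert(strcmp(tdx_log, "SEAMCALL TDH_MNG_CREATE failed: "
			       "0x8000020000000000, rcx 0x80\n") == 0);
}
```

Because the name is stringified rather than evaluated, the first argument
only needs to be a token; it does not have to be a defined macro.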
---
 arch/x86/kvm/vmx/tdx.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f2830ff2af1d..60b577379a9a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,6 +7,21 @@
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#define pr_tdx_error(__fn, __err)	\
+	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
+
+#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
+	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
+
+#define pr_tdx_error_1(__fn, __err, __rcx)		\
+	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
+
+#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
+	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
+
+#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
+	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
+
 bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 15/25] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (13 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 14/25] KVM: TDX: Add helper functions to print TDX SEAMCALL error Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest state-protected VMs, with technology-specific subcommands defined
under it.  Despite the name, the subcommands are not limited to memory
encryption; they cover a variety of technology-specific operations.  It is
therefore natural to reuse KVM_MEMORY_ENCRYPT_OP for TDX-specific operations
and define TDX subcommands under it.

Add a placeholder function for the TDX-specific VM-scoped ioctl as
mem_enc_op.  TDX-specific sub-commands will be added later to retrieve/pass
TDX-specific parameters.  Make mem_enc_ioctl non-optional, as it is now
always filled in.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Correct comment to use hw_error naming (Isaku)
 - Drop KVM_TDX_CAPABILITIES, it's not needed yet (Binbin)

uAPI breakout v1:
 - rename error->hw_error (Kai)
 - Include "x86_ops.h" to tdx.c as the patch to initialize TDX module
   doesn't include it anymore.
 - Introduce tdx_vm_ioctl() as the first tdx func in x86_ops.h
 - Drop middle paragraph in the commit log (Tony)

v15:
  - change struct kvm_tdx_cmd to drop unused member.
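
Note for reviewers (illustration only, not part of the patch): the
userspace model below summarizes the input contract of the placeholder
as implemented in this mail.  It leaves out copy_from_user(),
copy_to_user() and kvm->lock entirely.  struct tdx_cmd_model and
model_mem_enc_ioctl() are hypothetical names; the field layout follows
struct kvm_tdx_cmd with fixed-width userspace types.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Same field order and sizes as struct kvm_tdx_cmd in the patch. */
struct tdx_cmd_model {
	uint32_t id;		/* sub-command ID */
	uint32_t flags;		/* zero if unused by the sub-command */
	uint64_t data;		/* immediate or userspace pointer */
	uint64_t hw_error;	/* filled by the kernel, zero on input */
};

/* Hypothetical model of vt_mem_enc_ioctl() followed by tdx_vm_ioctl(). */
static int model_mem_enc_ioctl(int is_td, const struct tdx_cmd_model *cmd)
{
	/* Non-TD VMs are rejected before any TDX code runs. */
	if (!is_td)
		return -ENOTTY;

	/* Userspace must not pre-set the hardware error field. */
	if (cmd->hw_error)
		return -EINVAL;

	/* No sub-commands exist yet, so every ID is unknown. */
	return -EINVAL;
}

void mem_enc_ioctl_example(void)
{
	struct tdx_cmd_model cmd = { 0 };

	assert(model_mem_enc_ioctl(0, &cmd) == -ENOTTY);
	assert(model_mem_enc_ioctl(1, &cmd) == -EINVAL);
}
```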
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +-
 arch/x86/include/uapi/asm/kvm.h    | 24 ++++++++++++++++++++++
 arch/x86/kvm/vmx/main.c            | 10 ++++++++++
 arch/x86/kvm/vmx/tdx.c             | 32 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h         |  6 ++++++
 arch/x86/kvm/x86.c                 |  4 ----
 6 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 53756a670f41..f250137c837a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -123,7 +123,7 @@ KVM_X86_OP(leave_smm)
 KVM_X86_OP(enable_smi_window)
 #endif
 KVM_X86_OP_OPTIONAL(dev_get_attr)
-KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
+KVM_X86_OP(mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index cba4351b3091..b6cb87f2b477 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -926,4 +926,28 @@ struct kvm_hyperv_eventfd {
 #define KVM_X86_SNP_VM		4
 #define KVM_X86_TDX_VM		5
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	/* enum kvm_tdx_cmd_id */
+	__u32 id;
+	/* flags for sub-command. If sub-command doesn't use this, set zero. */
+	__u32 flags;
+	/*
+	 * data for each sub-command. An immediate or a pointer to the actual
+	 * data in process virtual address.  If sub-command doesn't use it,
+	 * set zero.
+	 */
+	__u64 data;
+	/*
+	 * Auxiliary error code.  The sub-command may return TDX SEAMCALL
+	 * status code in addition to -Exxx.
+	 * Defined for consistency with struct kvm_sev_cmd.
+	 */
+	__u64 hw_error;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 245f7d1f1bd4..6ed78deea543 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -41,6 +41,14 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+	if (!is_td(kvm))
+		return -ENOTTY;
+
+	return tdx_vm_ioctl(kvm, argp);
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -191,6 +199,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 
 	.get_untagged_addr = vmx_get_untagged_addr,
+
+	.mem_enc_ioctl = vt_mem_enc_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 60b577379a9a..76655d82f749 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
 #include <linux/cpu.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
+#include "x86_ops.h"
 #include "tdx.h"
 
 #undef pr_fmt
@@ -29,6 +30,37 @@ static enum cpuhp_state tdx_cpuhp_state;
 
 static const struct tdx_sys_info *tdx_sysinfo;
 
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_tdx_cmd tdx_cmd;
+	int r;
+
+	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+		return -EFAULT;
+
+	/*
+	 * Userspace should never set hw_error. It is used to fill
+	 * hardware-defined error by the kernel.
+	 */
+	if (tdx_cmd.hw_error)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	switch (tdx_cmd.id) {
+	default:
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+		r = -EFAULT;
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
 static int tdx_online_cpu(unsigned int cpu)
 {
 	unsigned long flags;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a55981c5216e..42901be70f9d 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -118,4 +118,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 #endif
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
+#ifdef CONFIG_INTEL_TDX_HOST
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+#else
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+#endif
+
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9da7c728c391..d86a18a4195b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7308,10 +7308,6 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		goto out;
 	}
 	case KVM_MEMORY_ENCRYPT_OP: {
-		r = -ENOTTY;
-		if (!kvm_x86_ops.mem_enc_ioctl)
-			goto out;
-
 		r = kvm_x86_call(mem_enc_ioctl)(kvm, argp);
 		break;
 	}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (14 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 15/25] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-31  9:09   ` Binbin Wu
                     ` (2 more replies)
  2024-10-30 19:00 ` [PATCH v2 17/25] KVM: TDX: create/destroy VM structure Rick Edgecombe
                   ` (11 subsequent siblings)
  27 siblings, 3 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Binbin Wu

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM needs system-wide information about the TDX module. Generate the
data based on tdx_sysinfo td_conf CPUID data.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
uAPI breakout v2:
 - Update stale patch description (Binbin)
 - Add KVM_TDX_CAPABILITIES where it's first used (Binbin)
 - Drop unused KVM_TDX_CPUID_NO_SUBLEAF (Chao)
 - Drop mmu.h, it's only needed in later patches (Binbin)
 - Fold in Xiaoyao's capabilities changes (Tony)
 - Generate data without struct kvm_tdx_caps (Tony)
 - Use struct kvm_cpuid_entry2 as suggested (Binbin)
 - Use helpers for phys_addr_bits (Paolo)
 - Check TDX and KVM capabilities on _tdx_bringup() (Xiaoyao)
 - Change code around cpuid_config_value since
   struct tdx_cpuid_config_value {} is removed (Kai)

uAPI breakout v1:
 - Mention hardware_unsetup(). (Binbin)
 - Added Reviewed-by. (Binbin)
 - Eliminated tdx_md_read(). (Kai)
 - Include "x86_ops.h" to tdx.c as the patch to initialize TDX module
   doesn't include it anymore.
 - Introduce tdx_vm_ioctl() as the first tdx func in x86_ops.h

v19:
 - Added features0
 - Use tdx_sys_metadata_read()
 - Fix error recovery path by Yuan

Change v18:
 - Newly Added
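
Note for reviewers (illustration only, not part of the patch): the
userspace sketch below restates two pieces of logic from the diff in this
mail, the fixed0/fixed1 filtering used for attributes and XFAM, and the
unpacking of the 64-bit CPUID config words into 32-bit registers.
supported_bits(), struct cpuid_entry_model and unpack_cpuid() are
hypothetical names, and the fixed0/fixed1 meaning is taken from how the
patch uses the masks, not from the spec text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * fixed1: bits that must be 1, so KVM has to support all of them.
 * fixed0: bits that are allowed to be 1; everything else is dropped.
 */
static uint64_t supported_bits(uint64_t kvm_mask, uint64_t fixed0,
			       uint64_t fixed1)
{
	if ((kvm_mask & fixed1) != fixed1)
		return 0;

	return kvm_mask & fixed0;
}

struct cpuid_entry_model {
	uint32_t function, index, eax, ebx, ecx, edx;
};

#define CPUID_NO_SUBLEAF	((uint32_t)-1)

/* Low half of each 64-bit word comes first, high half second. */
static void unpack_cpuid(struct cpuid_entry_model *e, uint64_t leaf,
			 const uint64_t values[2])
{
	e->function = (uint32_t)leaf;
	e->index = (uint32_t)(leaf >> 32);
	e->eax = (uint32_t)values[0];
	e->ebx = (uint32_t)(values[0] >> 32);
	e->ecx = (uint32_t)values[1];
	e->edx = (uint32_t)(values[1] >> 32);

	/* "No sub-leaf" is reported to userspace as index 0. */
	if (e->index == CPUID_NO_SUBLEAF)
		e->index = 0;
}

void caps_example(void)
{
	/* SEPT_VE_DISABLE (bit 28) allowed, nothing forced to 1. */
	assert(supported_bits(1ULL << 28, ~0ULL, 0) == (1ULL << 28));
	/* DEBUG (bit 0) forced to 1 but not in KVM's mask: unsupported. */
	assert(supported_bits(1ULL << 28, ~0ULL, 1ULL << 0) == 0);
}
```

Note that 0 doubles as the failure value here, which matches the patch
treating an empty result as -EIO.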
---
 arch/x86/include/uapi/asm/kvm.h |   9 +++
 arch/x86/kvm/vmx/tdx.c          | 137 ++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index b6cb87f2b477..0630530af334 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -928,6 +928,8 @@ struct kvm_hyperv_eventfd {
 
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+
 	KVM_TDX_CMD_NR_MAX,
 };
 
@@ -950,4 +952,11 @@ struct kvm_tdx_cmd {
 	__u64 hw_error;
 };
 
+struct kvm_tdx_capabilities {
+	__u64 supported_attrs;
+	__u64 supported_xfam;
+	__u64 reserved[254];
+	struct kvm_cpuid2 cpuid;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 76655d82f749..253debbe685f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -30,6 +30,134 @@ static enum cpuhp_state tdx_cpuhp_state;
 
 static const struct tdx_sys_info *tdx_sysinfo;
 
+#define KVM_SUPPORTED_TD_ATTRS (TDX_TD_ATTR_SEPT_VE_DISABLE)
+
+static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf)
+{
+	u64 val = KVM_SUPPORTED_TD_ATTRS;
+
+	if ((val & td_conf->attributes_fixed1) != td_conf->attributes_fixed1)
+		return 0;
+
+	val &= td_conf->attributes_fixed0;
+
+	return val;
+}
+
+static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
+{
+	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
+
+	/*
+	 * PT and CET can be exposed to the TD guest regardless of KVM's XSS,
+	 * PT, and CET support.
+	 */
+	val |= XFEATURE_MASK_PT | XFEATURE_MASK_CET_USER |
+	       XFEATURE_MASK_CET_KERNEL;
+
+	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
+		return 0;
+
+	val &= td_conf->xfam_fixed0;
+
+	return val;
+}
+
+static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
+{
+	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
+}
+
+#define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
+
+static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
+{
+	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
+
+	entry->function = (u32)td_conf->cpuid_config_leaves[idx];
+	entry->index = td_conf->cpuid_config_leaves[idx] >> 32;
+	entry->eax = (u32)td_conf->cpuid_config_values[idx][0];
+	entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32;
+	entry->ecx = (u32)td_conf->cpuid_config_values[idx][1];
+	entry->edx = td_conf->cpuid_config_values[idx][1] >> 32;
+
+	if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF)
+		entry->index = 0;
+
+	/* Work around missing support on old TDX modules */
+	if (entry->function == 0x80000008)
+		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
+}
+
+static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
+			     struct kvm_tdx_capabilities *caps)
+{
+	int i;
+
+	caps->supported_attrs = tdx_get_supported_attrs(td_conf);
+	if (!caps->supported_attrs)
+		return -EIO;
+
+	caps->supported_xfam = tdx_get_supported_xfam(td_conf);
+	if (!caps->supported_xfam)
+		return -EIO;
+
+	caps->cpuid.nent = td_conf->num_cpuid_config;
+
+	for (i = 0; i < td_conf->num_cpuid_config; i++)
+		td_init_cpuid_entry2(&caps->cpuid.entries[i], i);
+
+	return 0;
+}
+
+static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
+{
+	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
+	struct kvm_tdx_capabilities __user *user_caps;
+	struct kvm_tdx_capabilities *caps = NULL;
+	int ret = 0;
+
+	/* flags is reserved for future use */
+	if (cmd->flags)
+		return -EINVAL;
+
+	caps = kmalloc(sizeof(*caps) +
+		       sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
+		       GFP_KERNEL);
+	if (!caps)
+		return -ENOMEM;
+
+	user_caps = u64_to_user_ptr(cmd->data);
+	if (copy_from_user(caps, user_caps, sizeof(*caps))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (caps->cpuid.nent < td_conf->num_cpuid_config) {
+		ret = -E2BIG;
+		goto out;
+	}
+
+	ret = init_kvm_tdx_caps(td_conf, caps);
+	if (ret)
+		goto out;
+
+	if (copy_to_user(user_caps, caps, sizeof(*caps))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (copy_to_user(user_caps->cpuid.entries, caps->cpuid.entries,
+			 caps->cpuid.nent *
+			 sizeof(caps->cpuid.entries[0])))
+		ret = -EFAULT;
+
+out:
+	/* kfree() accepts NULL. */
+	kfree(caps);
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -48,6 +176,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	mutex_lock(&kvm->lock);
 
 	switch (tdx_cmd.id) {
+	case KVM_TDX_CAPABILITIES:
+		r = tdx_get_capabilities(&tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -147,11 +278,17 @@ static int __init __tdx_bringup(void)
 		goto get_sysinfo_err;
 	}
 
+	/* Check TDX module and KVM capabilities */
+	if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) ||
+	    !tdx_get_supported_xfam(&tdx_sysinfo->td_conf))
+		goto get_sysinfo_err;
+
 	/*
 	 * Leave hardware virtualization enabled after TDX is enabled
 	 * successfully.  TDX CPU hotplug depends on this.
 	 */
 	return 0;
+
 get_sysinfo_err:
 	__do_tdx_cleanup();
 tdx_bringup_err:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
@ 2024-10-31  9:09   ` Binbin Wu
  2024-10-31  9:18     ` Tony Lindgren
  2024-10-31  9:23     ` Xiaoyao Li
  2024-12-06  8:45   ` Xiaoyao Li
  2025-01-08  2:34   ` Chao Gao
  2 siblings, 2 replies; 103+ messages in thread
From: Binbin Wu @ 2024-10-31  9:09 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre, Isaku Yamahata




On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
[...]
> +static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
> +{
> +	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
> +}
> +
> +#define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
> +
> +static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
> +{
> +	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
> +
> +	entry->function = (u32)td_conf->cpuid_config_leaves[idx];
> +	entry->index = td_conf->cpuid_config_leaves[idx] >> 32;
> +	entry->eax = (u32)td_conf->cpuid_config_values[idx][0];
> +	entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32;
> +	entry->ecx = (u32)td_conf->cpuid_config_values[idx][1];
> +	entry->edx = td_conf->cpuid_config_values[idx][1] >> 32;
> +
> +	if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF)
> +		entry->index = 0;
> +
> +	/* Work around missing support on old TDX modules */
> +	if (entry->function == 0x80000008)
> +		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
Is it necessary to set bit 16~23 to 0xff?
It seems that when userspace wants to retrieve the value, the GPAW will
be set in tdx_read_cpuid() anyway.

> +}
> +
>
[...]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31  9:09   ` Binbin Wu
@ 2024-10-31  9:18     ` Tony Lindgren
  2024-10-31  9:22       ` Binbin Wu
  2024-10-31  9:23     ` Xiaoyao Li
  1 sibling, 1 reply; 103+ messages in thread
From: Tony Lindgren @ 2024-10-31  9:18 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

On Thu, Oct 31, 2024 at 05:09:17PM +0800, Binbin Wu wrote:
> 
> 
> 
> On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> [...]
> > +static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
> > +{
> > +	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
> > +}
> > +
> > +#define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
> > +
> > +static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
> > +{
> > +	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
> > +
> > +	entry->function = (u32)td_conf->cpuid_config_leaves[idx];
> > +	entry->index = td_conf->cpuid_config_leaves[idx] >> 32;
> > +	entry->eax = (u32)td_conf->cpuid_config_values[idx][0];
> > +	entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32;
> > +	entry->ecx = (u32)td_conf->cpuid_config_values[idx][1];
> > +	entry->edx = td_conf->cpuid_config_values[idx][1] >> 32;
> > +
> > +	if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF)
> > +		entry->index = 0;
> > +
> > +	/* Work around missing support on old TDX modules */
> > +	if (entry->function == 0x80000008)
> > +		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
> Is it necessary to set bit 16~23 to 0xff?
> It seems that when userspace wants to retrieve the value, the GPAW will
> be set in tdx_read_cpuid() anyway.

Leaving it out currently produces:

qemu-system-x86_64: KVM_TDX_INIT_VM failed: Invalid argument

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31  9:18     ` Tony Lindgren
@ 2024-10-31  9:22       ` Binbin Wu
  0 siblings, 0 replies; 103+ messages in thread
From: Binbin Wu @ 2024-10-31  9:22 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre,
	Isaku Yamahata




On 10/31/2024 5:18 PM, Tony Lindgren wrote:
> On Thu, Oct 31, 2024 at 05:09:17PM +0800, Binbin Wu wrote:
>>
>>
>> On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
>> [...]
>>> +static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
>>> +{
>>> +	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
>>> +}
>>> +
>>> +#define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
>>> +
>>> +static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
>>> +{
>>> +	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
>>> +
>>> +	entry->function = (u32)td_conf->cpuid_config_leaves[idx];
>>> +	entry->index = td_conf->cpuid_config_leaves[idx] >> 32;
>>> +	entry->eax = (u32)td_conf->cpuid_config_values[idx][0];
>>> +	entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32;
>>> +	entry->ecx = (u32)td_conf->cpuid_config_values[idx][1];
>>> +	entry->edx = td_conf->cpuid_config_values[idx][1] >> 32;
>>> +
>>> +	if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF)
>>> +		entry->index = 0;
>>> +
>>> +	/* Work around missing support on old TDX modules */
>>> +	if (entry->function == 0x80000008)
>>> +		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
>> Is it necessary to set bit 16~23 to 0xff?
>> It seems that when userspace wants to retrieve the value, the GPAW will
>> be set in tdx_read_cpuid() anyway.
> Leaving it out currently produces:
>
> qemu-system-x86_64: KVM_TDX_INIT_VM failed: Invalid argument
Yes, I forgot that userspace would use the value as the mask to filter cpuid.


>
> Regards,
>
> Tony


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31  9:09   ` Binbin Wu
  2024-10-31  9:18     ` Tony Lindgren
@ 2024-10-31  9:23     ` Xiaoyao Li
  2024-10-31  9:37       ` Tony Lindgren
  1 sibling, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2024-10-31  9:23 UTC (permalink / raw)
  To: Binbin Wu, Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, reinette.chatre, Isaku Yamahata

On 10/31/2024 5:09 PM, Binbin Wu wrote:
> 
> 
> 
> On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> [...]
>> +static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
>> +{
>> +    return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
>> +}
>> +
>> +#define KVM_TDX_CPUID_NO_SUBLEAF    ((__u32)-1)
>> +
>> +static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, 
>> unsigned char idx)
>> +{
>> +    const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
>> +
>> +    entry->function = (u32)td_conf->cpuid_config_leaves[idx];
>> +    entry->index = td_conf->cpuid_config_leaves[idx] >> 32;
>> +    entry->eax = (u32)td_conf->cpuid_config_values[idx][0];
>> +    entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32;
>> +    entry->ecx = (u32)td_conf->cpuid_config_values[idx][1];
>> +    entry->edx = td_conf->cpuid_config_values[idx][1] >> 32;
>> +
>> +    if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF)
>> +        entry->index = 0;
>> +
>> +    /* Work around missing support on old TDX modules */
>> +    if (entry->function == 0x80000008)
>> +        entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
> Is it necessary to set bit 16~23 to 0xff?
> It seems that when userspace wants to retrieve the value, the GPAW will
> be set in tdx_read_cpuid() anyway.

Here the code initializes the configurable CPUID bits that get reported 
to userspace. Though the TDX module doesn't allow them to be set in 
TD_PARAM for KVM_TDX_INIT_VM, they get set to 0xff because KVM reuses 
these bits, EAX[23:16], as the interface for userspace to configure the 
GPAW of the TD guest (implemented in setup_tdparams_eptp_controls() in 
patch 19). That's why they need to be set to all-1s, to allow userspace 
to configure them.

And the comment above it is wrong and vague. We need to change it to 
something like

	/*
	 * Though the TDX module doesn't allow the configuration of guest
	 * phys addr bits (EAX[23:16]), KVM uses them as the interface for
	 * userspace to configure the GPAW. So these bits need to be
	 * reported as configurable to userspace.
	 */
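
As a rough illustration of the masking discussed here, below is a minimal
standalone sketch in userspace C (not kernel code). GPA_BITS_MASK is a
locally defined, hypothetical stand-in for GENMASK(23, 16), the getter is
an invented helper, and any input values are made up for illustration. The
setter mirrors tdx_set_guest_phys_addr_bits() from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's GENMASK(23, 16). */
#define GPA_BITS_MASK 0x00ff0000u

/* Mirrors tdx_set_guest_phys_addr_bits(): replace bits 23:16 only. */
static uint32_t set_guest_phys_addr_bits(uint32_t eax, int addr_bits)
{
	return (eax & ~GPA_BITS_MASK) | ((uint32_t)(addr_bits & 0xff) << 16);
}

/* Invented helper: read the 8-bit field back out of the register value. */
static uint32_t get_guest_phys_addr_bits(uint32_t eax)
{
	return (eax & GPA_BITS_MASK) >> 16;
}
```

Passing 0xff marks the whole field as configurable when reporting to
userspace, while passing a concrete width such as 52 is what a VMM might
write back. The other bits of the register value are left untouched in both
cases.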
>> +}
>> +
>>
> [...]


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31  9:23     ` Xiaoyao Li
@ 2024-10-31  9:37       ` Tony Lindgren
  2024-10-31 14:27         ` Xiaoyao Li
  0 siblings, 1 reply; 103+ messages in thread
From: Tony Lindgren @ 2024-10-31  9:37 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Binbin Wu, Rick Edgecombe, pbonzini, seanjc, yan.y.zhao,
	isaku.yamahata, kai.huang, kvm, linux-kernel, reinette.chatre,
	Isaku Yamahata

On Thu, Oct 31, 2024 at 05:23:57PM +0800, Xiaoyao Li wrote:
> here it is to initialize the configurable CPUID bits that get reported to
> userspace. Though TDX module doesn't allow them to be set in TD_PARAM for
> KVM_TDX_INIT_VM, they get set to 0xff because KVM reuse these bits
> EBX[23:16] as the interface for userspace to configure GPAW of TD guest
> (implemented in setup_tdparams_eptp_controls() in patch 19). That's why they
> need to be set as all-1 to allow userspace to configure.
> 
> And the comment above it is wrong and vague. we need to change it to
> something like
> 
> 	/*
>          * Though TDX module doesn't allow the configuration of guest
>          * phys addr bits (EBX[23:16]), KVM uses it as the interface for
>          * userspace to configure the GPAW. So need to report these bits
>          * as configurable to userspace.
>          */

That sounds good to me.

Hmm so care to check if we can also just leave out another "old module"
comment in tdx_read_cpuid()?

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31  9:37       ` Tony Lindgren
@ 2024-10-31 14:27         ` Xiaoyao Li
  2024-11-01  8:19           ` Tony Lindgren
  0 siblings, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2024-10-31 14:27 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Binbin Wu, Rick Edgecombe, pbonzini, seanjc, yan.y.zhao,
	isaku.yamahata, kai.huang, kvm, linux-kernel, reinette.chatre,
	Isaku Yamahata

On 10/31/2024 5:37 PM, Tony Lindgren wrote:
> On Thu, Oct 31, 2024 at 05:23:57PM +0800, Xiaoyao Li wrote:
>> here it is to initialize the configurable CPUID bits that get reported to
>> userspace. Though TDX module doesn't allow them to be set in TD_PARAM for
>> KVM_TDX_INIT_VM, they get set to 0xff because KVM reuse these bits
>> EBX[23:16] as the interface for userspace to configure GPAW of TD guest
>> (implemented in setup_tdparams_eptp_controls() in patch 19). That's why they
>> need to be set as all-1 to allow userspace to configure.
>>
>> And the comment above it is wrong and vague. we need to change it to
>> something like
>>
>> 	/*
>>           * Though TDX module doesn't allow the configuration of guest
>>           * phys addr bits (EBX[23:16]), KVM uses it as the interface for
>>           * userspace to configure the GPAW. So need to report these bits
>>           * as configurable to userspace.
>>           */
> 
> That sounds good to me.
> 
> Hmm so care to check if we can also just leave out another "old module"
> comment in tdx_read_cpuid()?

That one does relate to old modules, i.e. modules that do not report 
TDX_CONFIG_FLAGS_MAXGPA_VIRT in tdx_feature0.

I will send a follow-up patch to complete the handling for the case where 
the TDX module supports TDX_CONFIG_FLAGS_MAXGPA_VIRT.

> Regards,
> 
> Tony


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-31 14:27         ` Xiaoyao Li
@ 2024-11-01  8:19           ` Tony Lindgren
  0 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2024-11-01  8:19 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Binbin Wu, Rick Edgecombe, pbonzini, seanjc, yan.y.zhao,
	isaku.yamahata, kai.huang, kvm, linux-kernel, reinette.chatre,
	Isaku Yamahata

On Thu, Oct 31, 2024 at 10:27:11PM +0800, Xiaoyao Li wrote:
> On 10/31/2024 5:37 PM, Tony Lindgren wrote:
> > On Thu, Oct 31, 2024 at 05:23:57PM +0800, Xiaoyao Li wrote:
> > > here it is to initialize the configurable CPUID bits that get reported to
> > > userspace. Though TDX module doesn't allow them to be set in TD_PARAM for
> > > KVM_TDX_INIT_VM, they get set to 0xff because KVM reuse these bits
> > > EBX[23:16] as the interface for userspace to configure GPAW of TD guest
> > > (implemented in setup_tdparams_eptp_controls() in patch 19). That's why they
> > > need to be set as all-1 to allow userspace to configure.
> > > 
> > > And the comment above it is wrong and vague. we need to change it to
> > > something like
> > > 
> > > 	/*
> > >           * Though TDX module doesn't allow the configuration of guest
> > >           * phys addr bits (EBX[23:16]), KVM uses it as the interface for
> > >           * userspace to configure the GPAW. So need to report these bits
> > >           * as configurable to userspace.
> > >           */
> > 
> > That sounds good to me.
> > 
> > Hmm so care to check if we can also just leave out another "old module"
> > comment in tdx_read_cpuid()?
> 
> That one did relate to old module, the module that without
> TDX_CONFIG_FLAGS_MAXGPA_VIRT reported in tdx_feature0.

OK thanks for checking.

> I will sent an follow up patch to complement the handling if TDX module
> supports TDX_CONFIG_FLAGS_MAXGPA_VIRT.

OK

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
  2024-10-31  9:09   ` Binbin Wu
@ 2024-12-06  8:45   ` Xiaoyao Li
  2024-12-10  9:35     ` Tony Lindgren
  2025-01-08  2:34   ` Chao Gao
  2 siblings, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2024-12-06  8:45 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, reinette.chatre, Isaku Yamahata, Binbin Wu

On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX KVM needs system-wide information about the TDX module. Generate the
> data based on tdx_sysinfo td_conf CPUID data.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> uAPI breakout v2:
>   - Update stale patch description (Binbin)
>   - Add KVM_TDX_CAPABILITIES where it's first used (Binbin)
>   - Drop unused KVM_TDX_CPUID_NO_SUBLEAF (Chao)
>   - Drop mmu.h, it's only needed in later patches (Binbin)
>   - Fold in Xiaoyao's capabilities changes (Tony)
>   - Generate data without struct kvm_tdx_caps (Tony)
>   - Use struct kvm_cpuid_entry2 as suggested (Binbin)
>   - Use helpers for phys_addr_bits (Paolo)
>   - Check TDX and KVM capabilities on _tdx_bringup() (Xiaoyao)
>   - Change code around cpuid_config_value since
>     struct tdx_cpuid_config_value {} is removed (Kai)
> 
> uAPI breakout v1:
>   - Mention about hardware_unsetup(). (Binbin)
>   - Added Reviewed-by. (Binbin)
>   - Eliminated tdx_md_read(). (Kai)
>   - Include "x86_ops.h" to tdx.c as the patch to initialize TDX module
>     doesn't include it anymore.
>   - Introduce tdx_vm_ioctl() as the first tdx func in x86_ops.h
> 
> v19:
>   - Added features0
>   - Use tdx_sys_metadata_read()
>   - Fix error recovery path by Yuan
> 
> Change v18:
>   - Newly Added
> ---
>   arch/x86/include/uapi/asm/kvm.h |   9 +++
>   arch/x86/kvm/vmx/tdx.c          | 137 ++++++++++++++++++++++++++++++++
>   2 files changed, 146 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index b6cb87f2b477..0630530af334 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -928,6 +928,8 @@ struct kvm_hyperv_eventfd {
>   
>   /* Trust Domain eXtension sub-ioctl() commands. */
>   enum kvm_tdx_cmd_id {
> +	KVM_TDX_CAPABILITIES = 0,
> +
>   	KVM_TDX_CMD_NR_MAX,
>   };
>   
> @@ -950,4 +952,11 @@ struct kvm_tdx_cmd {
>   	__u64 hw_error;
>   };
>   
> +struct kvm_tdx_capabilities {
> +	__u64 supported_attrs;
> +	__u64 supported_xfam;
> +	__u64 reserved[254];
> +	struct kvm_cpuid2 cpuid;

Could we rename it to "configurable_cpuid" to call out that it only 
reports the bits that userspace is allowed to configure as 0 or 1 
at will?

If so, we could even add a comment like

	/*
	 * A bit of value 1 means the bit is free to be configured as
	 * 0/1 by the userspace VMM.
	 * A value of 0 only means the bit is not configurable, while
	 * its actual value depends on the TDX module/KVM
	 * implementation.
	 */
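
To make the intended semantics concrete, here is a small sketch of how a
userspace VMM might treat such a mask. It is only an assumption about VMM
behaviour, not code from QEMU or KVM; struct cpuid_reg_cfg and both helpers
are hypothetical names covering a single CPUID register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: one CPUID register, not a KVM structure. */
struct cpuid_reg_cfg {
	uint32_t configurable;	/* bit = 1: userspace may choose 0 or 1 */
	uint32_t requested;	/* what the VMM would like to set */
};

/* Keep only the requested bits that are reported as configurable. */
static uint32_t filter_requested(const struct cpuid_reg_cfg *c)
{
	return c->requested & c->configurable;
}

/* True if the VMM asked for a bit it is not allowed to configure. */
static bool requests_fixed_bits(const struct cpuid_reg_cfg *c)
{
	return (c->requested & ~c->configurable) != 0;
}
```

Note that a 0 in the mask says nothing about the bit's real value, so the
sketch only filters or rejects requests. It does not try to predict what
the guest ends up seeing.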

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-12-06  8:45   ` Xiaoyao Li
@ 2024-12-10  9:35     ` Tony Lindgren
  0 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2024-12-10  9:35 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, reinette.chatre, Isaku Yamahata,
	Binbin Wu

On Fri, Dec 06, 2024 at 04:45:01PM +0800, Xiaoyao Li wrote:
> On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -928,6 +928,8 @@ struct kvm_hyperv_eventfd {
> >   /* Trust Domain eXtension sub-ioctl() commands. */
> >   enum kvm_tdx_cmd_id {
> > +	KVM_TDX_CAPABILITIES = 0,
> > +
> >   	KVM_TDX_CMD_NR_MAX,
> >   };
> > @@ -950,4 +952,11 @@ struct kvm_tdx_cmd {
> >   	__u64 hw_error;
> >   };
> > +struct kvm_tdx_capabilities {
> > +	__u64 supported_attrs;
> > +	__u64 supported_xfam;
> > +	__u64 reserved[254];
> > +	struct kvm_cpuid2 cpuid;
> 
> Could we rename it to "configurable_cpuid" to call out that it only reports
> the bits that are allowable for userspace to configure at 0 or 1 at will.

Well it's already in the capabilities struct.. So to me it seems like just
adding a comment should do the trick.

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
  2024-10-31  9:09   ` Binbin Wu
  2024-12-06  8:45   ` Xiaoyao Li
@ 2025-01-08  2:34   ` Chao Gao
  2025-01-08  5:41     ` Huang, Kai
  2 siblings, 1 reply; 103+ messages in thread
From: Chao Gao @ 2025-01-08  2:34 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: pbonzini, seanjc, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Binbin Wu

>@@ -147,11 +278,17 @@ static int __init __tdx_bringup(void)
> 		goto get_sysinfo_err;
> 	}
> 
>+	/* Check TDX module and KVM capabilities */
>+	if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) ||
>+	    !tdx_get_supported_xfam(&tdx_sysinfo->td_conf))
>+		goto get_sysinfo_err;

The return value should be set to -EINVAL before the goto.

>+
> 	/*
> 	 * Leave hardware virtualization enabled after TDX is enabled
> 	 * successfully.  TDX CPU hotplug depends on this.
> 	 */
> 	return 0;
>+
> get_sysinfo_err:
> 	__do_tdx_cleanup();
> tdx_bringup_err:
>-- 
>2.47.0
>
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization
  2025-01-08  2:34   ` Chao Gao
@ 2025-01-08  5:41     ` Huang, Kai
  0 siblings, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2025-01-08  5:41 UTC (permalink / raw)
  To: Gao, Chao, Edgecombe, Rick P
  Cc: seanjc@google.com, binbin.wu@linux.intel.com, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Chatre, Reinette, Yamahata, Isaku,
	Zhao, Yan Y

On Wed, 2025-01-08 at 10:34 +0800, Gao, Chao wrote:
> > @@ -147,11 +278,17 @@ static int __init __tdx_bringup(void)
> > 		goto get_sysinfo_err;
> > 	}
> > 
> > +	/* Check TDX module and KVM capabilities */
> > +	if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) ||
> > +	    !tdx_get_supported_xfam(&tdx_sysinfo->td_conf))
> > +		goto get_sysinfo_err;
> 
> The return value should be set to -EINVAL before the goto.
> 

Yeah.  Sean actually pointed this out before.  I proposed internally to make
the change below to the patch "[PATCH v2 02/25] KVM: TDX: Get TDX global information":

--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3274,12 +3274,11 @@ static int __init __tdx_bringup(void)
        if (r)
                goto tdx_bringup_err;
 
+       r = -EINVAL;
        /* Get TDX global information for later use */
        tdx_sysinfo = tdx_get_sysinfo();
-       if (WARN_ON_ONCE(!tdx_sysinfo)) {
-               r = -EINVAL;
+       if (WARN_ON_ONCE(!tdx_sysinfo))
                goto get_sysinfo_err;
-       }

.. so that further failures can just 'goto <err_label>'.  I.e., the change
below should be made to the patch "[PATCH v2 18/25] KVM: TDX: Support per-VM
KVM_CAP_MAX_VCPUS extension check":

        /* Check TDX module and KVM capabilities */
        if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) ||
@@ -3319,7 +3318,6 @@ static int __init __tdx_bringup(void)
        if (td_conf->max_vcpus_per_td < num_present_cpus()) {
                pr_err("Disable TDX: MAX_VCPU_PER_TD (%u) smaller than number of logical CPUs (%u).\n",
                                td_conf->max_vcpus_per_td, num_present_cpus());
-               r = -EINVAL;
                goto get_sysinfo_err;
        }

Alternatively, we can just set r to -EINVAL before the goto, which is a simple
fix to this patch and is probably easier for Paolo to do.
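
For reference, the "preset the error code once" pattern described above can
be sketched outside the kernel like this. All names here (bringup_checks,
fake_cleanup(), bringup_sketch()) are hypothetical stand-ins, not the real
__tdx_bringup() code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the checks done during bringup. */
struct bringup_checks {
	bool have_sysinfo;
	bool attrs_ok;
	bool xfam_ok;
};

static int cleanup_calls;	/* counts calls to the fake cleanup */

static void fake_cleanup(void)
{
	cleanup_calls++;
}

static int bringup_sketch(const struct bringup_checks *c)
{
	int r;

	/* Preset once; every later failure can just jump to the label. */
	r = -EINVAL;

	if (!c->have_sysinfo)
		goto err;

	if (!c->attrs_ok || !c->xfam_ok)
		goto err;

	return 0;

err:
	fake_cleanup();
	return r;
}
```

The trade-off is that a later check which wants a different error code has
to set r explicitly, so the preset only helps while all failures in that
stretch share the same value.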

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 17/25] KVM: TDX: create/destroy VM structure
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (15 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-11-04  2:03   ` Chao Gao
  2024-10-30 19:00 ` [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check Rick Edgecombe
                   ` (10 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement management of the TDX private KeyID: create, destroy and free
it for a TDX guest.

When creating a TDX guest, assign a TDX private KeyID to the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).

On destruction, free the allocated pages, and the KeyID.

Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it at vm_destroy() kvm_x86_ops
hook.

Add a call for vm_free() at the end of kvm_arch_destroy_vm() because the
per-VM TDR needs to be freed after the KeyID.
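
The ordering described above (release the KeyID in vm_destroy(), free the
TDR in vm_free() at the very end) can be sketched as a toy model in plain C.
Every type and function below is hypothetical and only models the call
order of two optional hooks; it is not the KVM implementation.

```c
#include <stddef.h>

struct vm_sketch {
	int hkid;		/* -1 means no KeyID assigned */
	int tdr_allocated;	/* 1 while the TDR page is held */
	int order[2];		/* records the order of hook calls */
	int n;
};

struct vm_ops_sketch {
	void (*vm_destroy)(struct vm_sketch *vm);	/* optional */
	void (*vm_free)(struct vm_sketch *vm);		/* optional */
};

/* Models the vm_destroy() step: give the KeyID back first. */
static void td_destroy(struct vm_sketch *vm)
{
	vm->hkid = -1;
	vm->order[vm->n++] = 1;
}

/* Models the vm_free() step: the TDR goes away only after the KeyID. */
static void td_free(struct vm_sketch *vm)
{
	if (vm->hkid == -1)
		vm->tdr_allocated = 0;
	vm->order[vm->n++] = 2;
}

static void arch_destroy_vm_sketch(const struct vm_ops_sketch *ops,
				   struct vm_sketch *vm)
{
	if (ops->vm_destroy)
		ops->vm_destroy(vm);

	/* In the real flow, private page tables are torn down in between. */

	if (ops->vm_free)
		ops->vm_free(vm);
}
```

Both hooks are checked for NULL before being called, which is how the
sketch models optional ops: a backend that needs neither step can leave
them unset.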

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
Comment: this SOB/developed-by chain is getting a bit long.

uAPI breakout v2:
 - Remove write_lock(&kvm->mmu_lock) in HKID release was moved and vCPUs
   should be destroyed (Isaku)
 - Drop include for tdx_ops.h (Xu)
 - Drop unnecessary clearing of tdr_pa in __tdx_td_init() (Yuan)
 - Use is_td_created() in tdx_vm_free() (Nikolay)
 - Precalculate nr_tdcs_pages (Nikolay)
 - fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
   wrappers (Kai)
 - Add TD state handling (Tony)

uAPI breakout v1:
 - Fix unnecessary include re-ordering (Chao)
 - Fix the unpaired curly brackets (Chao)
 - Drop the tdx_mng_key_config_lock  (Chao)
 - Drop unnecessary is_hkid_assigned() check (Chao)
 - Use KVM_GENERIC_PRIVATE_MEM and undo the removal of EXPERT (Binbin)
 - Drop the word typically from comments (Binbin)
 - Clarify comments for the need of global tdx_lock mutex (Kai)
 - Add function comments for tdx_clear_page() (Kai)
 - Clarify comments for tdx_clear_page() poisoned page (Kai)
 - Move and update comments for limitations of __tdx_reclaim_page() (Kai)
 - Drop comment related to "rare to contend" (Kai)
 - Drop comment related to TDR and target page (Tony)
 - Make code easier to read with line breaks between paragraphs (Kai)
 - Use cond_resched() retry (Kai)
 - Use for loop for retries (Tony)
 - Use switch to handle errors (Tony)
 - Drop loop for tdh_mng_key_config() (Tony)
 - Rename tdx_reclaim_control_page() td_page_pa to ctrl_page_pa (Kai)
 - Reorganize comments for tdx_reclaim_control_page() (Kai)
 - Use smp_func_do_phymem_cache_wb() naming to indicate SMP (Kai)
 - Use bool resume in smp_func_do_phymem_cache_wb() (Kai)
 - Add comment on retrying to smp_func_do_phymem_cache_wb() (Kai)
 - Move code change to tdx_module_setup() to __tdx_bringup() due to
   initializing is done in post hardware_setup() now and
   tdx_module_setup() is removed.  Remove the code to use API to read
   global metadata but use exported 'struct tdx_sysinfo' pointer.
 - Replace 'tdx_info->nr_tdcs_pages' with a wrapper
   tdx_sysinfo_nr_tdcs_pages() because the 'struct tdx_sysinfo' doesn't
   have nr_tdcs_pages directly.
 - Replace tdx_info->max_vcpus_per_td with the new exported pointer in
   tdx_vm_init().
 - Add comment to tdx_mmu_release_hkid() on KeyID allocated (Kai)
 - Update comments for tdx_mmu_release_hkid() for locking (Kai)
 - Clarify tdx_mmu_release_hkid() comments for freeing HKID (Kai)
 - Use KVM_BUG_ON() for SEAMCALLs in tdx_mmu_release_hkid() (Kai)
 - Use continue for loop in tdx_vm_free() (Kai)
 - Clarify comments in tdx_vm_free() for reclaiming TDCS (Kai)
 - Use KVM_BUG_ON() for tdx_vm_free()
 - Prettify format with line breaks in tdx_vm_free() (Tony)
 - Prettify formatting for __tdx_td_init() with line breaks (Kai)
 - Simplify comments for __tdx_td_init() locking (Kai)
 - Update patch description (Kai)
---
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   1 +
 arch/x86/kvm/Kconfig               |   2 +
 arch/x86/kvm/vmx/main.c            |  28 +-
 arch/x86/kvm/vmx/tdx.c             | 458 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h             |   7 +-
 arch/x86/kvm/vmx/x86_ops.h         |   6 +
 arch/x86/kvm/x86.c                 |   1 +
 8 files changed, 501 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index f250137c837a..e7bd7867cb94 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -21,6 +21,7 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
+KVM_X86_OP_OPTIONAL(vm_free)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
 KVM_X86_OP(vcpu_create)
 KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 85ed576660ee..d8478e103f07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1647,6 +1647,7 @@ struct kvm_x86_ops {
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
+	void (*vm_free)(struct kvm *kvm);
 
 	/* Create, but do not attach this VCPU */
 	int (*vcpu_precreate)(struct kvm *kvm);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f09f13c01c6b..8d1c3f75028d 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -92,6 +92,8 @@ config KVM_SW_PROTECTED_VM
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
+	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
+	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6ed78deea543..ed4afa45b16b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -41,6 +41,28 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static int vt_vm_init(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_vm_init(kvm);
+
+	return vmx_vm_init(kvm);
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_mmu_release_hkid(kvm);
+
+	vmx_vm_destroy(kvm);
+}
+
+static void vt_vm_free(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		tdx_vm_free(kvm);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -72,8 +94,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
-	.vm_destroy = vmx_vm_destroy,
+
+	.vm_init = vt_vm_init,
+	.vm_destroy = vt_vm_destroy,
+	.vm_free = vt_vm_free,
 
 	.vcpu_precreate = vmx_vcpu_precreate,
 	.vcpu_create = vmx_vcpu_create,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 253debbe685f..50217f601061 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -110,6 +110,285 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
 	return 0;
 }
 
+/*
+ * Some SEAMCALLs acquire the TDX module globally, and can fail with
+ * TDX_OPERAND_BUSY.  Use a global mutex to serialize these SEAMCALLs.
+ */
+static DEFINE_MUTEX(tdx_lock);
+
+/* Maximum number of retries to attempt for SEAMCALLs. */
+#define TDX_SEAMCALL_RETRIES	10000
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
+}
+
+static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
+{
+	tdx_guest_keyid_free(kvm_tdx->hkid);
+	kvm_tdx->hkid = -1;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->hkid > 0;
+}
+
+static void tdx_clear_page(unsigned long page_pa)
+{
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+	void *page = __va(page_pa);
+	unsigned long i;
+
+	/*
+	 * The page could have been poisoned.  MOVDIR64B also clears
+	 * the poison bit so the kernel can safely use the page again.
+	 */
+	for (i = 0; i < PAGE_SIZE; i += 64)
+		movdir64b(page + i, zero_page);
+	/*
+	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
+	 * from seeing potentially poisoned cache.
+	 */
+	__mb();
+}
+
+/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
+static int __tdx_reclaim_page(hpa_t pa)
+{
+	u64 err, rcx, rdx, r8;
+	int i;
+
+	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
+		err = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
+
+		/*
+		 * TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in
+		 * the shutdown state, i.e. when destructing the TD.
+		 * TDH.PHYMEM.PAGE.RECLAIM requires TDR and target page.
+		 * Because we're destructing TD, it's rare to contend with TDR.
+		 */
+		switch (err) {
+		case TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX:
+		case TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR:
+			cond_resched();
+			continue;
+		default:
+			goto out;
+		}
+	}
+
+out:
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8);
+		return -EIO;
+	}
+	return 0;
+}
+
+static int tdx_reclaim_page(hpa_t pa)
+{
+	int r;
+
+	r = __tdx_reclaim_page(pa);
+	if (!r)
+		tdx_clear_page(pa);
+	return r;
+}
+
+
+/*
+ * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
+ * private KeyID.  Assume the cache associated with the TDX private KeyID has
+ * been flushed.
+ */
+static void tdx_reclaim_control_page(unsigned long ctrl_page_pa)
+{
+	/*
+	 * Leak the page if the kernel failed to reclaim the page.
+	 * The kernel cannot use it safely anymore.
+	 */
+	if (tdx_reclaim_page(ctrl_page_pa))
+		return;
+
+	free_page((unsigned long)__va(ctrl_page_pa));
+}
+
+static void smp_func_do_phymem_cache_wb(void *unused)
+{
+	u64 err = 0;
+	bool resume;
+	int i;
+
+	/*
+	 * TDH.PHYMEM.CACHE.WB flushes caches associated with any TDX private
+	 * KeyID on the package or core.  The TDX module may not finish the
+	 * cache flush but return TDX_INTERRUPTED_RESUMABLE instead.  The
+	 * kernel should retry it until it returns success w/o rescheduling.
+	 */
+	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
+		resume = !!err;
+		err = tdh_phymem_cache_wb(resume);
+		switch (err) {
+		case TDX_INTERRUPTED_RESUMABLE:
+			continue;
+		case TDX_NO_HKID_READY_TO_WBCACHE:
+			err = TDX_SUCCESS; /* Already done by other thread */
+			fallthrough;
+		default:
+			goto out;
+		}
+	}
+
+out:
+	if (WARN_ON_ONCE(err))
+		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+	bool packages_allocated, targets_allocated;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages, targets;
+	u64 err;
+	int i;
+
+	if (!is_hkid_assigned(kvm_tdx))
+		return;
+
+	/* KeyID has been allocated but guest is not yet configured */
+	if (!kvm_tdx->tdr_pa) {
+		tdx_hkid_free(kvm_tdx);
+		return;
+	}
+
+	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
+	cpus_read_lock();
+
+	/*
+	 * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock
+	 * and can fail with TDX_OPERAND_BUSY when it fails to get the lock.
+	 * Multiple TDX guests can be destroyed simultaneously. Take the
+	 * mutex to prevent it from getting error.
+	 */
+	mutex_lock(&tdx_lock);
+
+	/*
+	 * Releasing HKID is in vm_destroy().
+	 * After the above flushing vps, there should be no more vCPU
+	 * associations, as all vCPU fds have been released at this stage.
+	 */
+	for_each_online_cpu(i) {
+		if (packages_allocated &&
+		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
+					     packages))
+			continue;
+		if (targets_allocated)
+			cpumask_set_cpu(i, targets);
+	}
+	if (targets_allocated)
+		on_each_cpu_mask(targets, smp_func_do_phymem_cache_wb, NULL, true);
+	else
+		on_each_cpu(smp_func_do_phymem_cache_wb, NULL, true);
+	/*
+	 * In the case of error in smp_func_do_phymem_cache_wb(), the following
+	 * tdh_mng_key_freeid() will fail.
+	 */
+	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MNG_KEY_FREEID, err);
+		pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
+		       kvm_tdx->hkid);
+	} else {
+		tdx_hkid_free(kvm_tdx);
+	}
+
+	mutex_unlock(&tdx_lock);
+	cpus_read_unlock();
+	free_cpumask_var(targets);
+	free_cpumask_var(packages);
+}
+
+static void tdx_reclaim_td_control_pages(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+	int i;
+
+	/*
+	 * tdx_mmu_release_hkid() failed to reclaim HKID.  Something went badly
+	 * wrong with the TDX module.  Give up freeing TD pages.  As the function
+	 * already warned, don't warn again.
+	 */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (kvm_tdx->tdcs_pa) {
+		for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++) {
+			if (!kvm_tdx->tdcs_pa[i])
+				continue;
+
+			tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);
+		}
+		kfree(kvm_tdx->tdcs_pa);
+		kvm_tdx->tdcs_pa = NULL;
+	}
+
+	if (!kvm_tdx->tdr_pa)
+		return;
+
+	if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
+		return;
+
+	/*
+	 * Use a SEAMCALL to ask the TDX module to flush the cache based on the
+	 * KeyID. TDX module may access TDR while operating on TD (Especially
+	 * when it is reclaiming TDCS).
+	 */
+	err = tdh_phymem_page_wbinvd_tdr(kvm_tdx->tdr_pa);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+		return;
+	}
+	tdx_clear_page(kvm_tdx->tdr_pa);
+
+	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
+	kvm_tdx->tdr_pa = 0;
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+	tdx_reclaim_td_control_pages(kvm);
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+	struct kvm_tdx *kvm_tdx = param;
+	u64 err;
+
+	/* TDX_RND_NO_ENTROPY related retries are handled by sc_retry() */
+	err = tdh_mng_key_config(kvm_tdx->tdr_pa);
+
+	if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+		pr_tdx_error(TDH_MNG_KEY_CONFIG, err);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm);
+
+int tdx_vm_init(struct kvm *kvm)
+{
+	kvm->arch.has_private_mem = true;
+
+	/* Place holder for TDX specific logic. */
+	return __tdx_td_init(kvm);
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
@@ -158,6 +437,180 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+static int __tdx_td_init(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages;
+	unsigned long *tdcs_pa = NULL;
+	unsigned long tdr_pa = 0;
+	unsigned long va;
+	int ret, i;
+	u64 err;
+
+	ret = tdx_guest_keyid_alloc();
+	if (ret < 0)
+		return ret;
+	kvm_tdx->hkid = ret;
+
+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!va)
+		goto free_hkid;
+	tdr_pa = __pa(va);
+
+	kvm_tdx->nr_tdcs_pages = tdx_sysinfo->td_ctrl.tdcs_base_size / PAGE_SIZE;
+	tdcs_pa = kcalloc(kvm_tdx->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
+			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!tdcs_pa)
+		goto free_tdr;
+
+	for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++) {
+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
+		if (!va)
+			goto free_tdcs;
+		tdcs_pa[i] = __pa(va);
+	}
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto free_tdcs;
+	}
+
+	cpus_read_lock();
+
+	/*
+	 * Need at least one CPU of the package to be online in order to
+	 * program all packages for host key id.  Check it.
+	 */
+	for_each_present_cpu(i)
+		cpumask_set_cpu(topology_physical_package_id(i), packages);
+	for_each_online_cpu(i)
+		cpumask_clear_cpu(topology_physical_package_id(i), packages);
+	if (!cpumask_empty(packages)) {
+		ret = -EIO;
+		/*
+		 * Because it's hard for human operator to figure out the
+		 * reason, warn it.
+		 */
+#define MSG_ALLPKG	"All packages need to have online CPU to create TD. Online CPU and retry.\n"
+		pr_warn_ratelimited(MSG_ALLPKG);
+		goto free_packages;
+	}
+
+	/*
+	 * TDH.MNG.CREATE tries to grab the global TDX module and fails
+	 * with TDX_OPERAND_BUSY when it fails to grab.  Take the global
+	 * lock to prevent it from failure.
+	 */
+	mutex_lock(&tdx_lock);
+	kvm_tdx->tdr_pa = tdr_pa;
+	err = tdh_mng_create(kvm_tdx->tdr_pa, kvm_tdx->hkid);
+	mutex_unlock(&tdx_lock);
+
+	if (err == TDX_RND_NO_ENTROPY) {
+		ret = -EAGAIN;
+		goto free_packages;
+	}
+
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_CREATE, err);
+		ret = -EIO;
+		goto free_packages;
+	}
+
+	for_each_online_cpu(i) {
+		int pkg = topology_physical_package_id(i);
+
+		if (cpumask_test_and_set_cpu(pkg, packages))
+			continue;
+
+		/*
+		 * Program the memory controller in the package with an
+		 * encryption key associated to a TDX private host key id
+		 * assigned to this TDR.  Concurrent operations on same memory
+		 * controller results in TDX_OPERAND_BUSY. No locking needed
+		 * beyond the cpus_read_lock() above as it serializes against
+		 * hotplug and the first online CPU of the package is always
+		 * used. We never have two CPUs in the same socket trying to
+		 * program the key.
+		 */
+		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+				      kvm_tdx, true);
+		if (ret)
+			break;
+	}
+	cpus_read_unlock();
+	free_cpumask_var(packages);
+	if (ret) {
+		i = 0;
+		goto teardown;
+	}
+
+	kvm_tdx->tdcs_pa = tdcs_pa;
+	for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++) {
+		err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
+		if (err == TDX_RND_NO_ENTROPY) {
+			/* Here it's hard to allow userspace to retry. */
+			ret = -EBUSY;
+			goto teardown;
+		}
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_MNG_ADDCX, err);
+			ret = -EIO;
+			goto teardown;
+		}
+	}
+
+	/*
+	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
+	 * ioctl() to configure the CPUID values for the TD.
+	 */
+	return 0;
+
+	/*
+	 * The sequence for freeing resources from a partially initialized TD
+	 * varies based on where in the initialization flow failure occurred.
+	 * Simply use the full teardown and destroy, which naturally play nice
+	 * with partial initialization.
+	 */
+teardown:
+	/* Only free pages not yet added, so start at 'i' */
+	for (; i < kvm_tdx->nr_tdcs_pages; i++) {
+		if (tdcs_pa[i]) {
+			free_page((unsigned long)__va(tdcs_pa[i]));
+			tdcs_pa[i] = 0;
+		}
+	}
+	if (!kvm_tdx->tdcs_pa)
+		kfree(tdcs_pa);
+
+	tdx_mmu_release_hkid(kvm);
+	tdx_reclaim_td_control_pages(kvm);
+
+	return ret;
+
+free_packages:
+	cpus_read_unlock();
+	free_cpumask_var(packages);
+
+free_tdcs:
+	for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++) {
+		if (tdcs_pa[i])
+			free_page((unsigned long)__va(tdcs_pa[i]));
+	}
+	kfree(tdcs_pa);
+	kvm_tdx->tdcs_pa = NULL;
+
+free_tdr:
+	if (tdr_pa)
+		free_page((unsigned long)__va(tdr_pa));
+	kvm_tdx->tdr_pa = 0;
+
+free_hkid:
+	tdx_hkid_free(kvm_tdx);
+
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -251,6 +704,11 @@ static int __init __tdx_bringup(void)
 {
 	int r;
 
+	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
+		pr_warn("MOVDIR64B is required for TDX\n");
+		return -EOPNOTSUPP;
+	}
+
 	if (!enable_ept) {
 		pr_err("Cannot enable TDX with EPT disabled.\n");
 		return -EINVAL;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index faed454385ca..e557a82bc882 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -12,7 +12,12 @@ extern bool enable_tdx;
 
 struct kvm_tdx {
 	struct kvm kvm;
-	/* TDX specific members follow. */
+
+	unsigned long tdr_pa;
+	unsigned long *tdcs_pa;
+
+	int hkid;
+	u8 nr_tdcs_pages;
 };
 
 struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 42901be70f9d..e7d5afce68f0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -119,8 +119,14 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
+int tdx_vm_init(struct kvm *kvm);
+void tdx_mmu_release_hkid(struct kvm *kvm);
+void tdx_vm_free(struct kvm *kvm);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
+static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
+static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
+static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d86a18a4195b..8a103c29dcd0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12883,6 +12883,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	static_call_cond(kvm_x86_vm_free)(kvm);
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 17/25] KVM: TDX: create/destroy VM structure
  2024-10-30 19:00 ` [PATCH v2 17/25] KVM: TDX: create/destroy VM structure Rick Edgecombe
@ 2024-11-04  2:03   ` Chao Gao
  2024-11-04  5:59     ` Tony Lindgren
  0 siblings, 1 reply; 103+ messages in thread
From: Chao Gao @ 2024-11-04  2:03 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: pbonzini, seanjc, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson

>+static int __tdx_td_init(struct kvm *kvm)
>+{
>+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>+	cpumask_var_t packages;
>+	unsigned long *tdcs_pa = NULL;
>+	unsigned long tdr_pa = 0;
>+	unsigned long va;
>+	int ret, i;
>+	u64 err;
>+
>+	ret = tdx_guest_keyid_alloc();
>+	if (ret < 0)
>+		return ret;
>+	kvm_tdx->hkid = ret;
>+
>+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+	if (!va)
>+		goto free_hkid;

@ret should be set to -ENOMEM before goto. otherwise, the error code would be
the guest HKID.

>+	tdr_pa = __pa(va);
>+
>+	kvm_tdx->nr_tdcs_pages = tdx_sysinfo->td_ctrl.tdcs_base_size / PAGE_SIZE;
>+	tdcs_pa = kcalloc(kvm_tdx->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
>+			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>+	if (!tdcs_pa)
>+		goto free_tdr;

ditto

>+
>+	for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++) {
>+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+		if (!va)
>+			goto free_tdcs;

ditto

>+		tdcs_pa[i] = __pa(va);
>+	}
>+
>+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
>+		ret = -ENOMEM;

maybe just hoist this line before allocating tdr.

>+		goto free_tdcs;
>+	}
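
The error-path issue raised in this review can be illustrated with a small userspace sketch. It is not the kernel code: alloc_id(), td_init_sketch() and the fail_at knob are hypothetical stand-ins, and malloc() replaces the page allocator. It only shows why ret has to be reset to -ENOMEM once it has been used to carry the allocated ID.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the KeyID allocator: returns a positive ID. */
static int alloc_id(void)
{
	return 5;
}

/*
 * fail_at selects which allocation fails (0 = none, 1 = first, 2 = second)
 * so that each error path can be exercised.  Returns 0 or a negative errno.
 */
static int td_init_sketch(int fail_at)
{
	void *tdr = NULL, *tdcs = NULL;
	int ret;

	assert(fail_at >= 0 && fail_at <= 2);

	ret = alloc_id();
	if (ret < 0)
		return ret;

	/* ret still holds the ID here; reset it before any goto can return it. */
	ret = -ENOMEM;

	tdr = fail_at == 1 ? NULL : malloc(64);
	if (!tdr)
		goto out;

	tdcs = fail_at == 2 ? NULL : malloc(64);
	if (!tdcs)
		goto free_tdr;

	/* Success: the sketch keeps nothing, so release everything. */
	free(tdcs);
	free(tdr);
	return 0;

free_tdr:
	free(tdr);
out:
	return ret;
}
```

Without the reset, both failure paths would return the positive ID (5 in this sketch), which a caller checking for a negative errno would treat as success.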

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 17/25] KVM: TDX: create/destroy VM structure
  2024-11-04  2:03   ` Chao Gao
@ 2024-11-04  5:59     ` Tony Lindgren
  0 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2024-11-04  5:59 UTC (permalink / raw)
  To: Chao Gao
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson

On Mon, Nov 04, 2024 at 10:03:59AM +0800, Chao Gao wrote:
> >+static int __tdx_td_init(struct kvm *kvm)
> >+{
> >+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >+	cpumask_var_t packages;
> >+	unsigned long *tdcs_pa = NULL;
> >+	unsigned long tdr_pa = 0;
> >+	unsigned long va;
> >+	int ret, i;
> >+	u64 err;
> >+
> >+	ret = tdx_guest_keyid_alloc();
> >+	if (ret < 0)
> >+		return ret;
> >+	kvm_tdx->hkid = ret;
> >+
> >+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+	if (!va)
> >+		goto free_hkid;
> 
> @ret should be set to -ENOMEM before goto. otherwise, the error code would be
> the guest HKID.

Good catch.

> >+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> >+		ret = -ENOMEM;
> 
> maybe just hoist this line before allocating tdr.

Yeah it should be initialized earlier.

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (16 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 17/25] KVM: TDX: create/destroy VM structure Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2025-01-05 22:12   ` Huang, Kai
  2024-10-30 19:00 ` [PATCH v2 19/25] KVM: TDX: initialize VM with TDX specific parameters Rick Edgecombe
                   ` (9 subsequent siblings)
  27 siblings, 1 reply; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Report the KVM_CAP_MAX_VCPUS extension per-VM instead of globally, so
that userspace can query the maximum vCPUs for a TDX guest by checking
the KVM_CAP_MAX_VCPUS extension on a per-VM basis.

Today KVM x86 reports KVM_MAX_VCPUS as the guest's maximum vCPUs for all
guests globally, and userspace, e.g. QEMU, queries the KVM_CAP_MAX_VCPUS
extension globally rather than on a per-VM basis.

TDX has its own limit of maximum vCPUs it can support for all TDX guests
in addition to KVM_MAX_VCPUS.  TDX module reports this limit via the
MAX_VCPU_PER_TD global metadata.  Different modules may report different
values.  In practice, the reported value reflects the maximum logical
CPUs that ALL the platforms that the module supports can possibly have.

Note some old modules may also not support this metadata, in which case
the limit is U16_MAX.

The current way to always report KVM_MAX_VCPUS in the KVM_CAP_MAX_VCPUS
extension is not enough for TDX.  To accommodate TDX, change to report
the KVM_CAP_MAX_VCPUS extension on per-VM basis.

Specifically, override kvm->max_vcpus in tdx_vm_init() for TDX guest,
and report kvm->max_vcpus in the KVM_CAP_MAX_VCPUS extension check.

Change to report "the number of logical CPUs the platform has" as the
maximum vCPUs for TDX guest.  Simply forwarding the MAX_VCPU_PER_TD
reported by the TDX module would result in an unpredictable ABI because
the reported value to userspace would be depending on whims of TDX
modules.

This works in practice because the MAX_VCPU_PER_TD reported by the
TDX module will never be smaller than the one reported to userspace.
But to make sure KVM never reports an unsupported value, sanity check
the MAX_VCPU_PER_TD reported by TDX module is not smaller than the
number of logical CPUs the platform has, otherwise refuse to use TDX.

Note, when creating a TDX guest, TDX actually requires the "maximum
vCPUs for _this_ TDX guest" as an input to initialize the TDX guest.
But TDX guest's maximum vCPUs is not part of TDREPORT thus not part of
attestation, thus there's no need to allow userspace to explicitly
_configure_ the maximum vCPUs on per-VM basis.  KVM will simply use
kvm->max_vcpus as input when initializing the TDX guest.
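
The two checks described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: check_module_vcpu_limit() and td_max_vcpus() are hypothetical names, and the CPU counts are passed in as parameters rather than read from the kernel.

```c
#include <assert.h>
#include <errno.h>

/*
 * Bring-up time: refuse TDX if the module's per-TD vCPU limit is below
 * the number of logical CPUs that would be advertised to userspace.
 */
static int check_module_vcpu_limit(unsigned int max_vcpus_per_td,
				   unsigned int nr_present_cpus)
{
	return max_vcpus_per_td < nr_present_cpus ? -EINVAL : 0;
}

/* VM-creation time: cap the per-VM limit at the logical CPU count. */
static int td_max_vcpus(int kvm_max_vcpus, unsigned int nr_present_cpus)
{
	assert(kvm_max_vcpus > 0);

	if (kvm_max_vcpus < (int)nr_present_cpus)
		return kvm_max_vcpus;
	return (int)nr_present_cpus;
}
```

Because the first check has already passed by the time a TD is created, the value returned by the second helper never exceeds what the module supports.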

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Implement proposal from Sean: (Kai)
   - https://lore.kernel.org/kvm/ZmzaqRy2zjvlsDfL@google.com/
   - https://lore.kernel.org/kvm/fcbc5a898c3434af98656b92a83dbba01d055e51.camel@intel.com/
 - Change title from "KVM: TDX: Allow userspace to configure maximum
   vCPUs for TDX guests"
 - Correct setting of kvm->max_vcpus (Kai)

uAPI breakout v1:
 - Change to use exported 'struct tdx_sysinfo' pointer.
 - Remove the code to read 'max_vcpus_per_td' since it is now done in
   TDX host code.
 - Drop max_vcpu ops to use kvm.max_vcpus
 - Remove TDX_MAX_VCPUS (Kai)
 - Use type cast (u16) instead of calling memcpy() when reading the
   'max_vcpus_per_td' (Kai)
 - Improve change log and change patch title from "KVM: TDX: Make
   KVM_CAP_MAX_VCPUS backend specific" (Kai)
---
 arch/x86/kvm/vmx/main.c |  1 +
 arch/x86/kvm/vmx/tdx.c  | 51 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c      |  2 ++
 3 files changed, 54 insertions(+)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ed4afa45b16b..559f9450dec7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -7,6 +7,7 @@
 #include "pmu.h"
 #include "posted_intr.h"
 #include "tdx.h"
+#include "tdx_arch.h"
 
 static __init int vt_hardware_setup(void)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 50217f601061..c9093b003c13 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -385,6 +385,19 @@ int tdx_vm_init(struct kvm *kvm)
 {
 	kvm->arch.has_private_mem = true;
 
+	/*
+	 * TDX has its own limit of maximum vCPUs it can support for all
+	 * TDX guests in addition to KVM_MAX_VCPUS.  TDX module reports
+	 * such limit via the MAX_VCPU_PER_TD global metadata.  In
+	 * practice, it reflects the number of logical CPUs that ALL
+	 * platforms that the TDX module supports can possibly have.
+	 *
+	 * Limit TDX guest's maximum vCPUs to the number of logical CPUs
+	 * the platform has.  Simply forwarding the MAX_VCPU_PER_TD to
+	 * userspace would result in an unpredictable ABI.
+	 */
+	kvm->max_vcpus = min_t(int, kvm->max_vcpus, num_present_cpus());
+
 	/* Place holder for TDX specific logic. */
 	return __tdx_td_init(kvm);
 }
@@ -702,6 +715,7 @@ static int __init __do_tdx_bringup(void)
 
 static int __init __tdx_bringup(void)
 {
+	const struct tdx_sys_info_td_conf *td_conf;
 	int r;
 
 	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
@@ -741,6 +755,43 @@ static int __init __tdx_bringup(void)
 	    !tdx_get_supported_xfam(&tdx_sysinfo->td_conf))
 		goto get_sysinfo_err;
 
+	/*
+	 * TDX has its own limit of maximum vCPUs it can support for all
+	 * TDX guests in addition to KVM_MAX_VCPUS.  Userspace needs to
+	 * query TDX guest's maximum vCPUs by checking KVM_CAP_MAX_VCPUS
+	 * extension on per-VM basis.
+	 *
+	 * TDX module reports such limit via the MAX_VCPU_PER_TD global
+	 * metadata.  Different modules may report different values.
+	 * Some old modules may also not support this metadata (in which
+	 * case this limit is U16_MAX).
+	 *
+	 * In practice, the reported value reflects the maximum logical
+	 * CPUs that ALL the platforms that the module supports can
+	 * possibly have.
+	 *
+	 * Simply forwarding the MAX_VCPU_PER_TD to userspace could
+	 * result in an unpredictable ABI.  KVM instead always advertises
+	 * the number of logical CPUs the platform has as the maximum
+	 * vCPUs for TDX guests.
+	 *
+	 * Make sure MAX_VCPU_PER_TD reported by TDX module is not
+	 * smaller than the number of logical CPUs, otherwise KVM will
+	 * report an unsupported value to userspace.
+	 *
+	 * Note, a platform with TDX enabled in the BIOS cannot support
+	 * physical CPU hotplug, and TDX requires the BIOS has marked
+	 * all logical CPUs in MADT table as enabled.  Just use
+	 * num_present_cpus() for the number of logical CPUs.
+	 */
+	td_conf = &tdx_sysinfo->td_conf;
+	if (td_conf->max_vcpus_per_td < num_present_cpus()) {
+		pr_err("Disable TDX: MAX_VCPU_PER_TD (%u) smaller than number of logical CPUs (%u).\n",
+				td_conf->max_vcpus_per_td, num_present_cpus());
+		r = -EINVAL;
+		goto get_sysinfo_err;
+	}
+
 	/*
 	 * Leave hardware virtualization enabled after TDX is enabled
 	 * successfully.  TDX CPU hotplug depends on this.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8a103c29dcd0..95a10c7bc507 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4744,6 +4744,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_MAX_VCPUS:
 		r = KVM_MAX_VCPUS;
+		if (kvm)
+			r = kvm->max_vcpus;
 		break;
 	case KVM_CAP_MAX_VCPU_ID:
 		r = KVM_MAX_VCPU_IDS;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check
  2024-10-30 19:00 ` [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check Rick Edgecombe
@ 2025-01-05 22:12   ` Huang, Kai
  2025-01-06 19:09     ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Huang, Kai @ 2025-01-05 22:12 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com,
	Edgecombe, Rick P
  Cc: Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette,
	Yamahata, Isaku

On Wed, 2024-10-30 at 12:00 -0700, Rick Edgecombe wrote:
> Note some old modules may also not support this metadata, in which case
> the limit is U16_MAX.

+Dave for a side topic.

I think we should delete this sentence in the new version of this patch, since
it is now obsolete with the new patch that reads essential metadata for
KVM.

This sentence was needed since originally we had code to do (pseudo):

  if (read_sys_metadata_field(MAX_VCPUS_PER_TD, &td_conf->max_vcpus_per_td))
      td_conf->max_vcpus_per_td = U16_MAX;

Now the above code is removed in the patch which reads essential metadata for
KVM, and reading failure of this metadata will be fatal just like reading
others.

It was removed because, when I was trying to avoid special handling in the
python script when generating the metadata reading code, I found the NO_RBP_MOD
feature was introduced to the module way after the MAX_VCPUS_PER_TD metadata was
added, therefore practically this field will always be present for the modules
that Linux supports.

Please let me know if you have a different opinion, i.e., that we should still
do it the old way in the patch which reads essential metadata for KVM.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check
  2025-01-05 22:12   ` Huang, Kai
@ 2025-01-06 19:09     ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-06 19:09 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hansen, Dave, seanjc@google.com, Huang, Kai
  Cc: isaku.yamahata@gmail.com, Li, Xiaoyao, Chatre, Reinette,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org

On Sun, 2025-01-05 at 22:12 +0000, Huang, Kai wrote:
> I think we should delete this sentence in the new version of this patch, since
> it is now obsolete with the new patch that reads essential metadata for
> KVM.
> 
> This sentence was needed since originally we had code to do (pseudo):
> 
>   if (read_sys_metadata_field(MAX_VCPUS_PER_TD, &td_conf->max_vcpus_per_td))
>       td_conf->max_vcpus_per_td = U16_MAX;
> 
> Now the above code is removed in the patch which reads essential metadata for
> KVM, and reading failure of this metadata will be fatal just like reading
> others.
> 
> It was removed because, when I was trying to avoid special handling in the
> python script when generating the metadata reading code, I found the NO_RBP_MOD
> feature was introduced to the module way after the MAX_VCPUS_PER_TD metadata was
> added, therefore practically this field will always be present for the modules
> that Linux supports.
> 
> Please let me know if you have different opinion, i.e., we should still do the
> old way in the patch which reads essential metadata for KVM?

Makes sense to me.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 19/25] KVM: TDX: initialize VM with TDX specific parameters
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (17 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 20/25] KVM: TDX: Make pmu_intel.c ignore guest TD case Rick Edgecombe
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest.  This
"per-VM" TDX initialization sets up the global configuration and features
that the TDX guest can support, such as the guest's CPUIDs (emulated by the
TDX module) and the maximum number of vCPUs.

This "per-VM" TDX initialization must be done before any "vcpu-scope" TDX
initialization.  To match this better, require the KVM_TDX_INIT_VM IOCTL()
to be done before KVM creates any vcpus.
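
The ordering requirement can be pictured with a minimal state-tracking sketch. Everything in it is hypothetical: struct vm_sketch, td_init_vm() and td_create_vcpu() are illustrative names, and the errno values are assumptions rather than the values this series returns.

```c
#include <assert.h>
#include <errno.h>

enum td_state_sketch { TD_SKETCH_UNINITIALIZED, TD_SKETCH_INITIALIZED };

struct vm_sketch {
	enum td_state_sketch state;
	int created_vcpus;
};

/* VM-scope init: allowed once, and only before any vCPU exists. */
static int td_init_vm(struct vm_sketch *vm)
{
	assert(vm);

	if (vm->state != TD_SKETCH_UNINITIALIZED)
		return -EINVAL;		/* assumed errno: already initialized */
	if (vm->created_vcpus)
		return -EBUSY;		/* assumed errno: vCPUs already exist */

	vm->state = TD_SKETCH_INITIALIZED;
	return 0;
}

/* vCPU-scope setup: only valid after the VM-scope init has been done. */
static int td_create_vcpu(struct vm_sketch *vm)
{
	assert(vm);

	if (vm->state != TD_SKETCH_INITIALIZED)
		return -EIO;		/* assumed errno: VM not initialized */

	vm->created_vcpus++;
	return 0;
}
```

Gating vCPU creation on the VM state keeps the "VM first, vCPUs second" rule in one place instead of relying on userspace to order the calls correctly.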

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Move setting of gfn_direct_bits into MMU part 2 series (Yan)
 - Enable NO_RBP_MOD after removing workaround in later patches (Fengwei)
 - Use TDX 1.5 naming of config_flags instead of exec_controls (Xiaoyao)
 - fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
   wrappers (Kai)
 - Reject leaves that are not in tdx caps (Rick)
 - Add TD state handling (Tony)
 - Generate data directly from td_conf (Tony)
 - Fold in guest physical address to configure EPT level (Xiaoyao)
 - Use helpers for phys_addr_bits and add comments (Paolo)

uAPI breakout v1:
 - Drop TDX_TD_XFAM_CET and use XFEATURE_MASK_CET_{USER, KERNEL}.
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Move gfn_shared_mask settings into this patch due to MMU section move
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)
 - Allow userspace to configure xfam directly
 - Check if user sets non-configurable bits in CPUIDs
 - Rename error->hw_error
 - Move code change to tdx_module_setup() to __tdx_bringup() due to
   initializing is done in post hardware_setup() now and
   tdx_module_setup() is removed.  Remove the code to use API to read
   global metadata but use exported 'struct tdx_sysinfo' pointer.
 - Replace 'tdx_info->nr_tdcs_pages' with a wrapper
   tdx_sysinfo_nr_tdcs_pages() because the 'struct tdx_sysinfo' doesn't
   have nr_tdcs_pages directly.
 - Replace tdx_info->max_vcpus_per_td with the new exported pointer in
   tdx_vm_init().
 - Decrease the reserved space for struct kvm_tdx_init_vm (Kai)
 - Use sizeof_field() for struct kvm_tdx_init_vm cpuids (Tony)
 - No need to init init_vm, it gets copied over in tdx_td_init() (Chao)
 - Use kmalloc() instead of kzalloc() for init_vm in tdx_td_init() (Chao)
 - Add more line breaks to tdx_td_init() to make code easier to read (Tony)
 - Clarify patch description (Kai)

v19:
 - Check NO_RBP_MOD of feature0 and set it
 - Update the comment for PT and CET

v18:
 - remove the change of tools/arch/x86/include/uapi/asm/kvm.h
 - typo in comment. sha348 => sha384
 - updated comment in setup_tdparams_xfam()
 - fix setup_tdparams_xfam() to use init_vm instead of td_params

---
 arch/x86/include/uapi/asm/kvm.h |  24 +++
 arch/x86/kvm/cpuid.c            |   7 +
 arch/x86/kvm/cpuid.h            |   2 +
 arch/x86/kvm/vmx/tdx.c          | 259 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h          |  24 +++
 5 files changed, 306 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0630530af334..892e16bd7430 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -929,6 +929,7 @@ struct kvm_hyperv_eventfd {
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -959,4 +960,27 @@ struct kvm_tdx_capabilities {
 	struct kvm_cpuid2 cpuid;
 };
 
+struct kvm_tdx_init_vm {
+	__u64 attributes;
+	__u64 xfam;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha384 digest */
+
+	/* The total space for TD_PARAMS before the CPUIDs is 256 bytes */
+	__u64 reserved[12];
+
+	/*
+	 * Call KVM_TDX_INIT_VM before vcpu creation, thus before
+	 * KVM_SET_CPUID2.
+	 * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
+	 * TDX module directly virtualizes those CPUIDs without VMM.  The user
+	 * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
+	 * those values.  If it doesn't, KVM may have wrong idea of vCPUIDs of
+	 * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
+	 * module doesn't virtualize.
+	 */
+	struct kvm_cpuid2 cpuid;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 41786b834b16..14be20e003f4 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1513,6 +1513,13 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
 	return r;
 }
 
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+					       int nent, u32 function, u64 index)
+{
+	return cpuid_entry2_find(entries, nent, function, index);
+}
+EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2);
+
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
 						    u32 function, u32 index)
 {
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 41697cca354e..00570227e2ae 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -13,6 +13,8 @@ void kvm_set_cpu_caps(void);
 
 void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
 void kvm_update_pv_runtime(struct kvm_vcpu *vcpu);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+					       int nent, u32 function, u64 index);
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
 						    u32 function, u32 index);
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c9093b003c13..ac224d79ba1e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
 #include "x86_ops.h"
 #include "tdx.h"
 
+
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
@@ -63,6 +64,11 @@ static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
 	return val;
 }
 
+static int tdx_get_guest_phys_addr_bits(const u32 eax)
+{
+	return (eax & GENMASK(23, 16)) >> 16;
+}
+
 static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
 {
 	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
@@ -360,7 +366,11 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 
 void tdx_vm_free(struct kvm *kvm)
 {
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
 	tdx_reclaim_td_control_pages(kvm);
+
+	kvm_tdx->state = TD_STATE_UNINITIALIZED;
 }
 
 static int tdx_do_tdh_mng_key_config(void *param)
@@ -379,10 +389,10 @@ static int tdx_do_tdh_mng_key_config(void *param)
 	return 0;
 }
 
-static int __tdx_td_init(struct kvm *kvm);
-
 int tdx_vm_init(struct kvm *kvm)
 {
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
 	kvm->arch.has_private_mem = true;
 
 	/*
@@ -398,8 +408,9 @@ int tdx_vm_init(struct kvm *kvm)
 	 */
 	kvm->max_vcpus = min_t(int, kvm->max_vcpus, num_present_cpus());
 
-	/* Place holder for TDX specific logic. */
-	return __tdx_td_init(kvm);
+	kvm_tdx->state = TD_STATE_UNINITIALIZED;
+
+	return 0;
 }
 
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -450,7 +461,142 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
-static int __tdx_td_init(struct kvm *kvm)
+/*
+ * KVM reports guest physical address bits in CPUID.0x80000008.EAX[23:16],
+ * which is similar to TDX's GPAW. Use this field as the interface for
+ * userspace to configure the GPAW and EPT level for TDs.
+ *
+ * Only values 48 and 52 are supported. Value 52 means GPAW-52 and EPT level
+ * 5; value 48 means GPAW-48 and EPT level 4. GPAW-48 is always supported,
+ * while value 52 is only supported when the platform supports 5-level
+ * EPT.
+ */
+static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
+					struct td_params *td_params)
+{
+	const struct kvm_cpuid_entry2 *entry;
+	int guest_pa;
+
+	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
+	if (!entry)
+		return -EINVAL;
+
+	guest_pa = tdx_get_guest_phys_addr_bits(entry->eax);
+
+	if (guest_pa != 48 && guest_pa != 52)
+		return -EINVAL;
+
+	if (guest_pa == 52 && !cpu_has_vmx_ept_5levels())
+		return -EINVAL;
+
+	td_params->eptp_controls = VMX_EPTP_MT_WB;
+	if (guest_pa == 52) {
+		td_params->eptp_controls |= VMX_EPTP_PWL_5;
+		td_params->config_flags |= TDX_CONFIG_FLAGS_MAX_GPAW;
+	} else {
+		td_params->eptp_controls |= VMX_EPTP_PWL_4;
+	}
+
+	return 0;
+}
+
+static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
+				 struct td_params *td_params)
+{
+	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
+	const struct kvm_cpuid_entry2 *entry;
+	struct tdx_cpuid_value *value;
+	int i, copy_cnt = 0;
+
+	/*
+	 * td_params.cpuid_values: The number and order of cpuid_value entries
+	 * must match struct tdsysinfo.{num_cpuid_config, cpuid_configs}.
+	 * It's assumed that td_params was zeroed.
+	 */
+	for (i = 0; i < td_conf->num_cpuid_config; i++) {
+		struct kvm_cpuid_entry2 tmp;
+
+		td_init_cpuid_entry2(&tmp, i);
+
+		entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
+					      tmp.function, tmp.index);
+		if (!entry)
+			continue;
+
+		copy_cnt++;
+
+		value = &td_params->cpuid_values[i];
+		value->eax = entry->eax;
+		value->ebx = entry->ebx;
+		value->ecx = entry->ecx;
+		value->edx = entry->edx;
+
+		/*
+		 * TDX module does not accept nonzero bits 16..23 for the
+		 * CPUID[0x80000008].EAX, see setup_tdparams_eptp_controls().
+		 */
+		if (tmp.function == 0x80000008)
+			value->eax = tdx_set_guest_phys_addr_bits(value->eax, 0);
+	}
+
+	/*
+	 * Rely on the TDX module to reject invalid configurations, but it can't
+	 * check leaves that have no slot in td_params->cpuid_values to hold
+	 * them. So fail if there were entries that didn't get copied to
+	 * td_params.
+	 */
+	if (copy_cnt != cpuid->nent)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+			struct kvm_tdx_init_vm *init_vm)
+{
+	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
+	struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
+	int ret;
+
+	if (kvm->created_vcpus)
+		return -EBUSY;
+
+	if (init_vm->attributes & ~tdx_get_supported_attrs(td_conf))
+		return -EINVAL;
+
+	if (init_vm->xfam & ~tdx_get_supported_xfam(td_conf))
+		return -EINVAL;
+
+	td_params->max_vcpus = kvm->max_vcpus;
+	td_params->attributes = init_vm->attributes | td_conf->attributes_fixed1;
+	td_params->xfam = init_vm->xfam | td_conf->xfam_fixed1;
+
+	td_params->config_flags = TDX_CONFIG_FLAGS_NO_RBP_MOD;
+	td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
+
+	ret = setup_tdparams_eptp_controls(cpuid, td_params);
+	if (ret)
+		return ret;
+
+	ret = setup_tdparams_cpuids(cpuid, td_params);
+	if (ret)
+		return ret;
+
+#define MEMCPY_SAME_SIZE(dst, src)				\
+	do {							\
+		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
+		memcpy((dst), (src), sizeof(dst));		\
+	} while (0)
+
+	MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
+	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
+	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
+
+	return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
+			 u64 *seamcall_err)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	cpumask_var_t packages;
@@ -458,8 +604,9 @@ static int __tdx_td_init(struct kvm *kvm)
 	unsigned long tdr_pa = 0;
 	unsigned long va;
 	int ret, i;
-	u64 err;
+	u64 err, rcx;
 
+	*seamcall_err = 0;
 	ret = tdx_guest_keyid_alloc();
 	if (ret < 0)
 		return ret;
@@ -573,10 +720,23 @@ static int __tdx_td_init(struct kvm *kvm)
 		}
 	}
 
-	/*
-	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
-	 * ioctl() to define the configure CPUID values for the TD.
-	 */
+	err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &rcx);
+	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
+		/*
+		 * The operands come from userspace, so don't warn.
+		 * Return a hint to the user because it's sometimes hard for the
+		 * user to figure out which operand is invalid.  The SEAMCALL
+		 * status code identifies the operand that caused the error.
+		 */
+		*seamcall_err = err;
+		ret = -EINVAL;
+		goto teardown;
+	} else if (WARN_ON_ONCE(err)) {
+		pr_tdx_error_1(TDH_MNG_INIT, err, rcx);
+		ret = -EIO;
+		goto teardown;
+	}
+
 	return 0;
 
 	/*
@@ -624,6 +784,82 @@ static int __tdx_td_init(struct kvm *kvm)
 	return ret;
 }
 
+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_vm *init_vm;
+	struct td_params *td_params = NULL;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
+	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+	if (kvm_tdx->state != TD_STATE_UNINITIALIZED)
+		return -EINVAL;
+
+	if (cmd->flags)
+		return -EINVAL;
+
+	init_vm = kmalloc(sizeof(*init_vm) +
+			  sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
+			  GFP_KERNEL);
+	if (!init_vm)
+		return -ENOMEM;
+
+	if (copy_from_user(init_vm, u64_to_user_ptr(cmd->data), sizeof(*init_vm))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) {
+		ret = -E2BIG;
+		goto out;
+	}
+
+	if (copy_from_user(init_vm->cpuid.entries,
+			   u64_to_user_ptr(cmd->data) + sizeof(*init_vm),
+			   flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (init_vm->cpuid.padding) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
+	if (!td_params) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = setup_tdparams(kvm, td_params, init_vm);
+	if (ret)
+		goto out;
+
+	ret = __tdx_td_init(kvm, td_params, &cmd->hw_error);
+	if (ret)
+		goto out;
+
+	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+	kvm_tdx->attributes = td_params->attributes;
+	kvm_tdx->xfam = td_params->xfam;
+
+	kvm_tdx->state = TD_STATE_INITIALIZED;
+out:
+	/* kfree() accepts NULL. */
+	kfree(init_vm);
+	kfree(td_params);
+
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -645,6 +881,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_CAPABILITIES:
 		r = tdx_get_capabilities(&tdx_cmd);
 		break;
+	case KVM_TDX_INIT_VM:
+		r = tdx_td_init(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e557a82bc882..1fcb7c1b078d 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -10,14 +10,27 @@ void tdx_cleanup(void);
 
 extern bool enable_tdx;
 
+/* TDX module hardware states. These follow the TDX module OP_STATEs. */
+enum kvm_tdx_state {
+	TD_STATE_UNINITIALIZED = 0,
+	TD_STATE_INITIALIZED,
+	TD_STATE_RUNNABLE,
+};
+
 struct kvm_tdx {
 	struct kvm kvm;
 
 	unsigned long tdr_pa;
 	unsigned long *tdcs_pa;
 
+	u64 attributes;
+	u64 xfam;
 	int hkid;
 	u8 nr_tdcs_pages;
+
+	u64 tsc_offset;
+
+	enum kvm_tdx_state state;
 };
 
 struct vcpu_tdx {
@@ -45,6 +58,17 @@ static __always_inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
 
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+	u64 err, data;
+
+	err = tdh_mng_rd(kvm_tdx->tdr_pa, TDCS_EXEC(field), &data);
+	if (unlikely(err)) {
+		pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+		return 0;
+	}
+	return data;
+}
 #else
 static inline void tdx_bringup(void) {}
 static inline void tdx_cleanup(void) {}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 20/25] KVM: TDX: Make pmu_intel.c ignore guest TD case
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (18 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 19/25] KVM: TDX: initialize VM with TDX specific parameters Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 21/25] KVM: TDX: Don't offline the last cpu of one package when there's TDX guest Rick Edgecombe
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM doesn't support the PMU yet; that is future work for a separate
patch series. For now, handle TDX by updating vcpu_to_lbr_desc() and
vcpu_to_lbr_records() to return NULL.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Make vcpu_to_lbr_desc() return NULL (Paolo)
 - Drop unnecessary ifdefs around is_td_vcpu() (Tony)

uAPI breakout v1:
 - Fix bisectability issues in headers (Kai)
 - Fix rebase error from v19 (Chao Gao)
 - Make helpers static (Tony Lindgren)
 - Improve whitespace (Tony Lindgren)

v18:
 - Removed unnecessary change to vmx.c which caused kernel warning.
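
Note for reviewers: the pattern this patch relies on (an accessor that
returns NULL for TD vcpus, with every caller checking before dereferencing)
can be modelled in userspace roughly as below. All names here
(struct vcpu_model, struct lbr_desc_model, the model_*() helpers) are
hypothetical stand-ins, not KVM types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct lbr_desc; only the record count is modelled. */
struct lbr_desc_model {
	int nr;
};

/* Stand-in for a vcpu; is_td models what is_td_vcpu() would report. */
struct vcpu_model {
	bool is_td;
	struct lbr_desc_model lbr;
};

/* TD vcpus have no LBR descriptor, so hand back NULL for them. */
static struct lbr_desc_model *model_vcpu_to_lbr_desc(struct vcpu_model *v)
{
	return v->is_td ? NULL : &v->lbr;
}

/* Callers must tolerate NULL: a TD vcpu never has LBRs enabled. */
static bool model_lbr_is_enabled(struct vcpu_model *v)
{
	struct lbr_desc_model *desc = model_vcpu_to_lbr_desc(v);

	return desc && desc->nr != 0;
}
```

The point of the sketch is that the NULL return moves the TD check into
one place, at the cost of a NULL test in each caller.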
---
 arch/x86/kvm/vmx/pmu_intel.c | 50 +++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/pmu_intel.h | 28 ++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.h       | 34 +-----------------------
 3 files changed, 78 insertions(+), 34 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/pmu_intel.h

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 83382a4d1d66..1cd92b43f463 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,7 @@
 #include "lapic.h"
 #include "nested.h"
 #include "pmu.h"
+#include "tdx.h"
 
 /*
  * Perf's "BASE" is wildly misleading, architectural PMUs use bits 31:16 of ECX
@@ -34,6 +35,22 @@
 
 #define MSR_PMC_FULL_WIDTH_BIT      (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)
 
+static struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return NULL;
+
+	return &to_vmx(vcpu)->lbr_desc;
+}
+
+static struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return NULL;
+
+	return &to_vmx(vcpu)->lbr_desc.records;
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	struct kvm_pmc *pmc;
@@ -129,6 +146,22 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
 	return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
 }
 
+static bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return cpuid_model_is_consistent(vcpu);
+}
+
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return !!vcpu_to_lbr_records(vcpu)->nr;
+}
+
 static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
 {
 	struct x86_pmu_lbr *records = vcpu_to_lbr_records(vcpu);
@@ -194,6 +227,9 @@ static inline void intel_pmu_release_guest_lbr_event(struct kvm_vcpu *vcpu)
 {
 	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
 
+	if (!lbr_desc)
+		return;
+
 	if (lbr_desc->event) {
 		perf_event_release_kernel(lbr_desc->event);
 		lbr_desc->event = NULL;
@@ -235,6 +271,9 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu)
 					PERF_SAMPLE_BRANCH_USER,
 	};
 
+	if (WARN_ON_ONCE(!lbr_desc))
+		return 0;
+
 	if (unlikely(lbr_desc->event)) {
 		__set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use);
 		return 0;
@@ -466,6 +505,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	u64 perf_capabilities;
 	u64 counter_rsvd;
 
+	if (!lbr_desc)
+		return;
+
 	memset(&lbr_desc->records, 0, sizeof(lbr_desc->records));
 
 	/*
@@ -542,7 +584,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
 
 	perf_capabilities = vcpu_get_perf_capabilities(vcpu);
-	if (cpuid_model_is_consistent(vcpu) &&
+	if (intel_pmu_lbr_is_compatible(vcpu) &&
 	    (perf_capabilities & PMU_CAP_LBR_FMT))
 		memcpy(&lbr_desc->records, &vmx_lbr_caps, sizeof(vmx_lbr_caps));
 	else
@@ -570,6 +612,9 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
 
+	if (!lbr_desc)
+		return;
+
 	for (i = 0; i < KVM_MAX_NR_INTEL_GP_COUNTERS; i++) {
 		pmu->gp_counters[i].type = KVM_PMC_GP;
 		pmu->gp_counters[i].vcpu = vcpu;
@@ -677,6 +722,9 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
 
+	if (WARN_ON_ONCE(!lbr_desc))
+		return;
+
 	if (!lbr_desc->event) {
 		vmx_disable_lbr_msrs_passthrough(vcpu);
 		if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)
diff --git a/arch/x86/kvm/vmx/pmu_intel.h b/arch/x86/kvm/vmx/pmu_intel.h
new file mode 100644
index 000000000000..5620d0882cdc
--- /dev/null
+++ b/arch/x86/kvm/vmx/pmu_intel.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_PMU_INTEL_H
+#define __KVM_X86_VMX_PMU_INTEL_H
+
+#include <linux/kvm_host.h>
+
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
+int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
+
+struct lbr_desc {
+	/* Basic info about guest LBR records. */
+	struct x86_pmu_lbr records;
+
+	/*
+	 * Emulate LBR feature via passthrough LBR registers when the
+	 * per-vcpu guest LBR event is scheduled on the current pcpu.
+	 *
+	 * The records may be inaccurate if the host reclaims the LBR.
+	 */
+	struct perf_event *event;
+
+	/* True if LBRs are marked as not intercepted in the MSR bitmap */
+	bool msr_passthrough;
+};
+
+extern struct x86_pmu_lbr vmx_lbr_caps;
+
+#endif /* __KVM_X86_VMX_PMU_INTEL_H */
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index ad9efe41e691..37a555c6dfbf 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
 
 #include "capabilities.h"
 #include "../kvm_cache_regs.h"
+#include "pmu_intel.h"
 #include "vmcs.h"
 #include "vmx_ops.h"
 #include "../cpuid.h"
@@ -90,24 +91,6 @@ union vmx_exit_reason {
 	u32 full;
 };
 
-struct lbr_desc {
-	/* Basic info about guest LBR records. */
-	struct x86_pmu_lbr records;
-
-	/*
-	 * Emulate LBR feature via passthrough LBR registers when the
-	 * per-vcpu guest LBR event is scheduled on the current pcpu.
-	 *
-	 * The records may be inaccurate if the host reclaims the LBR.
-	 */
-	struct perf_event *event;
-
-	/* True if LBRs are marked as not intercepted in the MSR bitmap */
-	bool msr_passthrough;
-};
-
-extern struct x86_pmu_lbr vmx_lbr_caps;
-
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -659,21 +642,6 @@ static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
-static inline struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
-{
-	return &to_vmx(vcpu)->lbr_desc;
-}
-
-static inline struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
-{
-	return &vcpu_to_lbr_desc(vcpu)->records;
-}
-
-static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
-{
-	return !!vcpu_to_lbr_records(vcpu)->nr;
-}
-
 void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
 int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
 void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 21/25] KVM: TDX: Don't offline the last cpu of one package when there's TDX guest
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (19 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 20/25] KVM: TDX: Make pmu_intel.c ignore guest TD case Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 22/25] KVM: TDX: create/free TDX vcpu structure Rick Edgecombe
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Binbin Wu

From: Isaku Yamahata <isaku.yamahata@intel.com>

Destroying a TDX guest requires at least one online cpu in each package,
because reclaiming the guest's TDX KeyID (as part of the teardown
process) requires calling some SEAMCALL (on any cpu) on every package.

Do not offline the last cpu of a package while any TDX guest is running;
otherwise KVM may be unable to tear down the TDX guest, leaking the TDX
KeyID and other resources such as TDX guest control structure pages.

Implement the TDX version of 'offline_cpu()' to prevent a cpu from going
offline if it is the last online cpu on its package.

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
uAPI breakout v2:
 - Update description to leave out stale part (Binbin)
 - Add some local hkid tracking on KVM side, now that the allocator is
   in arch/x86 code (Kai)

uAPI breakout v1:
 - Remove nr_configured_keyid, use ida_is_empty() instead (Chao)
 - Change to use a simpler way to check whether the to-go-offline cpu is
   the last online cpu on the package. (Chao)
 - Improve the changelog (Kai)
 - Improve the patch title to call out "when there's TDX guest".  (Kai)
 - Significantly reduce the code by using TDX's own CPUHP callback,
   instead of hooking into KVM's.
 - Update changelog to reflect the change.

v18:
 - Added reviewed-by BinBin
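
Note for reviewers: the decision made by tdx_offline_cpu() can be sketched
in userspace as below. This is a simplified model under stated assumptions:
struct cpu_model and model_offline_cpu() are hypothetical, the array index
plays the role of the cpu number, and -1 stands in for -EBUSY.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for per-cpu topology: package id plus online state. */
struct cpu_model {
	int package_id;
	bool online;
};

/*
 * Return 0 if cpu may go offline, -1 (stand-in for -EBUSY) if it is the
 * last online cpu of its package while at least one HKID is configured.
 */
static int model_offline_cpu(const struct cpu_model *cpus, size_t nr_cpus,
			     size_t cpu, int nr_configured_hkid)
{
	size_t i;

	/* No TD holds an HKID, so any cpu may go offline. */
	if (!nr_configured_hkid)
		return 0;

	for (i = 0; i < nr_cpus; i++) {
		/* Another online cpu on the same package: allow it. */
		if (i != cpu && cpus[i].online &&
		    cpus[i].package_id == cpus[cpu].package_id)
			return 0;
	}

	return -1;
}
```

With two packages where one package has a single online cpu, the model
refuses to offline that cpu only while the HKID count is nonzero.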
---
 arch/x86/kvm/vmx/tdx.c | 43 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ac224d79ba1e..17df857ae4c1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -122,6 +122,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
  */
 static DEFINE_MUTEX(tdx_lock);
 
+static atomic_t nr_configured_hkid;
+
 /* Maximum number of retries to attempt for SEAMCALLs. */
 #define TDX_SEAMCALL_RETRIES	10000
 
@@ -134,6 +136,7 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
 {
 	tdx_guest_keyid_free(kvm_tdx->hkid);
 	kvm_tdx->hkid = -1;
+	atomic_dec(&nr_configured_hkid);
 }
 
 static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
@@ -612,6 +615,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		return ret;
 	kvm_tdx->hkid = ret;
 
+	atomic_inc(&nr_configured_hkid);
+
 	va = __get_free_page(GFP_KERNEL_ACCOUNT);
 	if (!va)
 		goto free_hkid;
@@ -913,6 +918,42 @@ static int tdx_online_cpu(unsigned int cpu)
 	return r;
 }
 
+static int tdx_offline_cpu(unsigned int cpu)
+{
+	int i;
+
+	/* No TD is running.  Allow any cpu to go offline. */
+	if (!atomic_read(&nr_configured_hkid))
+		return 0;
+
+	/*
+	 * Reclaiming a TDX HKID (i.e. when deleting a guest TD) requires
+	 * calling TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory
+	 * controllers with pconfig.  If there is an active TDX HKID, refuse to
+	 * offline the last online cpu of a package.
+	 */
+	for_each_online_cpu(i) {
+		/*
+		 * Found another online cpu on the same package.
+		 * Allow to offline.
+		 */
+		if (i != cpu && topology_physical_package_id(i) ==
+				topology_physical_package_id(cpu))
+			return 0;
+	}
+
+	/*
+	 * This is the last cpu of this package.  Don't offline it.
+	 *
+	 * Warn, because the reason would otherwise be hard for a human
+	 * operator to understand.
+	 */
+#define MSG_ALLPKG_ONLINE \
+	"TDX requires all packages to have an online CPU. Delete all TDs in order to offline all CPUs of a package.\n"
+	pr_warn_ratelimited(MSG_ALLPKG_ONLINE);
+	return -EBUSY;
+}
+
 static void __do_tdx_cleanup(void)
 {
 	/*
@@ -938,7 +979,7 @@ static int __init __do_tdx_bringup(void)
 	 */
 	r = cpuhp_setup_state_cpuslocked(CPUHP_AP_ONLINE_DYN,
 					 "kvm/cpu/tdx:online",
-					 tdx_online_cpu, NULL);
+					 tdx_online_cpu, tdx_offline_cpu);
 	if (r < 0)
 		return r;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 22/25] KVM: TDX: create/free TDX vcpu structure
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (20 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 21/25] KVM: TDX: Don't offline the last cpu of one package when there's TDX guest Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 23/25] KVM: TDX: Do TDX specific vcpu initialization Rick Edgecombe
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement vcpu-related stubs for TDX for create, reset and free.

For now, set up only the features that do not require a TDX SEAMCALL.
The TDX-specific vcpu initialization will be handled by KVM_TDX_INIT_VCPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Drop dummy tdx_vcpu_reset() (Binbin)
 - Add TD state handling (Tony)
uAPI breakout v1:
 - Dropped unnecessary WARN_ON_ONCE() in tdx_vcpu_create().
   WARN_ON_ONCE(vcpu->arch.cpuid_entries),
   WARN_ON_ONCE(vcpu->arch.cpuid_nent)
 - Use kvm_tdx instead of to_kvm_tdx() in tdx_vcpu_create() (Chao)

v19:
 - removed stale comment in tdx_vcpu_create().

v18:
 - update commit log to use create instead of allocate because the patch
   doesn't newly allocate memory for TDX vcpu.

v16:
 - Add AMX support as the KVM upstream supports it.
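
Note for reviewers: the TD state gating that tdx_vcpu_create() depends on
can be modelled in userspace roughly as below. This is a sketch under
stated assumptions: the td_model types and model_*() helpers are
hypothetical, and MODEL_ATTR_DEBUG is a placeholder bit, not necessarily
the value of TDX_TD_ATTR_DEBUG.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the ordering of enum kvm_tdx_state in this series. */
enum td_state_model {
	TD_MODEL_UNINITIALIZED = 0,
	TD_MODEL_INITIALIZED,
	TD_MODEL_RUNNABLE,
};

/* Placeholder for the debug attribute bit (assumed value). */
#define MODEL_ATTR_DEBUG	(1ull << 0)

struct td_model {
	enum td_state_model state;
	uint64_t attributes;
};

/* VM-scope init is only valid once, from the uninitialized state. */
static int model_td_init(struct td_model *td, uint64_t attributes)
{
	if (td->state != TD_MODEL_UNINITIALIZED)
		return -EINVAL;

	td->attributes = attributes;
	td->state = TD_MODEL_INITIALIZED;
	return 0;
}

/* vcpu creation requires that VM-scope init has already happened. */
static int model_vcpu_create(const struct td_model *td, bool *state_protected)
{
	if (td->state != TD_MODEL_INITIALIZED)
		return -EIO;

	/* Guest state is protected unless the TD is a debug TD. */
	*state_protected = !(td->attributes & MODEL_ATTR_DEBUG);
	return 0;
}
```

In this model, creating a vcpu before the VM-scope init fails, and a second
VM-scope init is rejected, which matches the ordering the uAPI comment asks
userspace to follow.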
---
 arch/x86/kvm/vmx/main.c    | 42 ++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     | 34 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  8 ++++++++
 3 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 559f9450dec7..0548d54eb055 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -64,6 +64,40 @@ static void vt_vm_free(struct kvm *kvm)
 		tdx_vm_free(kvm);
 }
 
+static int vt_vcpu_precreate(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_vcpu_precreate(kvm);
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_create(vcpu);
+
+	return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_free(vcpu);
+		return;
+	}
+
+	vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_vcpu_reset(vcpu, init_event);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -100,10 +134,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vm_destroy = vt_vm_destroy,
 	.vm_free = vt_vm_free,
 
-	.vcpu_precreate = vmx_vcpu_precreate,
-	.vcpu_create = vmx_vcpu_create,
-	.vcpu_free = vmx_vcpu_free,
-	.vcpu_reset = vmx_vcpu_reset,
+	.vcpu_precreate = vt_vcpu_precreate,
+	.vcpu_create = vt_vcpu_create,
+	.vcpu_free = vt_vcpu_free,
+	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
 	.vcpu_load = vmx_vcpu_load,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 17df857ae4c1..479ffb8f41c8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -416,6 +416,40 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (kvm_tdx->state != TD_STATE_INITIALIZED)
+		return -EIO;
+
+	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+	if (!vcpu->arch.apic)
+		return -EINVAL;
+
+	fpstate_set_confidential(&vcpu->arch.guest_fpu);
+
+	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+	vcpu->arch.cr0_guest_owned_bits = -1ul;
+	vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
+	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+	vcpu->arch.guest_state_protected =
+		!(kvm_tdx->attributes & TDX_TD_ATTR_DEBUG);
+
+	if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
+		vcpu->arch.xfd_no_write_intercept = true;
+
+	return 0;
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	/* This is a stub for now.  More logic will come. */
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index e7d5afce68f0..107c60ac94f4 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -122,12 +122,20 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_release_hkid(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 #else
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 23/25] KVM: TDX: Do TDX specific vcpu initialization
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (21 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 22/25] KVM: TDX: create/free TDX vcpu structure Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre,
	Isaku Yamahata, Sean Christopherson, Adrian Hunter

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TD guest vcpu needs TDX-specific initialization before running.  Repurpose
KVM_MEMORY_ENCRYPT_OP at vcpu scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Drop dummy tdx_vcpu_reset() (Binbin)
 - Calculate nr_vcpu_tcds_pages on init (Nikolay)
 - Use vcpu_tdcx naming instead of just tdcx (Yuan)
 - No need for is_td_vcpu_created() (Rick)
 - No need for is_td_finalized() (Rick)
 - Export functions used (Binbin)
 - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
   wrappers (Kai)
 - Add TD state handling (Tony)
 - Clarify comment wrt is_hkid_assigned() in tdx_vcpu_free() (Yuan)
 - Fix error paths in tdx_td_vcpu_init() (Yuan)
 - Do not unnecessarily leak tdx->tdvpr_pa in tdx_vcpu_free() (Yuan)

uAPI breakout v1:
 - Support FEATURES0_TOPOLOGY_ENUM
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Remove WARN_ON_ONCE() in tdx_vcpu_free().
   WARN_ON_ONCE(vcpu->cpu != -1), WARN_ON_ONCE(tdx->tdvpx_pa),
   WARN_ON_ONCE(tdx->tdvpr_pa)
 - Remove KVM_BUG_ON() in tdx_vcpu_reset().
 - Remove duplicate "tdx->tdvpr_pa=" lines
 - Rename tdvpx to tdcx as it is confusing, follow spec change for same
   reason (Isaku)
 - Updates from seamcall overhaul (Kai)
 - Rename error->hw_error
 - Change using tdx_info to using exported 'tdx_sysinfo' pointer in
   tdx_td_vcpu_init().
 - Remove code for the old (no longer existing) tdx_module_setup().
 - Use a new wrapper tdx_sysinfo_nr_tdcx_pages() to replace
   tdx_info->nr_tdcx_pages.
 - Combine the two for loops in tdx_td_vcpu_init() (Chao)
 - Add more line breaks into tdx_td_vcpu_init() for readability (Tony)
 - Drop local tdcx_pa in tdx_td_vcpu_init() (Rick)
 - Drop local tdvpr_pa in tdx_td_vcpu_init() (Rick)

v18:
 - Use tdh_sys_rd() instead of struct tdsysinfo_struct.
 - Rename tdx_reclaim_td_page() => tdx_reclaim_control_page()
 - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
---
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   1 +
 arch/x86/include/uapi/asm/kvm.h    |   1 +
 arch/x86/kvm/vmx/main.c            |   9 ++
 arch/x86/kvm/vmx/tdx.c             | 182 ++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h             |  13 ++-
 arch/x86/kvm/vmx/tdx_arch.h        |   2 +
 arch/x86/kvm/vmx/x86_ops.h         |   4 +
 arch/x86/kvm/x86.c                 |   8 ++
 9 files changed, 219 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e7bd7867cb94..ec1b1b39c6b3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -125,6 +125,7 @@ KVM_X86_OP(enable_smi_window)
 #endif
 KVM_X86_OP_OPTIONAL(dev_get_attr)
 KVM_X86_OP(mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d8478e103f07..dfa89a5d15ef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1827,6 +1827,7 @@ struct kvm_x86_ops {
 
 	int (*dev_get_attr)(u32 group, u64 attr, u64 *val);
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 892e16bd7430..2cfec4b42b9d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -930,6 +930,7 @@ struct kvm_hyperv_eventfd {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0548d54eb055..d28ffddd766f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -106,6 +106,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	return tdx_vm_ioctl(kvm, argp);
 }
 
+static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_ioctl(vcpu, argp);
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -260,6 +268,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.get_untagged_addr = vmx_get_untagged_addr,
 
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
+	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 479ffb8f41c8..9008db6cf3b4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -419,6 +419,7 @@ int tdx_vm_init(struct kvm *kvm)
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
 	if (kvm_tdx->state != TD_STATE_INITIALIZED)
 		return -EIO;
@@ -442,12 +443,42 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
 		vcpu->arch.xfd_no_write_intercept = true;
 
+	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
+
 	return 0;
 }
 
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
-	/* This is stub for now.  More logic will come. */
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	/*
+	 * It is not possible to reclaim pages while hkid is assigned. It might
+	 * be assigned if:
+	 * 1. the TD VM is being destroyed but freeing hkid failed, in which
+	 * case the pages are leaked
+	 * 2. TD VCPU creation failed and this is on the error path, in which
+	 * case there is nothing to do anyway
+	 */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (tdx->tdcx_pa) {
+		for (i = 0; i < kvm_tdx->nr_vcpu_tdcx_pages; i++) {
+			if (tdx->tdcx_pa[i])
+				tdx_reclaim_control_page(tdx->tdcx_pa[i]);
+		}
+		kfree(tdx->tdcx_pa);
+		tdx->tdcx_pa = NULL;
+	}
+	if (tdx->tdvpr_pa) {
+		tdx_reclaim_control_page(tdx->tdvpr_pa);
+		tdx->tdvpr_pa = 0;
+	}
+
+	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
 }
 
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -657,6 +688,9 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 	tdr_pa = __pa(va);
 
 	kvm_tdx->nr_tdcs_pages = tdx_sysinfo->td_ctrl.tdcs_base_size / PAGE_SIZE;
+	/* TDVPS = TDVPR(4K page) + TDCX(multiple 4K pages), -1 for TDVPR. */
+	kvm_tdx->nr_vcpu_tdcx_pages = tdx_sysinfo->td_ctrl.tdvps_base_size / PAGE_SIZE - 1;
+
 	tdcs_pa = kcalloc(kvm_tdx->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
 			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!tdcs_pa)
@@ -936,6 +970,152 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	return r;
 }
 
+/* VMM can pass one 64-bit auxiliary value to vcpu via RCX for guest BIOS. */
+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
+{
+	const struct tdx_sys_info_features *modinfo = &tdx_sysinfo->features;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	unsigned long va;
+	int ret, i;
+	u64 err;
+
+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!va)
+		return -ENOMEM;
+	tdx->tdvpr_pa = __pa(va);
+
+	tdx->tdcx_pa = kcalloc(kvm_tdx->nr_vcpu_tdcx_pages, sizeof(*tdx->tdcx_pa),
+			       GFP_KERNEL_ACCOUNT);
+	if (!tdx->tdcx_pa) {
+		ret = -ENOMEM;
+		goto free_tdvpr;
+	}
+
+	for (i = 0; i < kvm_tdx->nr_vcpu_tdcx_pages; i++) {
+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
+		if (!va) {
+			ret = -ENOMEM;
+			goto free_tdcx;
+		}
+		tdx->tdcx_pa[i] = __pa(va);
+	}
+
+	err = tdh_vp_create(kvm_tdx->tdr_pa, tdx->tdvpr_pa);
+	if (KVM_BUG_ON(err, vcpu->kvm)) {
+		ret = -EIO;
+		pr_tdx_error(TDH_VP_CREATE, err);
+		goto free_tdcx;
+	}
+
+	for (i = 0; i < kvm_tdx->nr_vcpu_tdcx_pages; i++) {
+		err = tdh_vp_addcx(tdx->tdvpr_pa, tdx->tdcx_pa[i]);
+		if (KVM_BUG_ON(err, vcpu->kvm)) {
+			pr_tdx_error(TDH_VP_ADDCX, err);
+			/*
+			 * Pages already added are reclaimed by the vcpu_free
+			 * method, but the rest are freed here.
+			 */
+			for (; i < kvm_tdx->nr_vcpu_tdcx_pages; i++) {
+				free_page((unsigned long)__va(tdx->tdcx_pa[i]));
+				tdx->tdcx_pa[i] = 0;
+			}
+			return -EIO;
+		}
+	}
+
+	if (modinfo->tdx_features0 & MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM)
+		err = tdh_vp_init_apicid(tdx->tdvpr_pa, vcpu_rcx, vcpu->vcpu_id);
+	else
+		err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
+
+	if (KVM_BUG_ON(err, vcpu->kvm)) {
+		pr_tdx_error(TDH_VP_INIT, err);
+		return -EIO;
+	}
+
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+	return 0;
+
+free_tdcx:
+	for (i = 0; i < kvm_tdx->nr_vcpu_tdcx_pages; i++) {
+		if (tdx->tdcx_pa[i])
+			free_page((unsigned long)__va(tdx->tdcx_pa[i]));
+		tdx->tdcx_pa[i] = 0;
+	}
+	kfree(tdx->tdcx_pa);
+	tdx->tdcx_pa = NULL;
+
+free_tdvpr:
+	if (tdx->tdvpr_pa)
+		free_page((unsigned long)__va(tdx->tdvpr_pa));
+	tdx->tdvpr_pa = 0;
+
+	return ret;
+}
+
+static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
+{
+	struct msr_data apic_base_msr;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int ret;
+
+	if (cmd->flags)
+		return -EINVAL;
+
+	if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+		return -EINVAL;
+
+	/*
+	 * As TDX requires X2APIC, set local apic mode to X2APIC.  User space
+	 * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
+	 * KVM_SET_CPUID2.  Otherwise kvm_set_apic_base() will fail.
+	 */
+	apic_base_msr = (struct msr_data) {
+		.host_initiated = true,
+		.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
+		(kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
+	};
+	if (kvm_set_apic_base(vcpu, &apic_base_msr))
+		return -EINVAL;
+
+	ret = tdx_td_vcpu_init(vcpu, (u64)cmd->data);
+	if (ret)
+		return ret;
+
+	tdx->state = VCPU_TD_STATE_INITIALIZED;
+
+	return 0;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm_tdx_cmd cmd;
+	int ret;
+
+	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
+		return -EINVAL;
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.hw_error)
+		return -EINVAL;
+
+	switch (cmd.id) {
+	case KVM_TDX_INIT_VCPU:
+		ret = tdx_vcpu_init(vcpu, &cmd);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 static int tdx_online_cpu(unsigned int cpu)
 {
 	unsigned long flags;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1fcb7c1b078d..1b78a7ea988e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -27,15 +27,26 @@ struct kvm_tdx {
 	u64 xfam;
 	int hkid;
 	u8 nr_tdcs_pages;
+	u8 nr_vcpu_tdcx_pages;
 
 	u64 tsc_offset;
 
 	enum kvm_tdx_state state;
 };
 
+/* TDX module vCPU states */
+enum vcpu_tdx_state {
+	VCPU_TD_STATE_UNINITIALIZED = 0,
+	VCPU_TD_STATE_INITIALIZED,
+};
+
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
-	/* TDX specific members follow. */
+
+	unsigned long tdvpr_pa;
+	unsigned long *tdcx_pa;
+
+	enum vcpu_tdx_state state;
 };
 
 static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 84af7666e958..9d41699e66a2 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -155,4 +155,6 @@ struct td_params {
 #define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
 #define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
 
+#define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM	BIT_ULL(20)
+
 #endif /* __KVM_X86_TDX_ARCH_H */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 107c60ac94f4..4739891858ea 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -127,6 +127,8 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 #else
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
@@ -136,6 +138,8 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 95a10c7bc507..92de7ebf2cee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -698,6 +698,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	kvm_recalculate_apic_map(vcpu->kvm);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
 /*
  * Handle a fault on a hardware virtualization (VMX or SVM) instruction.
@@ -6308,6 +6309,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_SET_DEVICE_ATTR:
 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -ENOTTY;
+		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
+			goto out;
+		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
@@ -12663,6 +12670,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
 {
 	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);
 
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
 {
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (22 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 23/25] KVM: TDX: Do TDX specific vcpu initialization Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-11-01  6:39   ` Binbin Wu
                     ` (2 more replies)
  2024-10-30 19:00 ` [PATCH v2 25/25] KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level Rick Edgecombe
                   ` (3 subsequent siblings)
  27 siblings, 3 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

From: Xiaoyao Li <xiaoyao.li@intel.com>

Implement an IOCTL to allow userspace to read the CPUID bit values for a
configured TD.

The TDX module doesn't provide the ability to set all CPUID bits. Instead
some are configured indirectly, or have fixed values. But it does allow
for the final resulting CPUID bits to be read. This information will be
useful for userspace to understand the configuration of the TD, and set
KVM's copy via KVM_SET_CPUID2.
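
As a rough illustration of the read path, the sketch below builds the TD
metadata field ID for one CPUID leaf/sub-leaf pair, following the bit layout
documented in tdx_read_cpuid() in this patch. It is a standalone userspace
approximation: cpuid_field_id() and its element parameter are names assumed
for this sketch only and do not appear in the patch, which instead increments
field_id to reach the second element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Base field ID for TD-scope CPUID values, as defined by this patch. */
#define TD_MD_FIELD_ID_CPUID_VALUES 0x9410000300000000ULL

/*
 * Build the metadata field ID for one CPUID (leaf, sub-leaf) pair.
 * element selects the half to read: 0 for EBX:EAX, 1 for EDX:ECX.
 * cpuid_field_id() is a hypothetical name used only for this sketch.
 */
static uint64_t cpuid_field_id(uint32_t leaf, uint32_t sub_leaf,
			       bool sub_leaf_set, unsigned int element)
{
	uint64_t field_id = TD_MD_FIELD_ID_CPUID_VALUES;

	assert(element <= 1);

	field_id |= (uint64_t)((leaf >> 31) & 1) << 16;	/* LEAF_31 */
	field_id |= (uint64_t)(leaf & 0x7f) << 9;		/* LEAF_6_0 */
	if (sub_leaf_set)
		field_id |= (uint64_t)(sub_leaf & 0x7f) << 1;	/* SUBLEAF_6_0 */
	else
		field_id |= 0x1fe;	/* SUBLEAF_NA = 1, SUBLEAF_6_0 all-1 */
	field_id |= element;					/* ELEMENT_I */

	return field_id;
}
```

Each 64-bit value read through such a field ID carries two registers: the
low 32 bits hold EAX (or ECX) and the high 32 bits hold EBX (or EDX).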

To prevent userspace from starting to use features that might not have KVM
support yet, filter the reported values by KVM's supported CPUID bits.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Improve error path for tdx_vcpu_get_cpuid() (Xu)
 - Drop unused cpuid in struct kvm_tdx (Xu)
 - Rip out cpuid bit filtering
 - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
   wrappers (Kai)
 - Add mmu.h for kvm_gfn_direct_bits() (Binbin)
 - Drop unused nr_premapped (Tao)
 - Fix formatting for tdx_vcpu_get_cpuid_leaf() (Tony)
 - Use helpers for phys_addr_bits (Paolo)

uAPI breakout v1:
 - New patch
---
 arch/x86/include/uapi/asm/kvm.h |   1 +
 arch/x86/kvm/vmx/tdx.c          | 167 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_arch.h     |   5 +
 arch/x86/kvm/vmx/tdx_errno.h    |   1 +
 4 files changed, 174 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2cfec4b42b9d..36fa03376581 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_GET_CPUID,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9008db6cf3b4..1feb3307fd70 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
 #include <linux/cpu.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
+#include "mmu.h"
 #include "x86_ops.h"
 #include "tdx.h"
 
@@ -857,6 +858,94 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 	return ret;
 }
 
+static u64 tdx_td_metadata_field_read(struct kvm_tdx *tdx, u64 field_id,
+				      u64 *data)
+{
+	u64 err;
+
+	err = tdh_mng_rd(tdx->tdr_pa, field_id, data);
+
+	return err;
+}
+
+#define TDX_MD_UNREADABLE_LEAF_MASK	GENMASK(30, 7)
+#define TDX_MD_UNREADABLE_SUBLEAF_MASK	GENMASK(31, 7)
+
+static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
+			  bool sub_leaf_set, struct kvm_cpuid_entry2 *out)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	u64 field_id = TD_MD_FIELD_ID_CPUID_VALUES;
+	u64 ebx_eax, edx_ecx;
+	u64 err = 0;
+
+	if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
+	    sub_leaf_set & TDX_MD_UNREADABLE_SUBLEAF_MASK)
+		return -EINVAL;
+
+	/*
+	 * bit 23:17, RESERVED: reserved, must be 0;
+	 * bit 16,    LEAF_31: leaf number bit 31;
+	 * bit 15:9,  LEAF_6_0: leaf number bits 6:0, leaf bits 30:7 are
+	 *                      implicitly 0;
+	 * bit 8,     SUBLEAF_NA: sub-leaf not applicable flag;
+	 * bit 7:1,   SUBLEAF_6_0: sub-leaf number bits 6:0. If SUBLEAF_NA is 1,
+	 *                         the SUBLEAF_6_0 is all-1.
+	 *                         sub-leaf bits 31:7 are implicitly 0;
+	 * bit 0,     ELEMENT_I: Element index within field;
+	 */
+	field_id |= ((leaf & 0x80000000) ? 1 : 0) << 16;
+	field_id |= (leaf & 0x7f) << 9;
+	if (sub_leaf_set)
+		field_id |= (sub_leaf & 0x7f) << 1;
+	else
+		field_id |= 0x1fe;
+
+	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &ebx_eax);
+	if (err) //TODO check for specific errors
+		goto err_out;
+
+	out->eax = (u32) ebx_eax;
+	out->ebx = (u32) (ebx_eax >> 32);
+
+	field_id++;
+	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &edx_ecx);
+	/*
+	 * It's weird that reading edx_ecx fails while reading ebx_eax
+	 * succeeded.
+	 */
+	if (WARN_ON_ONCE(err))
+		goto err_out;
+
+	out->ecx = (u32) edx_ecx;
+	out->edx = (u32) (edx_ecx >> 32);
+
+	out->function = leaf;
+	out->index = sub_leaf;
+	out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
+
+	/*
+	 * Work around missing support on old TDX modules, fetch
+	 * guest maxpa from gfn_direct_bits.
+	 */
+	if (leaf == 0x80000008) {
+		gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
+		unsigned int g_maxpa = __ffs(gpa_bits) + 1;
+
+		out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
+	}
+
+	return 0;
+
+err_out:
+	out->eax = 0;
+	out->ebx = 0;
+	out->ecx = 0;
+	out->edx = 0;
+
+	return -EIO;
+}
+
 static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1055,6 +1144,81 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	return ret;
 }
 
+/* Sometimes reads multiple subleafs. Return how many entries were written. */
+static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf, int max_cnt,
+				   struct kvm_cpuid_entry2 *output_e)
+{
+	int i;
+
+	if (!max_cnt)
+		return 0;
+
+	/* First try without a subleaf */
+	if (!tdx_read_cpuid(vcpu, leaf, 0, false, output_e))
+		return 1;
+
+	/*
+	 * If the try without a subleaf failed, try reading subleafs until
+	 * failure. The TDX module only supports 6 bits of subleaf index.
+	 */
+	for (i = 0; i < 0b111111; i++) {
+		if (i > max_cnt)
+			goto out;
+
+		/* Keep reading subleafs until there is a failure. */
+		if (tdx_read_cpuid(vcpu, leaf, i, true, output_e))
+			return i;
+
+		output_e++;
+	}
+
+out:
+	return i;
+}
+
+static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_cpuid2 __user *output, *td_cpuid;
+	struct kvm_cpuid_entry2 *output_e;
+	int r = 0, i = 0, leaf;
+
+	output = u64_to_user_ptr(cmd->data);
+	td_cpuid = kzalloc(sizeof(*td_cpuid) +
+			sizeof(output->entries[0]) * KVM_MAX_CPUID_ENTRIES,
+			GFP_KERNEL);
+	if (!td_cpuid)
+		return -ENOMEM;
+
+	for (leaf = 0; leaf <= 0x1f; leaf++) {
+		output_e = &td_cpuid->entries[i];
+		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
+					     KVM_MAX_CPUID_ENTRIES - i - 1,
+					     output_e);
+	}
+
+	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
+		output_e = &td_cpuid->entries[i];
+		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
+					     KVM_MAX_CPUID_ENTRIES - i - 1,
+					     output_e);
+	}
+
+	td_cpuid->nent = i;
+
+	if (copy_to_user(output, td_cpuid, sizeof(*output))) {
+		r = -EFAULT;
+		goto out;
+	}
+	if (copy_to_user(output->entries, td_cpuid->entries,
+			 td_cpuid->nent * sizeof(struct kvm_cpuid_entry2)))
+		r = -EFAULT;
+
+out:
+	kfree(td_cpuid);
+
+	return r;
+}
+
 static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
 	struct msr_data apic_base_msr;
@@ -1108,6 +1272,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	case KVM_TDX_INIT_VCPU:
 		ret = tdx_vcpu_init(vcpu, &cmd);
 		break;
+	case KVM_TDX_GET_CPUID:
+		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 9d41699e66a2..d80ec118834e 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -157,4 +157,9 @@ struct td_params {
 
 #define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM	BIT_ULL(20)
 
+/*
+ * TD scope metadata field ID.
+ */
+#define TD_MD_FIELD_ID_CPUID_VALUES		0x9410000300000000ULL
+
 #endif /* __KVM_X86_TDX_ARCH_H */
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index dc3fa2a58c2c..f9dbb3a065cc 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -23,6 +23,7 @@
 #define TDX_FLUSHVP_NOT_DONE			0x8000082400000000ULL
 #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
 #define TDX_EPT_ENTRY_STATE_INCORRECT		0xC0000B0D00000000ULL
+#define TDX_METADATA_FIELD_NOT_READABLE		0xC0000C0200000000ULL
 
 /*
  * TDX module operand ID, appears in 31:0 part of error code as
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
@ 2024-11-01  6:39   ` Binbin Wu
  2024-11-01 16:03     ` Edgecombe, Rick P
  2025-01-09 11:07   ` Francesco Lavra
  2025-01-10  4:47   ` Xiaoyao Li
  2 siblings, 1 reply; 103+ messages in thread
From: Binbin Wu @ 2024-11-01  6:39 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre




On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
[...]
> +
> +#define TDX_MD_UNREADABLE_LEAF_MASK	GENMASK(30, 7)
> +#define TDX_MD_UNREADABLE_SUBLEAF_MASK	GENMASK(31, 7)
> +
> +static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
> +			  bool sub_leaf_set, struct kvm_cpuid_entry2 *out)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	u64 field_id = TD_MD_FIELD_ID_CPUID_VALUES;
> +	u64 ebx_eax, edx_ecx;
> +	u64 err = 0;
> +
> +	if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> +	    sub_leaf_set & TDX_MD_UNREADABLE_SUBLEAF_MASK)
> +		return -EINVAL;
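This check looks wrong: it tests sub_leaf against the leaf mask and the
bool sub_leaf_set against the sub-leaf mask.
Should it be the following?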

+	if (leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
+	    sub_leaf & TDX_MD_UNREADABLE_SUBLEAF_MASK)
+		return -EINVAL;
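
A minimal standalone sketch of the difference between the two checks,
assuming userspace stand-ins for GENMASK(30, 7) and GENMASK(31, 7). The
function names are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's GENMASK(30, 7) and GENMASK(31, 7). */
#define UNREADABLE_LEAF_MASK	0x7fffff80u
#define UNREADABLE_SUBLEAF_MASK	0xffffff80u

/* Check as posted: sub_leaf against the leaf mask, bool against the other. */
static bool rejected_as_posted(uint32_t leaf, uint32_t sub_leaf,
			       bool sub_leaf_set)
{
	(void)leaf;	/* the posted check never looks at the leaf */
	return (sub_leaf & UNREADABLE_LEAF_MASK) ||
	       ((uint32_t)sub_leaf_set & UNREADABLE_SUBLEAF_MASK);
}

/* Check as suggested: each mask is applied to the value it describes. */
static bool rejected_as_suggested(uint32_t leaf, uint32_t sub_leaf)
{
	return (leaf & UNREADABLE_LEAF_MASK) ||
	       (sub_leaf & UNREADABLE_SUBLEAF_MASK);
}

/* Small demonstration of inputs where the two checks disagree. */
static void demo(void)
{
	/* Leaf 0x100 cannot be encoded, but the posted check accepts it. */
	assert(!rejected_as_posted(0x100, 0, false));
	assert(rejected_as_suggested(0x100, 0));

	/* A bool is 0 or 1, so masking it with bits 31:7 is always 0. */
	assert(!rejected_as_posted(0, 0x80000000u, true));
	assert(rejected_as_suggested(0, 0x80000000u));
}
```

With the suggested form, leaf 0x80000008 with sub-leaf 0x7f is still
accepted, since leaf bit 31 and sub-leaf bits 6:0 are encodable.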


> +
> +	/*
> +	 * bit 23:17, REVSERVED: reserved, must be 0;
> +	 * bit 16,    LEAF_31: leaf number bit 31;
> +	 * bit 15:9,  LEAF_6_0: leaf number bits 6:0, leaf bits 30:7 are
> +	 *                      implicitly 0;
> +	 * bit 8,     SUBLEAF_NA: sub-leaf not applicable flag;
> +	 * bit 7:1,   SUBLEAF_6_0: sub-leaf number bits 6:0. If SUBLEAF_NA is 1,
> +	 *                         the SUBLEAF_6_0 is all-1.
> +	 *                         sub-leaf bits 31:7 are implicitly 0;
> +	 * bit 0,     ELEMENT_I: Element index within field;
> +	 */
> +	field_id |= ((leaf & 0x80000000) ? 1 : 0) << 16;
> +	field_id |= (leaf & 0x7f) << 9;
> +	if (sub_leaf_set)
> +		field_id |= (sub_leaf & 0x7f) << 1;
> +	else
> +		field_id |= 0x1fe;
> +
> +	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &ebx_eax);
> +	if (err) //TODO check for specific errors
> +		goto err_out;
> +
> +	out->eax = (u32) ebx_eax;
> +	out->ebx = (u32) (ebx_eax >> 32);
> +
> +	field_id++;
> +	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &edx_ecx);
> +	/*
> +	 * It's weird that reading edx_ecx fails while reading ebx_eax
> +	 * succeeded.
> +	 */
> +	if (WARN_ON_ONCE(err))
> +		goto err_out;
> +
> +	out->ecx = (u32) edx_ecx;
> +	out->edx = (u32) (edx_ecx >> 32);
> +
> +	out->function = leaf;
> +	out->index = sub_leaf;
> +	out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
> +
> +	/*
> +	 * Work around missing support on old TDX modules, fetch
> +	 * guest maxpa from gfn_direct_bits.
> +	 */
> +	if (leaf == 0x80000008) {
> +		gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> +		unsigned int g_maxpa = __ffs(gpa_bits) + 1;
> +
> +		out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
> +	}
> +
> +	return 0;
> +
> +err_out:
> +	out->eax = 0;
> +	out->ebx = 0;
> +	out->ecx = 0;
> +	out->edx = 0;
> +
> +	return -EIO;
> +}
> +
>
[...]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2024-11-01  6:39   ` Binbin Wu
@ 2024-11-01 16:03     ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-01 16:03 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, binbin.wu@linux.intel.com
  Cc: Huang, Kai, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette

On Fri, 2024-11-01 at 14:39 +0800, Binbin Wu wrote:
> > +static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
> > +			  bool sub_leaf_set, struct kvm_cpuid_entry2 *out)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +	u64 field_id = TD_MD_FIELD_ID_CPUID_VALUES;
> > +	u64 ebx_eax, edx_ecx;
> > +	u64 err = 0;
> > +
> > +	if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> > +	    sub_leaf_set & TDX_MD_UNREADABLE_SUBLEAF_MASK)
> > +		return -EINVAL;
> It looks weird.
> Should be the following?
> 
> +	if (leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> +	    sub_leaf & TDX_MD_UNREADABLE_SUBLEAF_MASK)
> +		return -EINVAL;
> 

Yes, nice catch.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
  2024-11-01  6:39   ` Binbin Wu
@ 2025-01-09 11:07   ` Francesco Lavra
  2025-01-10  4:29     ` Xiaoyao Li
  2025-01-10  4:47   ` Xiaoyao Li
  2 siblings, 1 reply; 103+ messages in thread
From: Francesco Lavra @ 2025-01-09 11:07 UTC (permalink / raw)
  To: rick.p.edgecombe
  Cc: isaku.yamahata, kai.huang, kvm, linux-kernel, pbonzini,
	reinette.chatre, seanjc, tony.lindgren, xiaoyao.li, yan.y.zhao

On 2024-10-30 at 19:00, Rick Edgecombe wrote:
> @@ -1055,6 +1144,81 @@ static int tdx_td_vcpu_init(struct kvm_vcpu
> *vcpu, u64 vcpu_rcx)
>  	return ret;
>  }
>  
> +/* Sometimes reads multipple subleafs. Return how many enties were
> written. */
> +static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf,
> int max_cnt,
> +				   struct kvm_cpuid_entry2
> *output_e)
> +{
> +	int i;
> +
> +	if (!max_cnt)
> +		return 0;
> +
> +	/* First try without a subleaf */
> +	if (!tdx_read_cpuid(vcpu, leaf, 0, false, output_e))
> +		return 1;
> +
> +	/*
> +	 * If the try without a subleaf failed, try reading subleafs
> until
> +	 * failure. The TDX module only supports 6 bits of subleaf
> index.

It actually supports 7 bits, i.e. bits 6:0, so the limit below should
be 0b1111111.

> +	 */
> +	for (i = 0; i < 0b111111; i++) {
> +		if (i > max_cnt)
> +			goto out;

This will make this function return (max_cnt + 1) instead of max_cnt.
I think the code would be simpler if max_cnt was initialized to
min(max_cnt, 0x80) (because 0x7f is a supported subleaf index, as far
as I can tell), and the for() condition was changed to `i < max_cnt`.
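
A minimal sketch of that loop shape, assuming a hypothetical reader
callback in place of tdx_read_cpuid(). The names count_subleafs,
read_subleaf_fn and the two sample readers are made up for illustration,
and the initial "no subleaf" attempt is left out to keep the bound in focus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sub-leaf indices 0x00..0x7f are encodable (bits 6:0), so 0x80 in total. */
#define TDX_MAX_SUBLEAFS 0x80

/* Hypothetical reader: returns 0 on success, nonzero on failure. */
typedef int (*read_subleaf_fn)(uint32_t leaf, uint32_t sub_leaf);

/* Count readable sub-leaves without ever reporting more than max_cnt. */
static int count_subleafs(uint32_t leaf, int max_cnt, read_subleaf_fn rd)
{
	int i;

	assert(rd != NULL);

	/* Clamp once up front so the loop needs a single bound. */
	if (max_cnt > TDX_MAX_SUBLEAFS)
		max_cnt = TDX_MAX_SUBLEAFS;

	/* Keep reading sub-leaves until a read fails or the bound is hit. */
	for (i = 0; i < max_cnt; i++) {
		if (rd(leaf, (uint32_t)i))
			break;
	}

	return i;
}

/* Sample reader for the sketch: only sub-leaves 0..3 exist. */
static int reader_first_four(uint32_t leaf, uint32_t sub_leaf)
{
	(void)leaf;
	return sub_leaf >= 4;
}

/* Sample reader for the sketch: every sub-leaf read succeeds. */
static int reader_always_ok(uint32_t leaf, uint32_t sub_leaf)
{
	(void)leaf;
	(void)sub_leaf;
	return 0;
}
```

Because the bound is applied before the loop, the return value can never
exceed max_cnt, which is the property the posted `i > max_cnt` test lacks.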

> +		/* Keep reading subleafs until there is a failure.
> */
> +		if (tdx_read_cpuid(vcpu, leaf, i, true, output_e))
> +			return i;
> +
> +		output_e++;
> +	}
> +
> +out:
> +	return i;
> +}
> +
> +static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct
> kvm_tdx_cmd *cmd)
> +{
> +	struct kvm_cpuid2 __user *output, *td_cpuid;
> +	struct kvm_cpuid_entry2 *output_e;
> +	int r = 0, i = 0, leaf;
> +
> +	output = u64_to_user_ptr(cmd->data);
> +	td_cpuid = kzalloc(sizeof(*td_cpuid) +
> +			sizeof(output->entries[0]) *
> KVM_MAX_CPUID_ENTRIES,
> +			GFP_KERNEL);
> +	if (!td_cpuid)
> +		return -ENOMEM;
> +
> +	for (leaf = 0; leaf <= 0x1f; leaf++) {
> +		output_e = &td_cpuid->entries[i];
> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> +					     KVM_MAX_CPUID_ENTRIES -
> i - 1,

This should be KVM_MAX_CPUID_ENTRIES - i.

> +					     output_e);
> +	}
> +
> +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
> +		output_e = &td_cpuid->entries[i];
> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> +					     KVM_MAX_CPUID_ENTRIES -
> i - 1,

Same here.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-09 11:07   ` Francesco Lavra
@ 2025-01-10  4:29     ` Xiaoyao Li
  2025-01-10 10:34       ` Francesco Lavra
  0 siblings, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2025-01-10  4:29 UTC (permalink / raw)
  To: Francesco Lavra, rick.p.edgecombe
  Cc: isaku.yamahata, kai.huang, kvm, linux-kernel, pbonzini,
	reinette.chatre, seanjc, tony.lindgren, yan.y.zhao

On 1/9/2025 7:07 PM, Francesco Lavra wrote:
> On 2024-10-30 at 19:00, Rick Edgecombe wrote:
>> @@ -1055,6 +1144,81 @@ static int tdx_td_vcpu_init(struct kvm_vcpu
>> *vcpu, u64 vcpu_rcx)
>>   	return ret;
>>   }
>>   
>> +/* Sometimes reads multipple subleafs. Return how many enties were
>> written. */
>> +static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf,
>> int max_cnt,
>> +				   struct kvm_cpuid_entry2
>> *output_e)
>> +{
>> +	int i;
>> +
>> +	if (!max_cnt)
>> +		return 0;
>> +
>> +	/* First try without a subleaf */
>> +	if (!tdx_read_cpuid(vcpu, leaf, 0, false, output_e))
>> +		return 1;
>> +
>> +	/*
>> +	 * If the try without a subleaf failed, try reading subleafs
>> until
>> +	 * failure. The TDX module only supports 6 bits of subleaf
>> index.
> 
> It actually supports 7 bits, i.e. bits 6:0, so the limit below should
> be 0b1111111.

Nice catch!

>> +	 */
>> +	for (i = 0; i < 0b111111; i++) {
>> +		if (i > max_cnt)
>> +			goto out;
> 
> This will make this function return (max_cnt + 1) instead of max_cnt.
> I think the code would be simpler if max_cnt was initialized to
> min(max_cnt, 0x80) (because 0x7f is a supported subleaf index, as far
> as I can tell), and the for() condition was changed to `i < max_cnt`.

Looks better.

>> +		/* Keep reading subleafs until there is a failure.
>> */
>> +		if (tdx_read_cpuid(vcpu, leaf, i, true, output_e))
>> +			return i;
>> +
>> +		output_e++;

here the output_e++ can overflow the buffer.

>> +	}
>> +
>> +out:
>> +	return i;
>> +}
>> +
>> +static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct
>> kvm_tdx_cmd *cmd)
>> +{
>> +	struct kvm_cpuid2 __user *output, *td_cpuid;
>> +	struct kvm_cpuid_entry2 *output_e;
>> +	int r = 0, i = 0, leaf;
>> +
>> +	output = u64_to_user_ptr(cmd->data);
>> +	td_cpuid = kzalloc(sizeof(*td_cpuid) +
>> +			sizeof(output->entries[0]) *
>> KVM_MAX_CPUID_ENTRIES,
>> +			GFP_KERNEL);
>> +	if (!td_cpuid)
>> +		return -ENOMEM;
>> +
>> +	for (leaf = 0; leaf <= 0x1f; leaf++) {
>> +		output_e = &td_cpuid->entries[i];
>> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>> +					     KVM_MAX_CPUID_ENTRIES -
>> i - 1,
> 
> This should be KVM_MAX_CPUID_ENTRIES - i.

Nice catch!

>> +					     output_e);
>> +	}
>> +
>> +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
>> +		output_e = &td_cpuid->entries[i];
>> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>> +					     KVM_MAX_CPUID_ENTRIES -
>> i - 1,
> 
> Same here.
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-10  4:29     ` Xiaoyao Li
@ 2025-01-10 10:34       ` Francesco Lavra
  0 siblings, 0 replies; 103+ messages in thread
From: Francesco Lavra @ 2025-01-10 10:34 UTC (permalink / raw)
  To: Xiaoyao Li, rick.p.edgecombe
  Cc: isaku.yamahata, kai.huang, kvm, linux-kernel, pbonzini,
	reinette.chatre, seanjc, tony.lindgren, yan.y.zhao

On Fri, 2025-01-10 at 12:29 +0800, Xiaoyao Li wrote:
> On 1/9/2025 7:07 PM, Francesco Lavra wrote:
> > On 2024-10-30 at 19:00, Rick Edgecombe wrote:
> > > @@ -1055,6 +1144,81 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > >         return ret;
> > >   }
> > >   
> > > +/* Sometimes reads multiple subleafs. Return how many entries were written. */
> > > +static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf, int max_cnt,
> > > +                                  struct kvm_cpuid_entry2 *output_e)
> > > +{
> > > +       int i;
> > > +
> > > +       if (!max_cnt)
> > > +               return 0;
> > > +
> > > +       /* First try without a subleaf */
> > > +       if (!tdx_read_cpuid(vcpu, leaf, 0, false, output_e))
> > > +               return 1;
> > > +
> > > +       /*
> > > +        * If the try without a subleaf failed, try reading subleafs until
> > > +        * failure. The TDX module only supports 6 bits of subleaf index.
> > 
> > It actually supports 7 bits, i.e. bits 6:0, so the limit below should
> > be 0b1111111.
> 
> Nice catch!
> 
> > > +        */
> > > +       for (i = 0; i < 0b111111; i++) {
> > > +               if (i > max_cnt)
> > > +                       goto out;
> > 
> > This will make this function return (max_cnt + 1) instead of max_cnt.
> > I think the code would be simpler if max_cnt was initialized to
> > min(max_cnt, 0x80) (because 0x7f is a supported subleaf index, as far
> > as I can tell), and the for() condition was changed to `i < max_cnt`.
> 
> Looks better.

You could even simplify this function further by removing the 7-bit
limit altogether and relying on tdx_read_cpuid() returning failure when
the subleaf index is not supported (due to the
TDX_MD_UNREADABLE_SUBLEAF_MASK check).
> 
> > > +               /* Keep reading subleafs until there is a failure. */
> > > +               if (tdx_read_cpuid(vcpu, leaf, i, true, output_e))
> > > +                       return i;
> > > +
> > > +               output_e++;
> 
> here the output_e++ can overflow the buffer.

Not if the for() loop terminates when i reaches max_cnt.
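
To make the point concrete, here is a minimal userspace sketch of that loop
shape. The struct, the reader callback and four_subleafs() are stand-ins
invented for illustration and are not part of the patch; only the
`i < max_cnt` bound is taken from the discussion above.

```c
#include <stdint.h>

/* Stand-in for struct kvm_cpuid_entry2; only the fields used here. */
struct cpuid_entry {
	uint32_t function;
	uint32_t index;
};

/*
 * Hypothetical reader callback standing in for tdx_read_cpuid():
 * returns 0 on success, nonzero when the subleaf cannot be read.
 */
typedef int (*read_subleaf_fn)(uint32_t leaf, uint32_t subleaf,
			       struct cpuid_entry *out);

/*
 * Read subleafs 0, 1, 2, ... until the reader fails or max_cnt entries
 * have been written. The return value is the number of entries written
 * and can never exceed max_cnt, so out[] is never indexed past its end.
 */
static int read_subleafs(read_subleaf_fn reader, uint32_t leaf, int max_cnt,
			 struct cpuid_entry *out)
{
	int i;

	for (i = 0; i < max_cnt; i++) {
		if (reader(leaf, (uint32_t)i, &out[i]))
			break;
	}

	return i;
}

/* Example reader: pretends every leaf has exactly four subleafs. */
static int four_subleafs(uint32_t leaf, uint32_t subleaf,
			 struct cpuid_entry *out)
{
	if (subleaf >= 4)
		return -1;

	out->function = leaf;
	out->index = subleaf;
	return 0;
}
```

With this shape a zero or negative max_cnt simply writes nothing, so the
caller does not need a separate `!max_cnt` check.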


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
  2024-11-01  6:39   ` Binbin Wu
  2025-01-09 11:07   ` Francesco Lavra
@ 2025-01-10  4:47   ` Xiaoyao Li
  2025-01-21 20:24     ` Edgecombe, Rick P
  2025-01-21 23:19     ` Edgecombe, Rick P
  2 siblings, 2 replies; 103+ messages in thread
From: Xiaoyao Li @ 2025-01-10  4:47 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, reinette.chatre

On 10/31/2024 3:00 AM, Rick Edgecombe wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> Implement an IOCTL to allow userspace to read the CPUID bit values for a
> configured TD.
> 
> The TDX module doesn't provide the ability to set all CPUID bits. Instead
> some are configured indirectly, or have fixed values. But it does allow
> for the final resulting CPUID bits to be read. This information will be
> useful for userspace to understand the configuration of the TD, and set
> KVM's copy via KVM_SET_CPUID2.
> 
> To prevent userspace from starting to use features that might not have KVM
> support yet, filter the reported values by KVM's support CPUID bits.

This filtering is not implemented (it was ripped out in v2), so we need to
drop this sentence.

> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> uAPI breakout v2:
>   - Improve error path for tdx_vcpu_get_cpuid() (Xu)
>   - Drop unused cpuid in struct kvm_tdx (Xu)
>   - Rip out cpuid bit filtering
>   - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
>     wrappers (Kai)
>   - Add mmu.h for kvm_gfn_direct_bits() (Binbin)
>   - Drop unused nr_premapped (Tao)
>   - Fix formatting for tdx_vcpu_get_cpuid_leaf() (Tony)
>   - Use helpers for phys_addr_bits (Paolo)
> 
> uAPI breakout v1:
>   - New patch
> ---
>   arch/x86/include/uapi/asm/kvm.h |   1 +
>   arch/x86/kvm/vmx/tdx.c          | 167 ++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx_arch.h     |   5 +
>   arch/x86/kvm/vmx/tdx_errno.h    |   1 +
>   4 files changed, 174 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 2cfec4b42b9d..36fa03376581 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
>   	KVM_TDX_CAPABILITIES = 0,
>   	KVM_TDX_INIT_VM,
>   	KVM_TDX_INIT_VCPU,
> +	KVM_TDX_GET_CPUID,
>   
>   	KVM_TDX_CMD_NR_MAX,
>   };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9008db6cf3b4..1feb3307fd70 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2,6 +2,7 @@
>   #include <linux/cpu.h>
>   #include <asm/tdx.h>
>   #include "capabilities.h"
> +#include "mmu.h"
>   #include "x86_ops.h"
>   #include "tdx.h"
>   
> @@ -857,6 +858,94 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
>   	return ret;
>   }
>   
> +static u64 tdx_td_metadata_field_read(struct kvm_tdx *tdx, u64 field_id,
> +				      u64 *data)
> +{
> +	u64 err;
> +
> +	err = tdh_mng_rd(tdx->tdr_pa, field_id, data);
> +
> +	return err;
> +}
> +
> +#define TDX_MD_UNREADABLE_LEAF_MASK	GENMASK(30, 7)
> +#define TDX_MD_UNREADABLE_SUBLEAF_MASK	GENMASK(31, 7)
> +
> +static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
> +			  bool sub_leaf_set, struct kvm_cpuid_entry2 *out)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	u64 field_id = TD_MD_FIELD_ID_CPUID_VALUES;
> +	u64 ebx_eax, edx_ecx;
> +	u64 err = 0;
> +
> +	if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> +	    sub_leaf_set & TDX_MD_UNREADABLE_SUBLEAF_MASK)
> +		return -EINVAL;
> +
> +	/*
> +	 * bit 23:17, RESERVED: reserved, must be 0;
> +	 * bit 16,    LEAF_31: leaf number bit 31;
> +	 * bit 15:9,  LEAF_6_0: leaf number bits 6:0, leaf bits 30:7 are
> +	 *                      implicitly 0;
> +	 * bit 8,     SUBLEAF_NA: sub-leaf not applicable flag;
> +	 * bit 7:1,   SUBLEAF_6_0: sub-leaf number bits 6:0. If SUBLEAF_NA is 1,
> +	 *                         the SUBLEAF_6_0 is all-1.
> +	 *                         sub-leaf bits 31:7 are implicitly 0;
> +	 * bit 0,     ELEMENT_I: Element index within field;
> +	 */
> +	field_id |= ((leaf & 0x80000000) ? 1 : 0) << 16;
> +	field_id |= (leaf & 0x7f) << 9;
> +	if (sub_leaf_set)
> +		field_id |= (sub_leaf & 0x7f) << 1;
> +	else
> +		field_id |= 0x1fe;
> +
> +	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &ebx_eax);
> +	if (err) //TODO check for specific errors
> +		goto err_out;
> +
> +	out->eax = (u32) ebx_eax;
> +	out->ebx = (u32) (ebx_eax >> 32);
> +
> +	field_id++;
> +	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &edx_ecx);
> +	/*
> +	 * It's weird that reading edx_ecx fails while reading ebx_eax
> +	 * succeeded.
> +	 */
> +	if (WARN_ON_ONCE(err))
> +		goto err_out;
> +
> +	out->ecx = (u32) edx_ecx;
> +	out->edx = (u32) (edx_ecx >> 32);
> +
> +	out->function = leaf;
> +	out->index = sub_leaf;
> +	out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
> +
> +	/*
> +	 * Work around missing support on old TDX modules, fetch
> +	 * guest maxpa from gfn_direct_bits.
> +	 */
> +	if (leaf == 0x80000008) {
> +		gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> +		unsigned int g_maxpa = __ffs(gpa_bits) + 1;
> +
> +		out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
> +	}
> +
> +	return 0;
> +
> +err_out:
> +	out->eax = 0;
> +	out->ebx = 0;
> +	out->ecx = 0;
> +	out->edx = 0;
> +
> +	return -EIO;
> +}
> +
>   static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1055,6 +1144,81 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>   	return ret;
>   }
>   
> +/* Sometimes reads multiple subleafs. Return how many entries were written. */
> +static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf, int max_cnt,
> +				   struct kvm_cpuid_entry2 *output_e)
> +{
> +	int i;
> +
> +	if (!max_cnt)
> +		return 0;
> +
> +	/* First try without a subleaf */
> +	if (!tdx_read_cpuid(vcpu, leaf, 0, false, output_e))
> +		return 1;
> +
> +	/*
> +	 * If the try without a subleaf failed, try reading subleafs until
> +	 * failure. The TDX module only supports 6 bits of subleaf index.
> +	 */
> +	for (i = 0; i < 0b111111; i++) {
> +		if (i > max_cnt)
> +			goto out;
> +
> +		/* Keep reading subleafs until there is a failure. */
> +		if (tdx_read_cpuid(vcpu, leaf, i, true, output_e))
> +			return i;
> +
> +		output_e++;
> +	}
> +
> +out:
> +	return i;
> +}
> +
> +static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> +{
> +	struct kvm_cpuid2 __user *output, *td_cpuid;
> +	struct kvm_cpuid_entry2 *output_e;
> +	int r = 0, i = 0, leaf;
> +
> +	output = u64_to_user_ptr(cmd->data);
> +	td_cpuid = kzalloc(sizeof(*td_cpuid) +
> +			sizeof(output->entries[0]) * KVM_MAX_CPUID_ENTRIES,
> +			GFP_KERNEL);
> +	if (!td_cpuid)
> +		return -ENOMEM;
> +
> +	for (leaf = 0; leaf <= 0x1f; leaf++) {

The choice of 0x1f needs clarification here.

If the intent is to use the maximum leaf KVM can support, it should be 0x24
to align with __do_cpuid_func().

Alternatively, it can use the EAX value of leaf 0 returned by the TDX
module. That is the value the TDX module presents to the TD guest.

> +		output_e = &td_cpuid->entries[i];
> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> +					     output_e);
> +	}
> +
> +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
> +		output_e = &td_cpuid->entries[i];
> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> +					     output_e);

Although what gets passed in for max_cnt is

   KVM_MAX_CPUID_ENTRIES - i - 1

tdx_vcpu_get_cpuid_leaf() can return "max_cnt+1", i.e.,
KVM_MAX_CPUID_ENTRIES - i.

That makes i equal to KVM_MAX_CPUID_ENTRIES in the next round, and

   output_e = &td_cpuid->entries[i];

will point past the end of the buffer and access illegal memory.

There is a similar issue inside tdx_vcpu_get_cpuid_leaf(), as I replied in [*]

[*] 
https://lore.kernel.org/all/7574968a-f0e2-49d5-b740-2454a0f70bb6@intel.com/

> +	}
> +
> +	td_cpuid->nent = i;
> +
> +	if (copy_to_user(output, td_cpuid, sizeof(*output))) {
> +		r = -EFAULT;
> +		goto out;
> +	}
> +	if (copy_to_user(output->entries, td_cpuid->entries,
> +			 td_cpuid->nent * sizeof(struct kvm_cpuid_entry2)))
> +		r = -EFAULT;
> +
> +out:
> +	kfree(td_cpuid);
> +
> +	return r;
> +}
> +
>   static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>   {
>   	struct msr_data apic_base_msr;
> @@ -1108,6 +1272,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>   	case KVM_TDX_INIT_VCPU:
>   		ret = tdx_vcpu_init(vcpu, &cmd);
>   		break;
> +	case KVM_TDX_GET_CPUID:
> +		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index 9d41699e66a2..d80ec118834e 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -157,4 +157,9 @@ struct td_params {
>   
>   #define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM	BIT_ULL(20)
>   
> +/*
> + * TD scope metadata field ID.
> + */
> +#define TD_MD_FIELD_ID_CPUID_VALUES		0x9410000300000000ULL
> +
>   #endif /* __KVM_X86_TDX_ARCH_H */
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index dc3fa2a58c2c..f9dbb3a065cc 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -23,6 +23,7 @@
>   #define TDX_FLUSHVP_NOT_DONE			0x8000082400000000ULL
>   #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
>   #define TDX_EPT_ENTRY_STATE_INCORRECT		0xC0000B0D00000000ULL
> +#define TDX_METADATA_FIELD_NOT_READABLE		0xC0000C0200000000ULL
>   
>   /*
>    * TDX module operand ID, appears in 31:0 part of error code as


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-10  4:47   ` Xiaoyao Li
@ 2025-01-21 20:24     ` Edgecombe, Rick P
  2025-01-22  7:43       ` Xiaoyao Li
  2025-01-21 23:19     ` Edgecombe, Rick P
  1 sibling, 1 reply; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-21 20:24 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On Fri, 2025-01-10 at 12:47 +0800, Xiaoyao Li wrote:
> 0x1f needs clarification here.
> 
> If it's going to use the maximum leaf KVM can support, it should be 0x24 
> to align with __do_cpuid_func().
> 
> alternatively, it can use the EAX value of leaf 0 returned by TDX 
> module. That is the value TDX module presents to the TD guest.
> 
> > +		output_e = &td_cpuid->entries[i];
> > +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> > +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> > +					     output_e);
> > +	}
> > +
> > +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
> > +		output_e = &td_cpuid->entries[i];
> > +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> > +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> > +					     output_e);

Since we are not filtering by KVM supported features anymore, maybe just use the
max leaf for the host CPU, like:

@@ -2790,14 +2791,14 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
        if (!td_cpuid)
                return -ENOMEM;
 
-       for (leaf = 0; leaf <= 0x1f; leaf++) {
+       for (leaf = 0; leaf <= boot_cpu_data.cpuid_level; leaf++) {
                output_e = &td_cpuid->entries[i];
                i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
                                             KVM_MAX_CPUID_ENTRIES - i - 1,
                                             output_e);
        }
 
-       for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
+       for (leaf = 0x80000000; leaf <= boot_cpu_data.extended_cpuid_level; leaf++) {
                output_e = &td_cpuid->entries[i];
                i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
                                             KVM_MAX_CPUID_ENTRIES - i - 1,


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-21 20:24     ` Edgecombe, Rick P
@ 2025-01-22  7:43       ` Xiaoyao Li
  2025-01-23 19:44         ` Edgecombe, Rick P
  0 siblings, 1 reply; 103+ messages in thread
From: Xiaoyao Li @ 2025-01-22  7:43 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On 1/22/2025 4:24 AM, Edgecombe, Rick P wrote:
> On Fri, 2025-01-10 at 12:47 +0800, Xiaoyao Li wrote:
>> 0x1f needs clarification here.
>>
>> If it's going to use the maximum leaf KVM can support, it should be 0x24
>> to align with __do_cpuid_func().
>>
>> alternatively, it can use the EAX value of leaf 0 returned by TDX
>> module. That is the value TDX module presents to the TD guest.
>>
>>> +		output_e = &td_cpuid->entries[i];
>>> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>>> +					     KVM_MAX_CPUID_ENTRIES - i - 1,
>>> +					     output_e);
>>> +	}
>>> +
>>> +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
>>> +		output_e = &td_cpuid->entries[i];
>>> +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>>> +					     KVM_MAX_CPUID_ENTRIES - i - 1,
>>> +					     output_e);
> 
> Since we are not filtering by KVM supported features anymore, maybe just use the
> max leaf for the host CPU, like:

The host value does not match the value returned by the TDX module.
E.g., on my SPR machine, boot_cpu_data.cpuid_level is 0x20, while the
TDX module returns 0x23. Using the host value would at least fail to report
leaf 0x21 to userspace, which is always a valid leaf for a TD guest.

> @@ -2790,14 +2791,14 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu,
> struct kvm_tdx_cmd *cmd)
>          if (!td_cpuid)
>                  return -ENOMEM;
>   
> -       for (leaf = 0; leaf <= 0x1f; leaf++) {
> +       for (leaf = 0; leaf <= boot_cpu_data.cpuid_level; leaf++) {
>                  output_e = &td_cpuid->entries[i];
>                  i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>                                               KVM_MAX_CPUID_ENTRIES - i - 1,
>                                               output_e);
>          }
>   
> -       for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
> +       for (leaf = 0x80000000; leaf <= boot_cpu_data.extended_cpuid_level;
> leaf++) {
>                  output_e = &td_cpuid->entries[i];
>                  i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
>                                               KVM_MAX_CPUID_ENTRIES - i - 1,
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-22  7:43       ` Xiaoyao Li
@ 2025-01-23 19:44         ` Edgecombe, Rick P
  0 siblings, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-23 19:44 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com
  Cc: isaku.yamahata@gmail.com, kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Huang, Kai, Zhao, Yan Y,
	tony.lindgren@linux.intel.com

On Wed, 2025-01-22 at 15:43 +0800, Xiaoyao Li wrote:
> > Since we are not filtering by KVM supported features anymore, maybe just
> > use the max leaf for the host CPU, like:
> 
> host value is not matched with the value returned by TDX module.
> I.e., On my SPR machine, the boot_cpu_data.cpuid_level is 0x20, while 
> TDX module returns 0x23. It at least fails to report the leaf 0x21 to 
> userspace, which is a always valid leaf for TD guest.

Good point, we can use the cpuid level read from the TD.
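
Something like the following, as an untested userspace sketch. The reader
callback, struct cpuid_regs and fake_td_read() are stand-ins invented for
illustration; the only idea taken from this thread is bounding the loop by
EAX of the first leaf as read from the TD rather than by a constant or the
host value.

```c
#include <stdint.h>

struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical reader: returns 0 on success and fills *out. */
typedef int (*read_leaf_fn)(uint32_t leaf, struct cpuid_regs *out);

/*
 * Walk one CPUID range starting at base (0 or 0x80000000), using EAX of
 * the base leaf as the upper bound. Returns how many leaves were read
 * successfully, or -1 if the base leaf itself cannot be read.
 */
static int count_readable_leaves(read_leaf_fn read_leaf, uint32_t base)
{
	struct cpuid_regs regs;
	uint32_t leaf, max_leaf;
	int n = 1;	/* the base leaf, once it has been read */

	if (read_leaf(base, &regs))
		return -1;

	max_leaf = regs.eax;

	/* Increment inside the body so leaf cannot wrap past max_leaf. */
	for (leaf = base; leaf < max_leaf; ) {
		leaf++;
		if (!read_leaf(leaf, &regs))
			n++;
	}

	return n;
}

/* Fake TD: reports max leaf 0x23 and has one unreadable leaf, 0x22. */
static int fake_td_read(uint32_t leaf, struct cpuid_regs *out)
{
	if (leaf > 0x23 || leaf == 0x22)
		return -1;

	out->eax = (leaf == 0) ? 0x23 : 0;
	out->ebx = 0;
	out->ecx = 0;
	out->edx = 0;
	return 0;
}
```

Leaves above the host's cpuid_level, such as 0x21 in the SPR example, are
then still visited because the bound comes from the TD.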

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID
  2025-01-10  4:47   ` Xiaoyao Li
  2025-01-21 20:24     ` Edgecombe, Rick P
@ 2025-01-21 23:19     ` Edgecombe, Rick P
  1 sibling, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-21 23:19 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com,
	francescolavra.fl@gmail.com
  Cc: Huang, Kai, kvm@vger.kernel.org, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Zhao, Yan Y,
	isaku.yamahata@gmail.com, tony.lindgren@linux.intel.com

On Fri, 2025-01-10 at 12:47 +0800, Xiaoyao Li wrote:
> > +		output_e = &td_cpuid->entries[i];
> > +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> > +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> > +					     output_e);
> > +	}
> > +
> > +	for (leaf = 0x80000000; leaf <= 0x80000008; leaf++) {
> > +		output_e = &td_cpuid->entries[i];
> > +		i += tdx_vcpu_get_cpuid_leaf(vcpu, leaf,
> > +					     KVM_MAX_CPUID_ENTRIES - i - 1,
> > +					     output_e);
> 
> Though what gets passed in for max_cnt is
> 
>    KVM_MAX_CPUID_ENTRIES - i - 1
> 
> tdx_vcpu_get_cpuid_leaf() can return "max_cnt+1", i.e., 
> KVM_MAX_CPUID_ENTRIES - i.
> 
> Then, it makes next round i to be KVM_MAX_CPUID_ENTRIES, and
> 
>    output_e = &td_cpuid->entries[i];
> 
> will overflow the buffer and access illegal memory.
> 
> Similar issue inside tdx_vcpu_get_cpuid_leaf() as I replied in [*]
> 
> [*] 
> https://lore.kernel.org/all/7574968a-f0e2-49d5-b740-2454a0f70bb6@intel.com/

Per Francesco's comment in the other thread, I'm not sure there is an off-by-one
bug here. But in any case the code is too sensitive to issues like that.

In line with Francesco's other comment to move the subleaf checking into
tdx_read_cpuid(), I just changed it to pass around the real index and check
KVM_MAX_CPUID_ENTRIES in tdx_read_cpuid() too. It seems less elegant but easier
to read.
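
For illustration, the shape I mean is roughly the following. This is a
userspace sketch, not the actual change: MAX_ENTRIES stands in for
KVM_MAX_CPUID_ENTRIES, and struct cpuid_buf and append_entry() are
hypothetical names. The point is only that the bounds check sits next to the
write and works on the real buffer index.

```c
#include <stdint.h>

#define MAX_ENTRIES 8	/* stand-in for KVM_MAX_CPUID_ENTRIES */

struct cpuid_entry {
	uint32_t function;
	uint32_t index;
};

struct cpuid_buf {
	uint32_t nent;
	struct cpuid_entry entries[MAX_ENTRIES];
};

/*
 * Write one entry at *entry_index and advance the index. Refuses to
 * write once the buffer is full, so callers never compute a remaining
 * count themselves. Returns 0 on success, -1 if there is no room.
 */
static int append_entry(struct cpuid_buf *buf, int *entry_index,
			uint32_t leaf, uint32_t sub_leaf)
{
	if (*entry_index < 0 || *entry_index >= MAX_ENTRIES)
		return -1;

	buf->entries[*entry_index].function = leaf;
	buf->entries[*entry_index].index = sub_leaf;
	(*entry_index)++;
	buf->nent = (uint32_t)*entry_index;

	return 0;
}
```

Callers then just loop until append_entry() or the read fails, and there is
no "remaining entries" arithmetic left to get wrong.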

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 25/25] KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (23 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
@ 2024-10-30 19:00 ` Rick Edgecombe
  2024-10-31 19:21 ` [PATCH v2 00/25] TDX vCPU/VM creation Adrian Hunter
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 103+ messages in thread
From: Rick Edgecombe @ 2024-10-30 19:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: rick.p.edgecombe, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

From: Xiaoyao Li <xiaoyao.li@intel.com>

For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and
the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used
to configure the EPT level and GPAW.

Use max_gpa to determine the TDP level.
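
As a rough illustration of the two EAX fields and the level choice, here is
a simplified userspace model. The helper names are made up for this sketch,
and pick_tdp_level() only mirrors the decision in kvm_mmu_get_tdp_level()
under the assumption that the forced level, the maximum level and the
address width are passed in as plain integers.

```c
#include <stdint.h>

/* CPUID.0x80000008.EAX[7:0]: physical address bits (maxpa). */
static unsigned int maxpa_from_eax(uint32_t eax)
{
	return eax & 0xff;
}

/* CPUID.0x80000008.EAX[23:16]: guest physical address bits (max_gpa). */
static unsigned int max_gpa_from_eax(uint32_t eax)
{
	return (eax >> 16) & 0xff;
}

/*
 * A forced root level wins. Otherwise use 4 levels when 5-level paging
 * is available but the address width fits in 48 bits.
 */
static int pick_tdp_level(int forced_level, int max_level,
			  unsigned int addr_bits)
{
	if (forced_level)
		return forced_level;

	if (max_level == 5 && addr_bits <= 48)
		return 4;

	return max_level;
}
```

For example, EAX = 0x00300034 encodes maxpa = 52 and max_gpa = 48, so a TD
with that value would get 4-level TDP while the native width alone would
select 5 levels.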

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v2:
 - Use if else for cpuid_query_maxguestphyaddr() (Paolo)

uAPI breakout v1:
 - New patch
---
 arch/x86/kvm/cpuid.c   | 14 ++++++++++++++
 arch/x86/kvm/cpuid.h   |  1 +
 arch/x86/kvm/mmu/mmu.c |  9 ++++++++-
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 14be20e003f4..e7179ce8eadc 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -423,6 +423,20 @@ int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu)
 	return 36;
 }
 
+int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x80000000);
+	if (!best || best->eax < 0x80000008)
+		goto not_found;
+	best = kvm_find_cpuid_entry(vcpu, 0x80000008);
+	if (best)
+		return (best->eax >> 16) & 0xff;
+not_found:
+	return 0;
+}
+
 /*
  * This "raw" version returns the reserved GPA bits without any adjustments for
  * encryption technologies that usurp bits.  The raw mask should be used if and
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 00570227e2ae..61b839aa3548 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -37,6 +37,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 u32 xstate_required_size(u64 xstate_bv, bool compacted);
 
 int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
+int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu);
 u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu);
 
 static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9a0fbec33984..2e253a488949 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5474,12 +5474,19 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 
 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 {
+	int maxpa;
+
+	if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM)
+		maxpa = cpuid_query_maxguestphyaddr(vcpu);
+	else
+		maxpa = cpuid_maxphyaddr(vcpu);
+
 	/* tdp_root_level is architecture forced level, use it if nonzero */
 	if (tdp_root_level)
 		return tdp_root_level;
 
 	/* Use 5-level TDP if and only if it's useful/necessary. */
-	if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
+	if (max_tdp_level == 5 && maxpa <= 48)
 		return 4;
 
 	return max_tdp_level;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (24 preceding siblings ...)
  2024-10-30 19:00 ` [PATCH v2 25/25] KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level Rick Edgecombe
@ 2024-10-31 19:21 ` Adrian Hunter
  2024-11-11  9:49   ` Tony Lindgren
  2024-11-12 21:26   ` Edgecombe, Rick P
  2024-12-10 18:22 ` Paolo Bonzini
  2024-12-23 16:25 ` Paolo Bonzini
  27 siblings, 2 replies; 103+ messages in thread
From: Adrian Hunter @ 2024-10-31 19:21 UTC (permalink / raw)
  To: Rick Edgecombe, pbonzini, seanjc
  Cc: yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre

On 30/10/24 21:00, Rick Edgecombe wrote:
> Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits 
> from v1[0] have been applied and it’s ready to hand off to Paolo. A few 
> items remain that may be worth further discussion:
>  - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t 
>    been been tested.

It seems that for Intel PT we have no support for restoring host
state.  IA32_RTIT_* MSR preservation is Init(XFAM(8)), which means
the TDX Module sets the MSR to its RESET value after TD Entry/Exit.
So it seems to me XFAM(8) does need to be disabled until that is
supported.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-10-31 19:21 ` [PATCH v2 00/25] TDX vCPU/VM creation Adrian Hunter
@ 2024-11-11  9:49   ` Tony Lindgren
  2024-11-12  7:26     ` Adrian Hunter
  2024-11-12 21:26   ` Edgecombe, Rick P
  1 sibling, 1 reply; 103+ messages in thread
From: Tony Lindgren @ 2024-11-11  9:49 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre

On Thu, Oct 31, 2024 at 09:21:29PM +0200, Adrian Hunter wrote:
> On 30/10/24 21:00, Rick Edgecombe wrote:
> > Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits 
> > from v1[0] have been applied and it’s ready to hand off to Paolo. A few 
> > items remain that may be worth further discussion:
> >  - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t 
> >    been been tested.
> 
> It seems for Intel PT we have no support for restoring host
> state.  IA32_RTIT_* MSR preservation is Init(XFAM(8)) which means
> the TDX Module sets the MSR to its RESET value after TD Enty/Exit.
> So it seems to me XFAM(8) does need to be disabled until that is
> supported.

So for now, we should remove the PT bit from tdx_get_supported_xfam(),
but can still keep it in tdx_restore_host_xsave_state()?

Then for save/restore, maybe we can just use the pt_guest_enter() and
pt_guest_exit() also for TDX. Some additional checks are needed for
the pt_mode though as the TDX module always clears the state if PT is
enabled. And the PT_MODE_SYSTEM will be missing TDX enter/exit data
but might be otherwise usable.

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-11-11  9:49   ` Tony Lindgren
@ 2024-11-12  7:26     ` Adrian Hunter
  2024-11-12  9:57       ` Tony Lindgren
  0 siblings, 1 reply; 103+ messages in thread
From: Adrian Hunter @ 2024-11-12  7:26 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre

On 11/11/24 11:49, Tony Lindgren wrote:
> On Thu, Oct 31, 2024 at 09:21:29PM +0200, Adrian Hunter wrote:
>> On 30/10/24 21:00, Rick Edgecombe wrote:
>>> Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits 
>>> from v1[0] have been applied and it’s ready to hand off to Paolo. A few 
>>> items remain that may be worth further discussion:
>>>  - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t 
>>>    been been tested.
>>
>> It seems for Intel PT we have no support for restoring host
>> state.  IA32_RTIT_* MSR preservation is Init(XFAM(8)) which means
>> the TDX Module sets the MSR to its RESET value after TD Enty/Exit.
>> So it seems to me XFAM(8) does need to be disabled until that is
>> supported.
> 
> So for now, we should remove the PT bit from tdx_get_supported_xfam(),
> but can still keep it in tdx_restore_host_xsave_state()?

Yes

> 
> Then for save/restore, maybe we can just use the pt_guest_enter() and
> pt_guest_exit() also for TDX. Some additional checks are needed for
> the pt_mode though as the TDX module always clears the state if PT is
> enabled. And the PT_MODE_SYSTEM will be missing TDX enter/exit data
> but might be otherwise usable.

pt_guest_enter() / pt_guest_exit() are not suitable for TDX.  pt_mode
is not relevant for TDX because the TDX guest is always hidden from the
host behind SEAM.  However, restoring host MSRs is not the only issue.

The TDX Module does not validate Intel PT CPUID leaf 0x14
(except it must be all zero if Intel PT is not supported
i.e. if XFAM bit 8 is zero).  For invalid MSR accesses by the guest,
the TDX Module will inject #GP.  Host VMM could provide valid CPUID
to avoid that, but it would also need to be valid for the destination
platform if migration was to be attempted.

Disabling Intel PT for TDX for now also avoids that issue.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-11-12  7:26     ` Adrian Hunter
@ 2024-11-12  9:57       ` Tony Lindgren
  0 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2024-11-12  9:57 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Rick Edgecombe, pbonzini, seanjc, yan.y.zhao, isaku.yamahata,
	kai.huang, kvm, linux-kernel, xiaoyao.li, reinette.chatre

On Tue, Nov 12, 2024 at 09:26:36AM +0200, Adrian Hunter wrote:
> On 11/11/24 11:49, Tony Lindgren wrote:
> > On Thu, Oct 31, 2024 at 09:21:29PM +0200, Adrian Hunter wrote:
> >> On 30/10/24 21:00, Rick Edgecombe wrote:
> >>> Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits 
> >>> from v1[0] have been applied and it’s ready to hand off to Paolo. A few 
> >>> items remain that may be worth further discussion:
> >>>  - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t 
> >>>    been been tested.
> >>
> >> It seems for Intel PT we have no support for restoring host
> >> state.  IA32_RTIT_* MSR preservation is Init(XFAM(8)) which means
> >> the TDX Module sets the MSR to its RESET value after TD Enty/Exit.
> >> So it seems to me XFAM(8) does need to be disabled until that is
> >> supported.
> > 
> > So for now, we should remove the PT bit from tdx_get_supported_xfam(),
> > but can still keep it in tdx_restore_host_xsave_state()?
> 
> Yes
> 
> > 
> > Then for save/restore, maybe we can just use the pt_guest_enter() and
> > pt_guest_exit() also for TDX. Some additional checks are needed for
> > the pt_mode though as the TDX module always clears the state if PT is
> > enabled. And the PT_MODE_SYSTEM will be missing TDX enter/exit data
> > but might be otherwise usable.
> 
> pt_guest_enter() / pt_guest_exit() are not suitable for TDX.  pt_mode
> is not relevant for TDX because the TDX guest is always hidden from the
> host behind SEAM.  However, restoring host MSRs is not the only issue.
> 
> The TDX Module does not validate Intel PT CPUID leaf 0x14
> (except it must be all zero if Intel PT is not supported
> i.e. if XFAM bit 8 is zero).  For invalid MSR accesses by the guest,
> the TDX Module will inject #GP.  Host VMM could provide valid CPUID
> to avoid that, but it would also need to be valid for the destination
> platform if migration was to be attempted.
> 
> Disabling Intel PT for TDX for now also avoids that issue.

OK thanks for the detailed explanation.

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-10-31 19:21 ` [PATCH v2 00/25] TDX vCPU/VM creation Adrian Hunter
  2024-11-11  9:49   ` Tony Lindgren
@ 2024-11-12 21:26   ` Edgecombe, Rick P
  1 sibling, 0 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2024-11-12 21:26 UTC (permalink / raw)
  To: pbonzini@redhat.com, Hunter, Adrian, seanjc@google.com
  Cc: Huang, Kai, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette

On Thu, 2024-10-31 at 21:21 +0200, Adrian Hunter wrote:
> On 30/10/24 21:00, Rick Edgecombe wrote:
> > Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits 
> > from v1[0] have been applied and it’s ready to hand off to Paolo. A few 
> > items remain that may be worth further discussion:
> >   - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t
> >     been tested.
> 
> It seems for Intel PT we have no support for restoring host
> state.  IA32_RTIT_* MSR preservation is Init(XFAM(8)) which means
> the TDX Module sets the MSR to its RESET value after TD Entry/Exit.
> So it seems to me XFAM(8) does need to be disabled until that is
> supported.

Good point. Let's disable it and CET. We can try a fixup patch when these land
in kvm-coco-queue.
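
As a rough illustration of what disabling PT and CET in the supported XFAM could look like, here is a minimal standalone sketch. It assumes XFAM follows the XCR0/IA32_XSS bit layout (bit 8 for Intel PT, as mentioned above, and bits 11 and 12 for CET user and supervisor state). The helper name xfam_drop_untested() is hypothetical and is not the actual tdx_get_supported_xfam() implementation.

```c
#include <stdint.h>

/* Assumed bit positions, matching the XCR0/IA32_XSS state-component layout. */
#define XFEATURE_MASK_PT          (1ULL << 8)   /* Intel PT */
#define XFEATURE_MASK_CET_USER    (1ULL << 11)  /* CET user state */
#define XFEATURE_MASK_CET_KERNEL  (1ULL << 12)  /* CET supervisor state */

/*
 * Hypothetical helper: clear the XFAM bits for features that are untested
 * or lack host state restore, leaving all other bits unchanged.
 */
static uint64_t xfam_drop_untested(uint64_t xfam)
{
	return xfam & ~(XFEATURE_MASK_PT |
			XFEATURE_MASK_CET_USER |
			XFEATURE_MASK_CET_KERNEL);
}
```

With this kind of filter, a TD can never be configured with XFAM(8) set, so the question of restoring the host IA32_RTIT_* MSRs does not arise until that support is added.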

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (25 preceding siblings ...)
  2024-10-31 19:21 ` [PATCH v2 00/25] TDX vCPU/VM creation Adrian Hunter
@ 2024-12-10 18:22 ` Paolo Bonzini
  2024-12-23 16:25 ` Paolo Bonzini
  27 siblings, 0 replies; 103+ messages in thread
From: Paolo Bonzini @ 2024-12-10 18:22 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: pbonzini, seanjc, yan.y.zhao, isaku.yamahata, kai.huang, kvm,
	linux-kernel, tony.lindgren, xiaoyao.li, reinette.chatre

Applied to kvm-coco-queue, thanks.  Tomorrow I will go through the
changes and review.

Paolo


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
                   ` (26 preceding siblings ...)
  2024-12-10 18:22 ` Paolo Bonzini
@ 2024-12-23 16:25 ` Paolo Bonzini
  2025-01-04  1:43   ` Edgecombe, Rick P
  27 siblings, 1 reply; 103+ messages in thread
From: Paolo Bonzini @ 2024-12-23 16:25 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, yan.y.zhao, isaku.yamahata, kai.huang, kvm, linux-kernel,
	tony.lindgren, xiaoyao.li, reinette.chatre

On Wed, Oct 30, 2024 at 8:01 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> Hi,
>
> Here is v2 of TDX VM/vCPU creation series. As discussed earlier, non-nits
> from v1[0] have been applied and it’s ready to hand off to Paolo. A few
> items remain that may be worth further discussion:
>  - Disable CET/PT in tdx_get_supported_xfam(), as these features haven’t
>    been tested.
>  - The Retry loop around tdh_phymem_page_reclaim() in “KVM: TDX:
>    create/destroy VM structure” likely can be dropped.
>  - Drop support for TDX Modules that don’t support
>    MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM. [1]
>  - Type-safety in to_vmx()/to_tdx(). [2]

To sum up:

removed:
04: replaced by the "add wrapper functions for SEAMCALLs" subseries
06: not needed anymore, all logic for KeyID mgmt now in x86/virt/tdx
10: tdx_capabilities dropped, replaced mostly by 02
11: KVM_TDX_CAPABILITIES moved to patch 16
19: not needed anymore
20: was needed by patch 24
22: folded in other patches
24: left for later
25: left for later/for userspace

01/02:ok
03: need to change 32 to 128
04: ok
05/06/07/08/09/10: replaced with
https://lore.kernel.org/kvm/20241203010317.827803-2-rick.p.edgecombe@intel.com/
11: see the type safety comment above:
> The ugly part here is the type-unsafety of to_vmx/to_tdx.  We probably
> should add some "#pragma poison" of to_vmx/to_tdx: for example both can
> be poisoned in pmu_intel.c after the definition of
> vcpu_to_lbr_records(), while one of them can be poisoned in
> sgx.c/posted_intr.c/vmx.c/tdx.c.

12/13/14/15: ok
16/17: to review
18: not sure why the check against num_present_cpus() is needed?
19: ok
20: ok
21: ok

22: missing review comment from v1

> +     /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> +     if (!vcpu->arch.apic)
> +             return -EINVAL;

nit: Use kvm_apic_present()

23: ok

24: need to apply fix

-       if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
+       if (leaf & TDX_MD_UNREADABLE_LEAF_MASK ||

25: ok


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2024-12-23 16:25 ` Paolo Bonzini
@ 2025-01-04  1:43   ` Edgecombe, Rick P
  2025-01-05 21:32     ` Huang, Kai
                       ` (2 more replies)
  0 siblings, 3 replies; 103+ messages in thread
From: Edgecombe, Rick P @ 2025-01-04  1:43 UTC (permalink / raw)
  To: pbonzini@redhat.com
  Cc: seanjc@google.com, Huang, Kai, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org, Zhao, Yan Y,
	Chatre, Reinette

On Mon, 2024-12-23 at 17:25 +0100, Paolo Bonzini wrote:
> To sum up:
> 
> removed:
> 04: replaced by the "add wrapper functions for SEAMCALLs" subseries
> 06: not needed anymore, all logic for KeyID mgmt now in x86/virt/tdx
> 10: tdx_capabilities dropped, replaced mostly by 02

Sorry, what is this? Not from patch 10 "x86/virt/tdx: Add SEAMCALL wrappers for
TDX flush operations". What was dropped from which patch?

> 11: KVM_TDX_CAPABILITIES moved to patch 16
> 19: not needed anymore

I guess this is not referring to "KVM: TDX: initialize VM with TDX specific
parameters", so not sure which one is dropped.

> 20: was needed by patch 24
> 22: folded in other patches

> 24: left for later
> 25: left for later/for userspace
Ok.

I can't figure out what these numbers correspond to, but kvm-coco-queue
doesn't seem to have dropped any patches yet, so maybe it will make more sense
when I can take a look at the refresh there.

> 
> 01/02:ok
> 03: need to change 32 to 128
> 04: ok
> 05/06/07/08/09/10: replaced with
> https://lore.kernel.org/kvm/20241203010317.827803-2-rick.p.edgecombe@intel.com/
> 11: see the type safety comment above:
> > The ugly part here is the type-unsafety of to_vmx/to_tdx.  We probably
> > should add some "#pragma poison" of to_vmx/to_tdx: for example both can
> > be poisoned in pmu_intel.c after the definition of
> > vcpu_to_lbr_records(), while one of them can be poisoned in
> > sgx.c/posted_intr.c/vmx.c/tdx.c.

I left it off because you said "Not a strict requirement though." and gave it an
RB tag. Other stuff seemed higher priority. We can look at some options for a
follow-on patch if it lightens your load.

> 
> 12/13/14/15: ok
> 16/17: to review
> 18: not sure why the check against num_present_cpus() is needed?

The per-VM KVM_MAX_VCPUS will be min_t(int, kvm->max_vcpus, num_present_cpus()).
So if td_conf->max_vcpus_per_td < num_present_cpus(), then it might report
supporting more CPUs than actually supported by the TDX module.

As to why not just report td_conf->max_vcpus_per_td, that value is the max CPUs
that are supported by any platform the TDX module supports. So it is more about
what the TDX module supports than what userspace cares about (how many vCPUs
they can use).

I think we could probably get by without the check and blame the TDX module if
it does something strange. It seems safer ABI-wise to have the check. But we
are being a bit more cavalier around protecting against TDX supported CPUID bit
changes than originally planned, so the check here now seems inconsistent.
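
To make the relationship between the two values concrete, here is a small standalone sketch of the logic as described above. The function names are hypothetical stand-ins, and plain ints replace td_conf->max_vcpus_per_td, kvm->max_vcpus and num_present_cpus(); this is not the code from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical bring-up sanity check: the module-reported per-TD limit must
 * cover every present CPU, otherwise the per-VM value computed below could
 * advertise more vCPUs than the TDX module accepts.
 */
static bool tdx_max_vcpus_sane(int max_vcpus_per_td, int present_cpus)
{
	return max_vcpus_per_td >= present_cpus;
}

/* Per-VM limit as described: min_t(int, kvm->max_vcpus, num_present_cpus()). */
static int tdx_per_vm_max_vcpus(int kvm_max_vcpus, int present_cpus)
{
	return kvm_max_vcpus < present_cpus ? kvm_max_vcpus : present_cpus;
}
```

The per-VM value never consults the module limit directly, which is why the separate check is what keeps the reported number within what the module supports.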

Let me flag Kai to confirm there was not some known violating configuration. He
explored a bunch of edge cases on this corner.

> 19: ok
> 20: ok
> 21: ok
> 
> 22: missing review comment from v1
> 
> > +     /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> > +     if (!vcpu->arch.apic)
> > +             return -EINVAL;
> 
> nit: Use kvm_apic_present()

Oops, nice catch.

> 
> 23: ok
> 
> 24: need to apply fix
> 
> -       if (sub_leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> +       if (leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
> 
> 25: ok

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-04  1:43   ` Edgecombe, Rick P
@ 2025-01-05 21:32     ` Huang, Kai
  2025-01-07  7:37     ` Tony Lindgren
  2025-01-22  8:27     ` Tony Lindgren
  2 siblings, 0 replies; 103+ messages in thread
From: Huang, Kai @ 2025-01-05 21:32 UTC (permalink / raw)
  To: pbonzini@redhat.com, Edgecombe, Rick P
  Cc: isaku.yamahata@gmail.com, seanjc@google.com, Li, Xiaoyao,
	Chatre, Reinette, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Sat, 2025-01-04 at 01:43 +0000, Edgecombe, Rick P wrote:
> > 18: not sure why the check against num_present_cpus() is needed?
> 
> The per-VM KVM_MAX_VCPUS will be min_t(int, kvm->max_vcpus, num_present_cpus()).
> So if td_conf->max_vcpus_per_td < num_present_cpus(), then it might report
> supporting more CPUs than actually supported by the TDX module.

Right.

> 
> As to why not just report td_conf->max_vcpus_per_td, that value is the max CPUs
> that are supported by any platform the TDX module supports. So it is more about
> what the TDX module supports than what userspace cares about (how many vCPUs
> they can use).

Sean didn't want to make reporting maximum vCPUs depend on the whims of the TDX
module, since that doesn't provide a predictable ABI:

https://lore.kernel.org/kvm/ZmzaqRy2zjvlsDfL@google.com/

> 
> I think we could probably get by without the check and blame the TDX module if
> it does something strange. It seems safer ABI-wise to have the check. But we
> are being a bit more cavalier around protecting against TDX supported CPUID bit
> changes than originally planned, so the check here now seems inconsistent.
> 
> Let me flag Kai to confirm there was not some known violating configuration. He
> explored a bunch of edge cases on this corner.

In practice the "max_vcpus_per_td" will never be smaller than the maximum logical
CPUs that ALL the platforms that the module supports can possibly have.  I got
this from the TDX module guys, and I don't think there's any reason for the TDX
module to break this.

However, from the module ABI's perspective (from the JSON), it could be any value,
so I think we should have a sanity check.  I think this is also different from the
"array size of CPUID_CONFIGs" ABI breakage (assuming this is what you meant by
"protecting TDX supported CPUID bits" above) since it is currently documented as
32 in the JSON.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-04  1:43   ` Edgecombe, Rick P
  2025-01-05 21:32     ` Huang, Kai
@ 2025-01-07  7:37     ` Tony Lindgren
  2025-01-07 12:41       ` Nikolay Borisov
  2025-01-22  8:27     ` Tony Lindgren
  2 siblings, 1 reply; 103+ messages in thread
From: Tony Lindgren @ 2025-01-07  7:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette

On Sat, Jan 04, 2025 at 01:43:56AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-12-23 at 17:25 +0100, Paolo Bonzini wrote:
> > 22: missing review comment from v1
> > 
> > > +     /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> > > +     if (!vcpu->arch.apic)
> > > +             return -EINVAL;
> > 
> > nit: Use kvm_apic_present()
> 
> Oops, nice catch.

Sorry this fell through. I made a patch for this earlier but missed it
while rebasing to a later dev branch and never sent it.

Below is a rebased version against the current KVM CoCo queue to fold
in if still needed. Sounds like this might be already dealt with in
Paolo's upcoming CoCo queue branch though.

Regards,

Tony

8< --------------------
From aac264e9923c15522baf9ae765b1d58165c24523 Mon Sep 17 00:00:00 2001
From: Tony Lindgren <tony.lindgren@linux.intel.com>
Date: Mon, 2 Sep 2024 13:52:20 +0300
Subject: [PATCH 1/1] KVM/TDX: Use kvm_apic_present() in tdx_vcpu_create()

Use kvm_apic_present() in tdx_vcpu_create(). We need to now export
apic_hw_disabled for kvm-intel to use it.

Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
 arch/x86/kvm/lapic.c   | 2 ++
 arch/x86/kvm/vmx/tdx.c | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fcf3a8907196..2b83092eace2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,6 +139,8 @@ __read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
 EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu);
 
 __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
+EXPORT_SYMBOL_GPL(apic_hw_disabled);
+
 __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
 
 static inline int apic_enabled(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d0dc3200fa37..6c68567d964d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@
 #include "capabilities.h"
 #include "mmu.h"
 #include "x86_ops.h"
+#include "lapic.h"
 #include "tdx.h"
 #include "vmx.h"
 #include "mmu/spte.h"
@@ -674,7 +675,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 		return -EIO;
 
 	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
-	if (!vcpu->arch.apic)
+	if (!kvm_apic_present(vcpu))
 		return -EINVAL;
 
 	fpstate_set_confidential(&vcpu->arch.guest_fpu);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-07  7:37     ` Tony Lindgren
@ 2025-01-07 12:41       ` Nikolay Borisov
  2025-01-08  5:28         ` Tony Lindgren
  0 siblings, 1 reply; 103+ messages in thread
From: Nikolay Borisov @ 2025-01-07 12:41 UTC (permalink / raw)
  To: Tony Lindgren, Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette



On 7.01.25 г. 9:37 ч., Tony Lindgren wrote:
> On Sat, Jan 04, 2025 at 01:43:56AM +0000, Edgecombe, Rick P wrote:
>> On Mon, 2024-12-23 at 17:25 +0100, Paolo Bonzini wrote:
>>> 22: missing review comment from v1
>>>
>>>> +     /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
>>>> +     if (!vcpu->arch.apic)
>>>> +             return -EINVAL;
>>>
>>> nit: Use kvm_apic_present()
>>
>> Oops, nice catch.
> 
> Sorry this fell through. I made a patch for this earlier but missed it
> while rebasing to a later dev branch and never sent it.
> 
> Below is a rebased version against the current KVM CoCo queue to fold
> in if still needed. Sounds like this might be already dealt with in
> Paolo's upcoming CoCo queue branch though.
> 
> Regards,
> 
> Tony
> 
> 8< --------------------
>  From aac264e9923c15522baf9ae765b1d58165c24523 Mon Sep 17 00:00:00 2001
> From: Tony Lindgren <tony.lindgren@linux.intel.com>
> Date: Mon, 2 Sep 2024 13:52:20 +0300
> Subject: [PATCH 1/1] KVM/TDX: Use kvm_apic_present() in tdx_vcpu_create()
> 
> Use kvm_apic_present() in tdx_vcpu_create(). We need to now export
> apic_hw_disabled for kvm-intel to use it.
> 
> Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> ---
>   arch/x86/kvm/lapic.c   | 2 ++
>   arch/x86/kvm/vmx/tdx.c | 3 ++-
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fcf3a8907196..2b83092eace2 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -139,6 +139,8 @@ __read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
>   EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu);
>   
>   __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> +EXPORT_SYMBOL_GPL(apic_hw_disabled);

Is it really required to expose this symbol? apic_hw_disabled is defined 
as static inline in the header?

> +
>   __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
>   
>   static inline int apic_enabled(struct kvm_lapic *apic)
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d0dc3200fa37..6c68567d964d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -8,6 +8,7 @@
>   #include "capabilities.h"
>   #include "mmu.h"
>   #include "x86_ops.h"
> +#include "lapic.h"
>   #include "tdx.h"
>   #include "vmx.h"
>   #include "mmu/spte.h"
> @@ -674,7 +675,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   		return -EIO;
>   
>   	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> -	if (!vcpu->arch.apic)
> +	if (!kvm_apic_present(vcpu))
>   		return -EINVAL;
>   
>   	fpstate_set_confidential(&vcpu->arch.guest_fpu);


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-07 12:41       ` Nikolay Borisov
@ 2025-01-08  5:28         ` Tony Lindgren
  2025-01-08 15:01           ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Tony Lindgren @ 2025-01-08  5:28 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai, Li, Xiaoyao, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Zhao, Yan Y,
	Chatre, Reinette

On Tue, Jan 07, 2025 at 02:41:51PM +0200, Nikolay Borisov wrote:
> On 7.01.25 г. 9:37 ч., Tony Lindgren wrote:
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -139,6 +139,8 @@ __read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
> >   EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu);
> >   __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> > +EXPORT_SYMBOL_GPL(apic_hw_disabled);
> 
> Is it really required to expose this symbol? apic_hw_disabled is defined as
> static inline in the header?

For loadable modules yes, otherwise we'll get:

ERROR: modpost: "apic_hw_disabled" [arch/x86/kvm/kvm-intel.ko] undefined!

This is similar to the EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu) already
there.

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-08  5:28         ` Tony Lindgren
@ 2025-01-08 15:01           ` Sean Christopherson
  2025-01-09  7:04             ` Tony Lindgren
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2025-01-08 15:01 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Nikolay Borisov, Rick P Edgecombe, pbonzini@redhat.com, Kai Huang,
	Xiaoyao Li, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Yan Y Zhao,
	Reinette Chatre

On Wed, Jan 08, 2025, Tony Lindgren wrote:
> On Tue, Jan 07, 2025 at 02:41:51PM +0200, Nikolay Borisov wrote:
> > On 7.01.25 г. 9:37 ч., Tony Lindgren wrote:
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -139,6 +139,8 @@ __read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
> > >   EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu);
> > >   __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> > > +EXPORT_SYMBOL_GPL(apic_hw_disabled);
> > 
> > Is it really required to expose this symbol? apic_hw_disabled is defined as
> > static inline in the header?

No, apic_hw_disabled can't be "static inline", because it's a variable, not a
function.

> For loadable modules yes, otherwise we'll get:
> 
> ERROR: modpost: "apic_hw_disabled" [arch/x86/kvm/kvm-intel.ko] undefined!
> 
> This is similar to the EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu) already
> there.

Heh, which is a hint that you're using the wrong helper.  TDX should check
lapic_in_kernel(), not kvm_apic_present().  The former verifies that local APIC
emulation/virtualization is handled in-kernel, i.e. by KVM.  The latter checks
that the local APIC is in-kernel *and* that the vCPU's local APIC is hardware
enabled, and checking that the local APIC is hardware enabled is unnecessary
and only works by sheer dumb luck.

The only reason kvm_create_lapic() stuffs the enable bit is to avoid toggling
the static key, which incurs costly IPIs to patch kernel text.  If
apic_hw_disabled were to be removed (which is somewhat seriously being considered),
this code would be deleted and TDX would break.

	/*
	 * Stuff the APIC ENABLE bit in lieu of temporarily incrementing
	 * apic_hw_disabled; the full RESET value is set by kvm_lapic_reset().
	 */
	vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;
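
The difference between the two helpers can be shown with a simplified standalone model. This is only a sketch: the struct and function names below are hypothetical, and the static keys (kvm_has_noapic_vcpu, apic_hw_disabled) that the real helpers consult first are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* APIC global enable bit in IA32_APIC_BASE (bit 11). */
#define MSR_IA32_APICBASE_ENABLE (1ULL << 11)

/* Hypothetical, stripped-down stand-ins for the KVM structures. */
struct kvm_lapic_model { int unused; };
struct kvm_vcpu_model {
	struct kvm_lapic_model *apic;	/* NULL when the APIC lives in userspace */
	uint64_t apic_base;		/* value of IA32_APIC_BASE */
};

/* Models lapic_in_kernel(): is the local APIC emulated by KVM at all? */
static bool model_lapic_in_kernel(const struct kvm_vcpu_model *vcpu)
{
	return vcpu->apic != NULL;
}

/* Models kvm_apic_present(): in-kernel *and* hardware enabled. */
static bool model_apic_present(const struct kvm_vcpu_model *vcpu)
{
	return model_lapic_in_kernel(vcpu) &&
	       (vcpu->apic_base & MSR_IA32_APICBASE_ENABLE);
}
```

In this model, a vCPU with an in-kernel APIC whose apic_base does not yet have the enable bit set passes the first check but fails the second, which is the case that only works today because kvm_create_lapic() stuffs the enable bit.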

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-08 15:01           ` Sean Christopherson
@ 2025-01-09  7:04             ` Tony Lindgren
  0 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2025-01-09  7:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Nikolay Borisov, Rick P Edgecombe, pbonzini@redhat.com, Kai Huang,
	Xiaoyao Li, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Yan Y Zhao,
	Reinette Chatre

On Wed, Jan 08, 2025 at 07:01:01AM -0800, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Tony Lindgren wrote:
> > On Tue, Jan 07, 2025 at 02:41:51PM +0200, Nikolay Borisov wrote:
> > > On 7.01.25 г. 9:37 ч., Tony Lindgren wrote:
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -139,6 +139,8 @@ __read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
> > > >   EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu);
> > > >   __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> > > > +EXPORT_SYMBOL_GPL(apic_hw_disabled);
> > > 
> > > Is it really required to expose this symbol? apic_hw_disabled is defined as
> > > static inline in the header?
> 
> No, apic_hw_disabled can't be "static inline", because it's a variable, not a
> function.
> 
> > For loadable modules yes, otherwise we'll get:
> > 
> > ERROR: modpost: "apic_hw_disabled" [arch/x86/kvm/kvm-intel.ko] undefined!
> > 
> > This is similar to the EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu) already
> > there.
> 
> Heh, which is a hint that you're using the wrong helper.  TDX should check
> lapic_in_kernel(), not kvm_apic_present().  The former verifies that local APIC
> > emulation/virtualization is handled in-kernel, i.e. by KVM.  The latter checks
> that the local APIC is in-kernel *and* that the vCPU's local APIC is hardware
> enabled, and checking that the local APIC is hardware enabled is unnecessary
> and only works by sheer dumb luck.

OK makes sense :)

> The only reason kvm_create_lapic() stuffs the enable bit is to avoid toggling
> the static key, which incurs costly IPIs to patch kernel text.  If
> apic_hw_disabled were to be removed (which is somewhat seriously being considered),
> this code would be deleted and TDX would break.
> 
> 	/*
> 	 * Stuff the APIC ENABLE bit in lieu of temporarily incrementing
> 	 * apic_hw_disabled; the full RESET value is set by kvm_lapic_reset().
> 	 */
> 	vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;

Thanks for the clarification. Updated patch below for reference in
case it's still needed.

Regards,

Tony

8< ----------------------
From 1e4b72fe4a69f0bdd7c8379315b97be79fb6cf8a Mon Sep 17 00:00:00 2001
From: Tony Lindgren <tony.lindgren@linux.intel.com>
Date: Mon, 2 Sep 2024 13:52:20 +0300
Subject: [PATCH 1/1] KVM/TDX: Use lapic_in_kernel() in tdx_vcpu_create()

Use lapic_in_kernel() in tdx_vcpu_create().

Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d0dc3200fa37..b905a7c9e2ff 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@
 #include "capabilities.h"
 #include "mmu.h"
 #include "x86_ops.h"
+#include "lapic.h"
 #include "tdx.h"
 #include "vmx.h"
 #include "mmu/spte.h"
@@ -674,7 +675,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 		return -EIO;
 
 	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
-	if (!vcpu->arch.apic)
+	if (!lapic_in_kernel(vcpu))
 		return -EINVAL;
 
 	fpstate_set_confidential(&vcpu->arch.guest_fpu);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 00/25] TDX vCPU/VM creation
  2025-01-04  1:43   ` Edgecombe, Rick P
  2025-01-05 21:32     ` Huang, Kai
  2025-01-07  7:37     ` Tony Lindgren
@ 2025-01-22  8:27     ` Tony Lindgren
  2 siblings, 0 replies; 103+ messages in thread
From: Tony Lindgren @ 2025-01-22  8:27 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai, Li, Xiaoyao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Zhao, Yan Y, Chatre, Reinette

On Sat, Jan 04, 2025 at 01:43:56AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-12-23 at 17:25 +0100, Paolo Bonzini wrote:
> > 11: see the type safety comment above:
> > > The ugly part here is the type-unsafety of to_vmx/to_tdx.  We probably
> > > should add some "#pragma poison" of to_vmx/to_tdx: for example both can
> > > be poisoned in pmu_intel.c after the definition of
> > > vcpu_to_lbr_records(), while one of them can be poisoned in
> > > sgx.c/posted_intr.c/vmx.c/tdx.c.
> 
> I left it off because you said "Not a strict requirement though." and gave it an
> RB tag. Other stuff seemed higher priority. We can look at some options for a
> follow-on patch if it lightens your load.

I suggest we do this:

- Make to_kvm_tdx() and to_tdx() private to tdx.c as they're only used
  in tdx.c

- Add pragma poison to_vmx at the top of tdx.c

- Add pragma poison to_vmx in pmu_intel.c after vcpu_to_lbr_records()

Other pragma poison to_vmx can be added as needed, but AFAIK there's
no need to add it for to_tdx().
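
For reference, a minimal standalone sketch of how the poisoning would behave, using the GCC/Clang spelling "#pragma GCC poison". The structs and the tdx_state_of() helper are hypothetical simplifications and do not reflect the real vcpu_vmx/vcpu_tdx layouts.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real KVM structures. */
struct kvm_vcpu { int id; };
struct vcpu_vmx { struct kvm_vcpu vcpu; int vmx_state; };
struct vcpu_tdx { struct kvm_vcpu vcpu; int tdx_state; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
{
	return container_of(vcpu, struct vcpu_vmx, vcpu);
}

static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
{
	return container_of(vcpu, struct vcpu_tdx, vcpu);
}

/* From here on, any mention of to_vmx in this file is a compile error. */
#pragma GCC poison to_vmx

/* TDX-only code can still use to_tdx(), but can no longer reach for to_vmx(). */
static int tdx_state_of(struct kvm_vcpu *vcpu)
{
	return to_tdx(vcpu)->tdx_state;
}
```

Since both converters accept the same struct kvm_vcpu pointer, the compiler cannot otherwise catch a to_vmx() call on a TDX vCPU; poisoning turns that mistake into a build failure in the files where only one flavor is valid.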

Regards,

Tony

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2025-01-23 19:44 UTC | newest]

Thread overview: 103+ messages
2024-10-30 19:00 [PATCH v2 00/25] TDX vCPU/VM creation Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 01/25] x86/virt/tdx: Share the global metadata structure for KVM to use Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 02/25] KVM: TDX: Get TDX global information Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 03/25] x86/virt/tdx: Read essential global metadata for KVM Rick Edgecombe
2024-12-06  8:37   ` Xiaoyao Li
2024-12-06 16:13     ` Huang, Kai
2024-12-06 16:18       ` Huang, Kai
2024-12-06 16:24       ` Dave Hansen
2024-12-07  0:00         ` Huang, Kai
2024-12-12  0:31           ` Edgecombe, Rick P
2024-12-21  1:17             ` Huang, Kai
2024-12-21  1:07   ` [PATCH v2.1 " Kai Huang
2024-10-30 19:00 ` [PATCH v2 04/25] x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyID Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 05/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID management Rick Edgecombe
2024-11-12 20:09   ` Dave Hansen
2024-11-14  0:01     ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 06/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creation Rick Edgecombe
2024-11-12 20:17   ` Dave Hansen
2024-11-12 21:21     ` Edgecombe, Rick P
2024-11-12 21:40       ` Dave Hansen
2024-10-30 19:00 ` [PATCH v2 07/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creation Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 08/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache management Rick Edgecombe
2024-10-31  3:57   ` Yan Zhao
2024-10-31 18:57     ` Edgecombe, Rick P
2024-10-31 23:33       ` Huang, Kai
2024-11-13  0:20   ` Dave Hansen
2024-11-13 20:51     ` Edgecombe, Rick P
2024-11-13 21:08       ` Dave Hansen
2024-11-13 21:25         ` Huang, Kai
2024-11-13 22:01           ` Edgecombe, Rick P
2024-11-13 21:44         ` Edgecombe, Rick P
2024-11-13 21:50           ` Dave Hansen
2024-11-13 22:00             ` Edgecombe, Rick P
2024-11-14  0:21               ` Huang, Kai
2024-11-14  0:32                 ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 09/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field access Rick Edgecombe
2025-01-05  9:45   ` Francesco Lavra
2025-01-06 18:59     ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 10/25] x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operations Rick Edgecombe
2024-11-13  1:11   ` Dave Hansen
2024-11-13 21:18     ` Edgecombe, Rick P
2024-11-13 21:41       ` Dave Hansen
2024-11-13 21:48         ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 11/25] KVM: TDX: Add placeholders for TDX VM/vCPU structures Rick Edgecombe
2025-01-05 10:58   ` Francesco Lavra
2025-01-06 19:00     ` Edgecombe, Rick P
2025-01-22  7:52     ` Tony Lindgren
2024-10-30 19:00 ` [PATCH v2 12/25] KVM: TDX: Define TDX architectural definitions Rick Edgecombe
2024-10-30 22:38   ` Huang, Kai
2024-10-30 22:53     ` Huang, Kai
2024-10-30 19:00 ` [PATCH v2 13/25] KVM: TDX: Add TDX "architectural" error codes Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 14/25] KVM: TDX: Add helper functions to print TDX SEAMCALL error Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 15/25] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 16/25] KVM: TDX: Get system-wide info about TDX module on initialization Rick Edgecombe
2024-10-31  9:09   ` Binbin Wu
2024-10-31  9:18     ` Tony Lindgren
2024-10-31  9:22       ` Binbin Wu
2024-10-31  9:23     ` Xiaoyao Li
2024-10-31  9:37       ` Tony Lindgren
2024-10-31 14:27         ` Xiaoyao Li
2024-11-01  8:19           ` Tony Lindgren
2024-12-06  8:45   ` Xiaoyao Li
2024-12-10  9:35     ` Tony Lindgren
2025-01-08  2:34   ` Chao Gao
2025-01-08  5:41     ` Huang, Kai
2024-10-30 19:00 ` [PATCH v2 17/25] KVM: TDX: create/destroy VM structure Rick Edgecombe
2024-11-04  2:03   ` Chao Gao
2024-11-04  5:59     ` Tony Lindgren
2024-10-30 19:00 ` [PATCH v2 18/25] KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check Rick Edgecombe
2025-01-05 22:12   ` Huang, Kai
2025-01-06 19:09     ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 19/25] KVM: TDX: initialize VM with TDX specific parameters Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 20/25] KVM: TDX: Make pmu_intel.c ignore guest TD case Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 21/25] KVM: TDX: Don't offline the last cpu of one package when there's TDX guest Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 22/25] KVM: TDX: create/free TDX vcpu structure Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 23/25] KVM: TDX: Do TDX specific vcpu initialization Rick Edgecombe
2024-10-30 19:00 ` [PATCH v2 24/25] KVM: x86: Introduce KVM_TDX_GET_CPUID Rick Edgecombe
2024-11-01  6:39   ` Binbin Wu
2024-11-01 16:03     ` Edgecombe, Rick P
2025-01-09 11:07   ` Francesco Lavra
2025-01-10  4:29     ` Xiaoyao Li
2025-01-10 10:34       ` Francesco Lavra
2025-01-10  4:47   ` Xiaoyao Li
2025-01-21 20:24     ` Edgecombe, Rick P
2025-01-22  7:43       ` Xiaoyao Li
2025-01-23 19:44         ` Edgecombe, Rick P
2025-01-21 23:19     ` Edgecombe, Rick P
2024-10-30 19:00 ` [PATCH v2 25/25] KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level Rick Edgecombe
2024-10-31 19:21 ` [PATCH v2 00/25] TDX vCPU/VM creation Adrian Hunter
2024-11-11  9:49   ` Tony Lindgren
2024-11-12  7:26     ` Adrian Hunter
2024-11-12  9:57       ` Tony Lindgren
2024-11-12 21:26   ` Edgecombe, Rick P
2024-12-10 18:22 ` Paolo Bonzini
2024-12-23 16:25 ` Paolo Bonzini
2025-01-04  1:43   ` Edgecombe, Rick P
2025-01-05 21:32     ` Huang, Kai
2025-01-07  7:37     ` Tony Lindgren
2025-01-07 12:41       ` Nikolay Borisov
2025-01-08  5:28         ` Tony Lindgren
2025-01-08 15:01           ` Sean Christopherson
2025-01-09  7:04             ` Tony Lindgren
2025-01-22  8:27     ` Tony Lindgren
