* [PATCH v16 00/13] Add Secure TSC support for SNP guests
@ 2025-01-06 12:46 Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper Nikunj A Dadhania
` (12 more replies)
0 siblings, 13 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
This patchset is also available at:
https://github.com/AMDESE/linux-kvm/tree/sectsc-guest-latest
and is based on tip/master
Overview
--------
Secure TSC allows guests to securely use RDTSC/RDTSCP instructions, as the
parameters being used cannot be changed by the hypervisor once the guest is
launched. More details are in the AMD64 APM Vol 2, Section "Secure TSC".
In order to enable Secure TSC, SEV-SNP guests need to send a TSC_INFO guest
message before the APs are booted. Details from the TSC_INFO response will
then be used to program the VMSA before the APs are brought up. See the "SEV
Secure Nested Paging Firmware ABI Specification" document (currently at
https://www.amd.com/system/files/TechDocs/56860.pdf), section "TSC Info".
The SEV-guest driver implements the communication between the guest and the
AMD Security Processor. As the TSC_INFO message needs to be sent during
early boot before the APs are started, move the guest messaging code from
the sev-guest driver to coco/sev/core.c and provide well-defined APIs to the
sev-guest driver.
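For reference, below is a minimal sketch (condensed from snp_get_tsc_info()
in patch 05) of how the carved-out messaging APIs are used to fetch TSC_INFO
before the APs are brought up; error handling of the firmware status is
trimmed for brevity:

  static int __init secure_tsc_fetch_example(void)
  {
          struct snp_guest_request_ioctl rio = {};
          struct snp_tsc_info_resp *tsc_resp;
          struct snp_tsc_info_req *tsc_req;
          struct snp_guest_req req = {};
          struct snp_msg_desc *mdesc;
          int rc = -ENOMEM;

          tsc_req = kzalloc(sizeof(*tsc_req), GFP_KERNEL);
          /* Extra AUTHTAG_LEN bytes give room for the authtag while decrypting */
          tsc_resp = kzalloc(sizeof(*tsc_resp) + AUTHTAG_LEN, GFP_KERNEL);
          if (!tsc_req || !tsc_resp)
                  goto e_free;

          /* Map the secrets page and allocate the shared request/response pages */
          mdesc = snp_msg_alloc();
          if (IS_ERR_OR_NULL(mdesc))
                  goto e_free;

          /* Select the VMPCK of the running VMPL and set up the AES-GCM context */
          rc = snp_msg_init(mdesc, snp_vmpl);
          if (rc)
                  goto e_free_mdesc;

          req.msg_version = MSG_HDR_VER;
          req.msg_type    = SNP_MSG_TSC_INFO_REQ;
          req.vmpck_id    = snp_vmpl;
          req.req_buf     = tsc_req;
          req.req_sz      = sizeof(*tsc_req);
          req.resp_buf    = tsc_resp;
          req.resp_sz     = sizeof(*tsc_resp) + AUTHTAG_LEN;
          req.exit_code   = SVM_VMGEXIT_GUEST_REQUEST;

          /* Encrypt the request, issue the VMGEXIT, verify/decrypt the response */
          rc = snp_send_guest_request(mdesc, &req, &rio);
          if (!rc && !tsc_resp->status)
                  pr_debug("tsc_scale 0x%llx tsc_offset 0x%llx\n",
                           tsc_resp->tsc_scale, tsc_resp->tsc_offset);

  e_free_mdesc:
          snp_msg_free(mdesc);
  e_free:
          kfree(tsc_resp);
          kfree(tsc_req);
          return rc;
  }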
Patches:
01-02: Preparatory patches
03-04: Move the SNP guest messaging code from the SEV guest driver to
       SEV common code
05-10: Secure TSC enablement patches
11-12: Generic TSC improvements
13:    Allow the Secure TSC feature for SNP guests
Testing SecureTSC
-----------------
SecureTSC hypervisor patches, based on top of the SEV-SNP Guest MEMFD series:
https://github.com/AMDESE/linux-kvm/tree/sectsc-host-latest
QEMU changes:
https://github.com/nikunjad/qemu/tree/snp-securetsc-latest
QEMU command line for SEV-SNP with SecureTSC:
qemu-system-x86_64 -cpu EPYC-Milan-v2 -smp 4 \
-object memory-backend-memfd,id=ram1,size=1G,share=true,prealloc=false,reserve=false \
-object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
...
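To sanity check from inside the guest, the TSC rate observed by userspace can
be compared against the frequency reported by the guest kernel at boot. A
hypothetical standalone check (not part of this series) could look like:

  #include <stdio.h>
  #include <stdint.h>
  #include <time.h>
  #include <x86intrin.h>

  static uint64_t ns_now(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  int main(void)
  {
          struct timespec delay = { .tv_sec = 1 };
          uint64_t t0, t1, c0, c1;

          /* Sample the TSC across a one second sleep and derive its rate */
          t0 = ns_now();
          c0 = __rdtsc();
          nanosleep(&delay, NULL);
          c1 = __rdtsc();
          t1 = ns_now();

          printf("measured TSC rate: %.3f MHz\n",
                 (c1 - c0) * 1000.0 / (t1 - t0));
          return 0;
  }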
Changelog:
----------
v16:
* Rebased on latest tip/master
* prepatch: Remove is_vmpck_empty() and use memchr_inv() instead (Boris)
* prepatch: Use GFP_KERNEL for snp_init_crypto() (Boris)
* Use memset during cleanup and fix memory leak (Boris/Francesco)
* Use tsc_resp with extra allocation and get rid of the buf variable (Francesco)
* Drop RoBs and TBs, comment/commit message update
* Add switch case for #VC Secure TSC MSRs handling (Boris)
* Move CC_ATTR_GUEST_SNP_SECURE_TSC before the host defines to group the
guest defines (Tom)
* Cache the TSC frequency in kHz to avoid reading the MSR and multiplying
every time (Tom)
* Drop the patch preventing FW_BUG for not running at P0 frequency (Boris)
* Limit clock rating upgrade to virtualized environments as part of
CONFIG_PARAVIRT (Boris)
* Squash patches 9 and 12 to logically merge the changes that switch to TSC
from kvm-clock (Boris)
v15: https://lore.kernel.org/kvm/20241203090045.942078-1-nikunj@amd.com/
* Rebased on latest tip/master
* Update commits/comments (Boris)
* Add snp_msg_free() to free allocated buffers (Boris)
* Dynamically allocate buffers for sending TSC_INFO (Boris)
* Fix the build issue at patch#1 (Boris)
* Carve out tsc handling in __vc_handle_msr_tsc() (Boris)
Nikunj A Dadhania (13):
virt: sev-guest: Remove is_vmpck_empty() helper
virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
x86/sev: Carve out and export SNP guest messaging init routines
x86/sev: Relocate SNP guest messaging routines to common code
x86/sev: Add Secure TSC support for SNP guests
x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC
enabled guests
x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled
guests
x86/sev: Mark Secure TSC as reliable clocksource
x86/tsc: Switch Secure TSC guests away from kvm-clock
x86/tsc: Upgrade TSC clocksource rating for guests
x86/tsc: Switch to native sched clock
x86/sev: Allow Secure TSC feature for SNP guests
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/sev-common.h | 1 +
arch/x86/include/asm/sev.h | 48 +-
arch/x86/include/asm/svm.h | 6 +-
include/linux/cc_platform.h | 8 +
arch/x86/boot/compressed/sev.c | 3 +-
arch/x86/coco/core.c | 4 +
arch/x86/coco/sev/core.c | 652 +++++++++++++++++++++++-
arch/x86/coco/sev/shared.c | 10 +
arch/x86/kernel/kvmclock.c | 11 +
arch/x86/kernel/tsc.c | 43 ++
arch/x86/mm/mem_encrypt.c | 2 +
arch/x86/mm/mem_encrypt_amd.c | 4 +
drivers/virt/coco/sev-guest/sev-guest.c | 485 +-----------------
arch/x86/Kconfig | 1 +
drivers/virt/coco/sev-guest/Kconfig | 1 -
16 files changed, 791 insertions(+), 489 deletions(-)
base-commit: af2c8596bd2e455ae350ba1585bc938ee85aa38d
--
2.34.1
* [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 18:38 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL Nikunj A Dadhania
` (11 subsequent siblings)
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Remove the is_vmpck_empty() helper function, which compares the VMPCK
against a zeroed local array to check whether it is empty. Replace it with
memchr_inv(), which determines directly whether the VMPCK is empty without
the extra local array.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
drivers/virt/coco/sev-guest/sev-guest.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
index b699771be029..62328d0b2cb6 100644
--- a/drivers/virt/coco/sev-guest/sev-guest.c
+++ b/drivers/virt/coco/sev-guest/sev-guest.c
@@ -63,16 +63,6 @@ MODULE_PARM_DESC(vmpck_id, "The VMPCK ID to use when communicating with the PSP.
/* Mutex to serialize the shared buffer access and command handling. */
static DEFINE_MUTEX(snp_cmd_mutex);
-static bool is_vmpck_empty(struct snp_msg_desc *mdesc)
-{
- char zero_key[VMPCK_KEY_LEN] = {0};
-
- if (mdesc->vmpck)
- return !memcmp(mdesc->vmpck, zero_key, VMPCK_KEY_LEN);
-
- return true;
-}
-
/*
* If an error is received from the host or AMD Secure Processor (ASP) there
* are two options. Either retry the exact same encrypted request or discontinue
@@ -335,7 +325,7 @@ static int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_r
guard(mutex)(&snp_cmd_mutex);
/* Check if the VMPCK is not empty */
- if (is_vmpck_empty(mdesc)) {
+ if (!mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
pr_err_ratelimited("VMPCK is disabled\n");
return -ENOTTY;
}
@@ -1024,7 +1014,7 @@ static int __init sev_guest_probe(struct platform_device *pdev)
}
/* Verify that VMPCK is not zero. */
- if (is_vmpck_empty(mdesc)) {
+ if (!memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
dev_err(dev, "Empty VMPCK%d communication key\n", vmpck_id);
goto e_unmap;
}
--
2.34.1
* [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 18:40 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 03/13] x86/sev: Carve out and export SNP guest messaging init routines Nikunj A Dadhania
` (10 subsequent siblings)
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL in the sev-guest driver code.
GFP_KERNEL_ACCOUNT is typically used to account allocations that are
triggered by untrusted userspace. After auditing the sev-guest code, the
following changes are necessary:
* snp_init_crypto(): Use GFP_KERNEL as this is a trusted device probe
path.
Retain GFP_KERNEL_ACCOUNT in the following cases for robustness and
specific path requirements:
* alloc_shared_pages(): Although all allocations are limited, retain
GFP_KERNEL_ACCOUNT for future robustness.
* get_report() and get_ext_report(): These functions are on the unlocked
ioctl path and should continue using GFP_KERNEL_ACCOUNT.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
drivers/virt/coco/sev-guest/sev-guest.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
index 62328d0b2cb6..250ce92d816b 100644
--- a/drivers/virt/coco/sev-guest/sev-guest.c
+++ b/drivers/virt/coco/sev-guest/sev-guest.c
@@ -141,7 +141,7 @@ static struct aesgcm_ctx *snp_init_crypto(u8 *key, size_t keylen)
{
struct aesgcm_ctx *ctx;
- ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return NULL;
--
2.34.1
* [PATCH v16 03/13] x86/sev: Carve out and export SNP guest messaging init routines
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 04/13] x86/sev: Relocate SNP guest messaging routines to common code Nikunj A Dadhania
` (9 subsequent siblings)
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Currently, the sev-guest driver is the only user of SNP guest messaging.
All routines for initializing SNP guest messaging are implemented within
the sev-guest driver and are not available during early boot. In
preparation for adding Secure TSC guest support, carve out APIs to allocate
and initialize the guest messaging descriptor context and make it part of
coco/sev/core.c. As there is no user of sev_guest_platform_data anymore,
remove the structure.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev.h | 13 +-
arch/x86/coco/sev/core.c | 183 ++++++++++++++++++++++-
drivers/virt/coco/sev-guest/sev-guest.c | 185 +++---------------------
arch/x86/Kconfig | 1 +
drivers/virt/coco/sev-guest/Kconfig | 1 -
5 files changed, 208 insertions(+), 175 deletions(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 91f08af31078..db08d0ac90be 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -14,6 +14,7 @@
#include <asm/insn.h>
#include <asm/sev-common.h>
#include <asm/coco.h>
+#include <asm/set_memory.h>
#define GHCB_PROTOCOL_MIN 1ULL
#define GHCB_PROTOCOL_MAX 2ULL
@@ -170,10 +171,6 @@ struct snp_guest_msg {
u8 payload[PAGE_SIZE - sizeof(struct snp_guest_msg_hdr)];
} __packed;
-struct sev_guest_platform_data {
- u64 secrets_gpa;
-};
-
struct snp_guest_req {
void *req_buf;
size_t req_sz;
@@ -253,6 +250,7 @@ struct snp_msg_desc {
u32 *os_area_msg_seqno;
u8 *vmpck;
+ int vmpck_id;
};
/*
@@ -458,6 +456,10 @@ void set_pte_enc_mask(pte_t *kpte, unsigned long pfn, pgprot_t new_prot);
void snp_kexec_finish(void);
void snp_kexec_begin(void);
+int snp_msg_init(struct snp_msg_desc *mdesc, int vmpck_id);
+struct snp_msg_desc *snp_msg_alloc(void);
+void snp_msg_free(struct snp_msg_desc *mdesc);
+
#else /* !CONFIG_AMD_MEM_ENCRYPT */
#define snp_vmpl 0
@@ -498,6 +500,9 @@ static inline int prepare_pte_enc(struct pte_enc_desc *d) { return 0; }
static inline void set_pte_enc_mask(pte_t *kpte, unsigned long pfn, pgprot_t new_prot) { }
static inline void snp_kexec_finish(void) { }
static inline void snp_kexec_begin(void) { }
+static inline int snp_msg_init(struct snp_msg_desc *mdesc, int vmpck_id) { return -1; }
+static inline struct snp_msg_desc *snp_msg_alloc(void) { return NULL; }
+static inline void snp_msg_free(struct snp_msg_desc *mdesc) { }
#endif /* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 499b41953e3c..93627a21945d 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -25,6 +25,7 @@
#include <linux/psp-sev.h>
#include <linux/dmi.h>
#include <uapi/linux/sev-guest.h>
+#include <crypto/gcm.h>
#include <asm/init.h>
#include <asm/cpu_entry_area.h>
@@ -2575,15 +2576,9 @@ static struct platform_device sev_guest_device = {
static int __init snp_init_platform_device(void)
{
- struct sev_guest_platform_data data;
-
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return -ENODEV;
- data.secrets_gpa = secrets_pa;
- if (platform_device_add_data(&sev_guest_device, &data, sizeof(data)))
- return -ENODEV;
-
if (platform_device_register(&sev_guest_device))
return -ENODEV;
@@ -2662,3 +2657,179 @@ static int __init sev_sysfs_init(void)
}
arch_initcall(sev_sysfs_init);
#endif // CONFIG_SYSFS
+
+static void free_shared_pages(void *buf, size_t sz)
+{
+ unsigned int npages = PAGE_ALIGN(sz) >> PAGE_SHIFT;
+ int ret;
+
+ if (!buf)
+ return;
+
+ ret = set_memory_encrypted((unsigned long)buf, npages);
+ if (ret) {
+ WARN_ONCE(ret, "failed to restore encryption mask (leak it)\n");
+ return;
+ }
+
+ __free_pages(virt_to_page(buf), get_order(sz));
+}
+
+static void *alloc_shared_pages(size_t sz)
+{
+ unsigned int npages = PAGE_ALIGN(sz) >> PAGE_SHIFT;
+ struct page *page;
+ int ret;
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, get_order(sz));
+ if (!page)
+ return NULL;
+
+ ret = set_memory_decrypted((unsigned long)page_address(page), npages);
+ if (ret) {
+ pr_err("failed to mark page shared, ret=%d\n", ret);
+ __free_pages(page, get_order(sz));
+ return NULL;
+ }
+
+ return page_address(page);
+}
+
+static u8 *get_vmpck(int id, struct snp_secrets_page *secrets, u32 **seqno)
+{
+ u8 *key = NULL;
+
+ switch (id) {
+ case 0:
+ *seqno = &secrets->os_area.msg_seqno_0;
+ key = secrets->vmpck0;
+ break;
+ case 1:
+ *seqno = &secrets->os_area.msg_seqno_1;
+ key = secrets->vmpck1;
+ break;
+ case 2:
+ *seqno = &secrets->os_area.msg_seqno_2;
+ key = secrets->vmpck2;
+ break;
+ case 3:
+ *seqno = &secrets->os_area.msg_seqno_3;
+ key = secrets->vmpck3;
+ break;
+ default:
+ break;
+ }
+
+ return key;
+}
+
+static struct aesgcm_ctx *snp_init_crypto(u8 *key, size_t keylen)
+{
+ struct aesgcm_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return NULL;
+
+ if (aesgcm_expandkey(ctx, key, keylen, AUTHTAG_LEN)) {
+ pr_err("Crypto context initialization failed\n");
+ kfree(ctx);
+ return NULL;
+ }
+
+ return ctx;
+}
+
+int snp_msg_init(struct snp_msg_desc *mdesc, int vmpck_id)
+{
+ /* Adjust the default VMPCK key based on the executing VMPL level */
+ if (vmpck_id == -1)
+ vmpck_id = snp_vmpl;
+
+ mdesc->vmpck = get_vmpck(vmpck_id, mdesc->secrets, &mdesc->os_area_msg_seqno);
+ if (!mdesc->vmpck) {
+ pr_err("Invalid VMPCK%d communication key\n", vmpck_id);
+ return -EINVAL;
+ }
+
+ /* Verify that VMPCK is not zero. */
+ if (!memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
+ pr_err("Empty VMPCK%d communication key\n", vmpck_id);
+ return -EINVAL;
+ }
+
+ mdesc->vmpck_id = vmpck_id;
+
+ mdesc->ctx = snp_init_crypto(mdesc->vmpck, VMPCK_KEY_LEN);
+ if (!mdesc->ctx)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(snp_msg_init);
+
+struct snp_msg_desc *snp_msg_alloc(void)
+{
+ struct snp_msg_desc *mdesc;
+ void __iomem *mem;
+
+ BUILD_BUG_ON(sizeof(struct snp_guest_msg) > PAGE_SIZE);
+
+ mdesc = kzalloc(sizeof(struct snp_msg_desc), GFP_KERNEL);
+ if (!mdesc)
+ return ERR_PTR(-ENOMEM);
+
+ mem = ioremap_encrypted(secrets_pa, PAGE_SIZE);
+ if (!mem)
+ goto e_free_mdesc;
+
+ mdesc->secrets = (__force struct snp_secrets_page *)mem;
+
+ /* Allocate the shared page used for the request and response message. */
+ mdesc->request = alloc_shared_pages(sizeof(struct snp_guest_msg));
+ if (!mdesc->request)
+ goto e_unmap;
+
+ mdesc->response = alloc_shared_pages(sizeof(struct snp_guest_msg));
+ if (!mdesc->response)
+ goto e_free_request;
+
+ mdesc->certs_data = alloc_shared_pages(SEV_FW_BLOB_MAX_SIZE);
+ if (!mdesc->certs_data)
+ goto e_free_response;
+
+ /* Initialize the input addresses for guest request */
+ mdesc->input.req_gpa = __pa(mdesc->request);
+ mdesc->input.resp_gpa = __pa(mdesc->response);
+ mdesc->input.data_gpa = __pa(mdesc->certs_data);
+
+ return mdesc;
+
+e_free_response:
+ free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
+e_free_request:
+ free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
+e_unmap:
+ iounmap(mem);
+e_free_mdesc:
+ kfree(mdesc);
+
+ return ERR_PTR(-ENOMEM);
+}
+EXPORT_SYMBOL_GPL(snp_msg_alloc);
+
+void snp_msg_free(struct snp_msg_desc *mdesc)
+{
+ if (!mdesc)
+ return;
+
+ kfree(mdesc->ctx);
+ free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
+ free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
+ free_shared_pages(mdesc->certs_data, SEV_FW_BLOB_MAX_SIZE);
+ iounmap((__force void __iomem *)mdesc->secrets);
+
+ memset(mdesc, 0, sizeof(*mdesc));
+ kfree(mdesc);
+}
+EXPORT_SYMBOL_GPL(snp_msg_free);
diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
index 250ce92d816b..d0f7233b1430 100644
--- a/drivers/virt/coco/sev-guest/sev-guest.c
+++ b/drivers/virt/coco/sev-guest/sev-guest.c
@@ -83,7 +83,7 @@ static DEFINE_MUTEX(snp_cmd_mutex);
static void snp_disable_vmpck(struct snp_msg_desc *mdesc)
{
pr_alert("Disabling VMPCK%d communication key to prevent IV reuse.\n",
- vmpck_id);
+ mdesc->vmpck_id);
memzero_explicit(mdesc->vmpck, VMPCK_KEY_LEN);
mdesc->vmpck = NULL;
}
@@ -137,23 +137,6 @@ static inline struct snp_guest_dev *to_snp_dev(struct file *file)
return container_of(dev, struct snp_guest_dev, misc);
}
-static struct aesgcm_ctx *snp_init_crypto(u8 *key, size_t keylen)
-{
- struct aesgcm_ctx *ctx;
-
- ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
- if (!ctx)
- return NULL;
-
- if (aesgcm_expandkey(ctx, key, keylen, AUTHTAG_LEN)) {
- pr_err("Crypto context initialization failed\n");
- kfree(ctx);
- return NULL;
- }
-
- return ctx;
-}
-
static int verify_and_dec_payload(struct snp_msg_desc *mdesc, struct snp_guest_req *req)
{
struct snp_guest_msg *resp_msg = &mdesc->secret_response;
@@ -404,7 +387,7 @@ static int get_report(struct snp_guest_dev *snp_dev, struct snp_guest_request_io
req.msg_version = arg->msg_version;
req.msg_type = SNP_MSG_REPORT_REQ;
- req.vmpck_id = vmpck_id;
+ req.vmpck_id = mdesc->vmpck_id;
req.req_buf = report_req;
req.req_sz = sizeof(*report_req);
req.resp_buf = report_resp->data;
@@ -451,7 +434,7 @@ static int get_derived_key(struct snp_guest_dev *snp_dev, struct snp_guest_reque
req.msg_version = arg->msg_version;
req.msg_type = SNP_MSG_KEY_REQ;
- req.vmpck_id = vmpck_id;
+ req.vmpck_id = mdesc->vmpck_id;
req.req_buf = derived_key_req;
req.req_sz = sizeof(*derived_key_req);
req.resp_buf = buf;
@@ -529,7 +512,7 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
req.msg_version = arg->msg_version;
req.msg_type = SNP_MSG_REPORT_REQ;
- req.vmpck_id = vmpck_id;
+ req.vmpck_id = mdesc->vmpck_id;
req.req_buf = &report_req->data;
req.req_sz = sizeof(report_req->data);
req.resp_buf = report_resp->data;
@@ -606,76 +589,11 @@ static long snp_guest_ioctl(struct file *file, unsigned int ioctl, unsigned long
return ret;
}
-static void free_shared_pages(void *buf, size_t sz)
-{
- unsigned int npages = PAGE_ALIGN(sz) >> PAGE_SHIFT;
- int ret;
-
- if (!buf)
- return;
-
- ret = set_memory_encrypted((unsigned long)buf, npages);
- if (ret) {
- WARN_ONCE(ret, "failed to restore encryption mask (leak it)\n");
- return;
- }
-
- __free_pages(virt_to_page(buf), get_order(sz));
-}
-
-static void *alloc_shared_pages(struct device *dev, size_t sz)
-{
- unsigned int npages = PAGE_ALIGN(sz) >> PAGE_SHIFT;
- struct page *page;
- int ret;
-
- page = alloc_pages(GFP_KERNEL_ACCOUNT, get_order(sz));
- if (!page)
- return NULL;
-
- ret = set_memory_decrypted((unsigned long)page_address(page), npages);
- if (ret) {
- dev_err(dev, "failed to mark page shared, ret=%d\n", ret);
- __free_pages(page, get_order(sz));
- return NULL;
- }
-
- return page_address(page);
-}
-
static const struct file_operations snp_guest_fops = {
.owner = THIS_MODULE,
.unlocked_ioctl = snp_guest_ioctl,
};
-static u8 *get_vmpck(int id, struct snp_secrets_page *secrets, u32 **seqno)
-{
- u8 *key = NULL;
-
- switch (id) {
- case 0:
- *seqno = &secrets->os_area.msg_seqno_0;
- key = secrets->vmpck0;
- break;
- case 1:
- *seqno = &secrets->os_area.msg_seqno_1;
- key = secrets->vmpck1;
- break;
- case 2:
- *seqno = &secrets->os_area.msg_seqno_2;
- key = secrets->vmpck2;
- break;
- case 3:
- *seqno = &secrets->os_area.msg_seqno_3;
- key = secrets->vmpck3;
- break;
- default:
- break;
- }
-
- return key;
-}
-
struct snp_msg_report_resp_hdr {
u32 status;
u32 report_size;
@@ -969,13 +887,10 @@ static void unregister_sev_tsm(void *data)
static int __init sev_guest_probe(struct platform_device *pdev)
{
- struct sev_guest_platform_data *data;
- struct snp_secrets_page *secrets;
struct device *dev = &pdev->dev;
struct snp_guest_dev *snp_dev;
struct snp_msg_desc *mdesc;
struct miscdevice *misc;
- void __iomem *mapping;
int ret;
BUILD_BUG_ON(sizeof(struct snp_guest_msg) > PAGE_SIZE);
@@ -983,115 +898,57 @@ static int __init sev_guest_probe(struct platform_device *pdev)
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return -ENODEV;
- if (!dev->platform_data)
- return -ENODEV;
-
- data = (struct sev_guest_platform_data *)dev->platform_data;
- mapping = ioremap_encrypted(data->secrets_gpa, PAGE_SIZE);
- if (!mapping)
- return -ENODEV;
-
- secrets = (__force void *)mapping;
-
- ret = -ENOMEM;
snp_dev = devm_kzalloc(&pdev->dev, sizeof(struct snp_guest_dev), GFP_KERNEL);
if (!snp_dev)
- goto e_unmap;
-
- mdesc = devm_kzalloc(&pdev->dev, sizeof(struct snp_msg_desc), GFP_KERNEL);
- if (!mdesc)
- goto e_unmap;
-
- /* Adjust the default VMPCK key based on the executing VMPL level */
- if (vmpck_id == -1)
- vmpck_id = snp_vmpl;
+ return -ENOMEM;
- ret = -EINVAL;
- mdesc->vmpck = get_vmpck(vmpck_id, secrets, &mdesc->os_area_msg_seqno);
- if (!mdesc->vmpck) {
- dev_err(dev, "Invalid VMPCK%d communication key\n", vmpck_id);
- goto e_unmap;
- }
+ mdesc = snp_msg_alloc();
+ if (IS_ERR_OR_NULL(mdesc))
+ return -ENOMEM;
- /* Verify that VMPCK is not zero. */
- if (!memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
- dev_err(dev, "Empty VMPCK%d communication key\n", vmpck_id);
- goto e_unmap;
- }
+ ret = snp_msg_init(mdesc, vmpck_id);
+ if (ret)
+ goto e_msg_init;
platform_set_drvdata(pdev, snp_dev);
snp_dev->dev = dev;
- mdesc->secrets = secrets;
-
- /* Allocate the shared page used for the request and response message. */
- mdesc->request = alloc_shared_pages(dev, sizeof(struct snp_guest_msg));
- if (!mdesc->request)
- goto e_unmap;
-
- mdesc->response = alloc_shared_pages(dev, sizeof(struct snp_guest_msg));
- if (!mdesc->response)
- goto e_free_request;
-
- mdesc->certs_data = alloc_shared_pages(dev, SEV_FW_BLOB_MAX_SIZE);
- if (!mdesc->certs_data)
- goto e_free_response;
-
- ret = -EIO;
- mdesc->ctx = snp_init_crypto(mdesc->vmpck, VMPCK_KEY_LEN);
- if (!mdesc->ctx)
- goto e_free_cert_data;
misc = &snp_dev->misc;
misc->minor = MISC_DYNAMIC_MINOR;
misc->name = DEVICE_NAME;
misc->fops = &snp_guest_fops;
- /* Initialize the input addresses for guest request */
- mdesc->input.req_gpa = __pa(mdesc->request);
- mdesc->input.resp_gpa = __pa(mdesc->response);
- mdesc->input.data_gpa = __pa(mdesc->certs_data);
-
/* Set the privlevel_floor attribute based on the vmpck_id */
- sev_tsm_ops.privlevel_floor = vmpck_id;
+ sev_tsm_ops.privlevel_floor = mdesc->vmpck_id;
ret = tsm_register(&sev_tsm_ops, snp_dev);
if (ret)
- goto e_free_cert_data;
+ goto e_msg_init;
ret = devm_add_action_or_reset(&pdev->dev, unregister_sev_tsm, NULL);
if (ret)
- goto e_free_cert_data;
+ goto e_msg_init;
ret = misc_register(misc);
if (ret)
- goto e_free_ctx;
+ goto e_msg_init;
snp_dev->msg_desc = mdesc;
- dev_info(dev, "Initialized SEV guest driver (using VMPCK%d communication key)\n", vmpck_id);
+ dev_info(dev, "Initialized SEV guest driver (using VMPCK%d communication key)\n",
+ mdesc->vmpck_id);
return 0;
-e_free_ctx:
- kfree(mdesc->ctx);
-e_free_cert_data:
- free_shared_pages(mdesc->certs_data, SEV_FW_BLOB_MAX_SIZE);
-e_free_response:
- free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
-e_free_request:
- free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
-e_unmap:
- iounmap(mapping);
+e_msg_init:
+ snp_msg_free(mdesc);
+
return ret;
}
static void __exit sev_guest_remove(struct platform_device *pdev)
{
struct snp_guest_dev *snp_dev = platform_get_drvdata(pdev);
- struct snp_msg_desc *mdesc = snp_dev->msg_desc;
- free_shared_pages(mdesc->certs_data, SEV_FW_BLOB_MAX_SIZE);
- free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
- free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
- kfree(mdesc->ctx);
+ snp_msg_free(snp_dev->msg_desc);
misc_deregister(&snp_dev->misc);
}
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e4e27d44dc2b..ba2ac635c2fc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1559,6 +1559,7 @@ config AMD_MEM_ENCRYPT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select UNACCEPTED_MEMORY
+ select CRYPTO_LIB_AESGCM
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/drivers/virt/coco/sev-guest/Kconfig b/drivers/virt/coco/sev-guest/Kconfig
index 0b772bd921d8..a6405ab6c2c3 100644
--- a/drivers/virt/coco/sev-guest/Kconfig
+++ b/drivers/virt/coco/sev-guest/Kconfig
@@ -2,7 +2,6 @@ config SEV_GUEST
tristate "AMD SEV Guest driver"
default m
depends on AMD_MEM_ENCRYPT
- select CRYPTO_LIB_AESGCM
select TSM_REPORTS
help
SEV-SNP firmware provides the guest a mechanism to communicate with
--
2.34.1
* [PATCH v16 04/13] x86/sev: Relocate SNP guest messaging routines to common code
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (2 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 03/13] x86/sev: Carve out and export SNP guest messaging init routines Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests Nikunj A Dadhania
` (8 subsequent siblings)
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
At present, the SEV guest driver exclusively handles SNP guest messaging.
All routines for sending guest messages are embedded within the guest
driver. To support Secure TSC, SEV-SNP guests must communicate with the AMD
Security Processor during early boot. However, these guest messaging
functions are not accessible during early boot since they are currently
part of the guest driver.
Hence, relocate the core SNP guest messaging functions to SEV common code
and provide an API for sending SNP guest messages.
No functional change intended: an export symbol is added for
snp_send_guest_request(), while the export symbol of
snp_issue_guest_request() is dropped and the function is made static.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/sev.h | 14 +-
arch/x86/coco/sev/core.c | 294 +++++++++++++++++++++++-
drivers/virt/coco/sev-guest/sev-guest.c | 292 -----------------------
3 files changed, 298 insertions(+), 302 deletions(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index db08d0ac90be..0937ac7a96db 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -125,6 +125,9 @@ struct snp_req_data {
#define AAD_LEN 48
#define MSG_HDR_VER 1
+#define SNP_REQ_MAX_RETRY_DURATION (60*HZ)
+#define SNP_REQ_RETRY_DELAY (2*HZ)
+
/* See SNP spec SNP_GUEST_REQUEST section for the structure */
enum msg_type {
SNP_MSG_TYPE_INVALID = 0,
@@ -443,8 +446,6 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void __noreturn snp_abort(void);
void snp_dmi_setup(void);
-int snp_issue_guest_request(struct snp_guest_req *req, struct snp_req_data *input,
- struct snp_guest_request_ioctl *rio);
int snp_issue_svsm_attest_req(u64 call_id, struct svsm_call *call, struct svsm_attest_call *input);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
@@ -459,6 +460,8 @@ void snp_kexec_begin(void);
int snp_msg_init(struct snp_msg_desc *mdesc, int vmpck_id);
struct snp_msg_desc *snp_msg_alloc(void);
void snp_msg_free(struct snp_msg_desc *mdesc);
+int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
+ struct snp_guest_request_ioctl *rio);
#else /* !CONFIG_AMD_MEM_ENCRYPT */
@@ -482,11 +485,6 @@ static inline void snp_set_wakeup_secondary_cpu(void) { }
static inline bool snp_init(struct boot_params *bp) { return false; }
static inline void snp_abort(void) { }
static inline void snp_dmi_setup(void) { }
-static inline int snp_issue_guest_request(struct snp_guest_req *req, struct snp_req_data *input,
- struct snp_guest_request_ioctl *rio)
-{
- return -ENOTTY;
-}
static inline int snp_issue_svsm_attest_req(u64 call_id, struct svsm_call *call, struct svsm_attest_call *input)
{
return -ENOTTY;
@@ -503,6 +501,8 @@ static inline void snp_kexec_begin(void) { }
static inline int snp_msg_init(struct snp_msg_desc *mdesc, int vmpck_id) { return -1; }
static inline struct snp_msg_desc *snp_msg_alloc(void) { return NULL; }
static inline void snp_msg_free(struct snp_msg_desc *mdesc) { }
+static inline int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
+ struct snp_guest_request_ioctl *rio) { return -ENODEV; }
#endif /* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 93627a21945d..feb145df6bf7 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2504,8 +2504,8 @@ int snp_issue_svsm_attest_req(u64 call_id, struct svsm_call *call,
}
EXPORT_SYMBOL_GPL(snp_issue_svsm_attest_req);
-int snp_issue_guest_request(struct snp_guest_req *req, struct snp_req_data *input,
- struct snp_guest_request_ioctl *rio)
+static int snp_issue_guest_request(struct snp_guest_req *req, struct snp_req_data *input,
+ struct snp_guest_request_ioctl *rio)
{
struct ghcb_state state;
struct es_em_ctxt ctxt;
@@ -2567,7 +2567,6 @@ int snp_issue_guest_request(struct snp_guest_req *req, struct snp_req_data *inpu
return ret;
}
-EXPORT_SYMBOL_GPL(snp_issue_guest_request);
static struct platform_device sev_guest_device = {
.name = "sev-guest",
@@ -2833,3 +2832,292 @@ void snp_msg_free(struct snp_msg_desc *mdesc)
kfree(mdesc);
}
EXPORT_SYMBOL_GPL(snp_msg_free);
+
+/* Mutex to serialize the shared buffer access and command handling. */
+static DEFINE_MUTEX(snp_cmd_mutex);
+
+/*
+ * If an error is received from the host or AMD Secure Processor (ASP) there
+ * are two options. Either retry the exact same encrypted request or discontinue
+ * using the VMPCK.
+ *
+ * This is because in the current encryption scheme GHCB v2 uses AES-GCM to
+ * encrypt the requests. The IV for this scheme is the sequence number. GCM
+ * cannot tolerate IV reuse.
+ *
+ * The ASP FW v1.51 only increments the sequence numbers on a successful
+ * guest<->ASP back and forth and only accepts messages at its exact sequence
+ * number.
+ *
+ * So if the sequence number were to be reused the encryption scheme is
+ * vulnerable. If the sequence number were incremented for a fresh IV the ASP
+ * will reject the request.
+ */
+static void snp_disable_vmpck(struct snp_msg_desc *mdesc)
+{
+ pr_alert("Disabling VMPCK%d communication key to prevent IV reuse.\n",
+ mdesc->vmpck_id);
+ memzero_explicit(mdesc->vmpck, VMPCK_KEY_LEN);
+ mdesc->vmpck = NULL;
+}
+
+static inline u64 __snp_get_msg_seqno(struct snp_msg_desc *mdesc)
+{
+ u64 count;
+
+ lockdep_assert_held(&snp_cmd_mutex);
+
+ /* Read the current message sequence counter from secrets pages */
+ count = *mdesc->os_area_msg_seqno;
+
+ return count + 1;
+}
+
+/* Return a non-zero on success */
+static u64 snp_get_msg_seqno(struct snp_msg_desc *mdesc)
+{
+ u64 count = __snp_get_msg_seqno(mdesc);
+
+ /*
+ * The message sequence counter for the SNP guest request is a 64-bit
+ * value but the version 2 of GHCB specification defines a 32-bit storage
+ * for it. If the counter exceeds the 32-bit value then return zero.
+ * The caller should check the return value, but if the caller happens to
+ * not check the value and use it, then the firmware treats zero as an
+ * invalid number and will fail the message request.
+ */
+ if (count >= UINT_MAX) {
+ pr_err("request message sequence counter overflow\n");
+ return 0;
+ }
+
+ return count;
+}
+
+static void snp_inc_msg_seqno(struct snp_msg_desc *mdesc)
+{
+ /*
+ * The counter is also incremented by the PSP, so increment it by 2
+ * and save in secrets page.
+ */
+ *mdesc->os_area_msg_seqno += 2;
+}
+
+static int verify_and_dec_payload(struct snp_msg_desc *mdesc, struct snp_guest_req *req)
+{
+ struct snp_guest_msg *resp_msg = &mdesc->secret_response;
+ struct snp_guest_msg *req_msg = &mdesc->secret_request;
+ struct snp_guest_msg_hdr *req_msg_hdr = &req_msg->hdr;
+ struct snp_guest_msg_hdr *resp_msg_hdr = &resp_msg->hdr;
+ struct aesgcm_ctx *ctx = mdesc->ctx;
+ u8 iv[GCM_AES_IV_SIZE] = {};
+
+ pr_debug("response [seqno %lld type %d version %d sz %d]\n",
+ resp_msg_hdr->msg_seqno, resp_msg_hdr->msg_type, resp_msg_hdr->msg_version,
+ resp_msg_hdr->msg_sz);
+
+ /* Copy response from shared memory to encrypted memory. */
+ memcpy(resp_msg, mdesc->response, sizeof(*resp_msg));
+
+ /* Verify that the sequence counter is incremented by 1 */
+ if (unlikely(resp_msg_hdr->msg_seqno != (req_msg_hdr->msg_seqno + 1)))
+ return -EBADMSG;
+
+ /* Verify response message type and version number. */
+ if (resp_msg_hdr->msg_type != (req_msg_hdr->msg_type + 1) ||
+ resp_msg_hdr->msg_version != req_msg_hdr->msg_version)
+ return -EBADMSG;
+
+ /*
+ * If the message size is greater than our buffer length then return
+ * an error.
+ */
+ if (unlikely((resp_msg_hdr->msg_sz + ctx->authsize) > req->resp_sz))
+ return -EBADMSG;
+
+ /* Decrypt the payload */
+ memcpy(iv, &resp_msg_hdr->msg_seqno, min(sizeof(iv), sizeof(resp_msg_hdr->msg_seqno)));
+ if (!aesgcm_decrypt(ctx, req->resp_buf, resp_msg->payload, resp_msg_hdr->msg_sz,
+ &resp_msg_hdr->algo, AAD_LEN, iv, resp_msg_hdr->authtag))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static int enc_payload(struct snp_msg_desc *mdesc, u64 seqno, struct snp_guest_req *req)
+{
+ struct snp_guest_msg *msg = &mdesc->secret_request;
+ struct snp_guest_msg_hdr *hdr = &msg->hdr;
+ struct aesgcm_ctx *ctx = mdesc->ctx;
+ u8 iv[GCM_AES_IV_SIZE] = {};
+
+ memset(msg, 0, sizeof(*msg));
+
+ hdr->algo = SNP_AEAD_AES_256_GCM;
+ hdr->hdr_version = MSG_HDR_VER;
+ hdr->hdr_sz = sizeof(*hdr);
+ hdr->msg_type = req->msg_type;
+ hdr->msg_version = req->msg_version;
+ hdr->msg_seqno = seqno;
+ hdr->msg_vmpck = req->vmpck_id;
+ hdr->msg_sz = req->req_sz;
+
+ /* Verify the sequence number is non-zero */
+ if (!hdr->msg_seqno)
+ return -ENOSR;
+
+ pr_debug("request [seqno %lld type %d version %d sz %d]\n",
+ hdr->msg_seqno, hdr->msg_type, hdr->msg_version, hdr->msg_sz);
+
+ if (WARN_ON((req->req_sz + ctx->authsize) > sizeof(msg->payload)))
+ return -EBADMSG;
+
+ memcpy(iv, &hdr->msg_seqno, min(sizeof(iv), sizeof(hdr->msg_seqno)));
+ aesgcm_encrypt(ctx, msg->payload, req->req_buf, req->req_sz, &hdr->algo,
+ AAD_LEN, iv, hdr->authtag);
+
+ return 0;
+}
+
+static int __handle_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
+ struct snp_guest_request_ioctl *rio)
+{
+ unsigned long req_start = jiffies;
+ unsigned int override_npages = 0;
+ u64 override_err = 0;
+ int rc;
+
+retry_request:
+ /*
+ * Call firmware to process the request. In this function the encrypted
+ * message enters shared memory with the host. So after this call the
+ * sequence number must be incremented or the VMPCK must be deleted to
+ * prevent reuse of the IV.
+ */
+ rc = snp_issue_guest_request(req, &mdesc->input, rio);
+ switch (rc) {
+ case -ENOSPC:
+ /*
+ * If the extended guest request fails due to having too
+ * small of a certificate data buffer, retry the same
+ * guest request without the extended data request in
+ * order to increment the sequence number and thus avoid
+ * IV reuse.
+ */
+ override_npages = mdesc->input.data_npages;
+ req->exit_code = SVM_VMGEXIT_GUEST_REQUEST;
+
+ /*
+ * Override the error to inform callers the given extended
+ * request buffer size was too small and give the caller the
+ * required buffer size.
+ */
+ override_err = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN);
+
+ /*
+ * If this call to the firmware succeeds, the sequence number can
+ * be incremented allowing for continued use of the VMPCK. If
+ * there is an error reflected in the return value, this value
+ * is checked further down and the result will be the deletion
+ * of the VMPCK and the error code being propagated back to the
+ * user as an ioctl() return code.
+ */
+ goto retry_request;
+
+ /*
+ * The host may return SNP_GUEST_VMM_ERR_BUSY if the request has been
+ * throttled. Retry in the driver to avoid returning and reusing the
+ * message sequence number on a different message.
+ */
+ case -EAGAIN:
+ if (jiffies - req_start > SNP_REQ_MAX_RETRY_DURATION) {
+ rc = -ETIMEDOUT;
+ break;
+ }
+ schedule_timeout_killable(SNP_REQ_RETRY_DELAY);
+ goto retry_request;
+ }
+
+ /*
+ * Increment the message sequence number. There is no harm in doing
+ * this now because decryption uses the value stored in the response
+ * structure and any failure will wipe the VMPCK, preventing further
+ * use anyway.
+ */
+ snp_inc_msg_seqno(mdesc);
+
+ if (override_err) {
+ rio->exitinfo2 = override_err;
+
+ /*
+ * If an extended guest request was issued and the supplied certificate
+ * buffer was not large enough, a standard guest request was issued to
+ * prevent IV reuse. If the standard request was successful, return -EIO
+ * back to the caller as would have originally been returned.
+ */
+ if (!rc && override_err == SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN))
+ rc = -EIO;
+ }
+
+ if (override_npages)
+ mdesc->input.data_npages = override_npages;
+
+ return rc;
+}
+
+int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
+ struct snp_guest_request_ioctl *rio)
+{
+ u64 seqno;
+ int rc;
+
+ guard(mutex)(&snp_cmd_mutex);
+
+ /* Check if the VMPCK is not empty */
+ if (!mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
+ pr_err_ratelimited("VMPCK is disabled\n");
+ return -ENOTTY;
+ }
+
+ /* Get message sequence and verify that its a non-zero */
+ seqno = snp_get_msg_seqno(mdesc);
+ if (!seqno)
+ return -EIO;
+
+ /* Clear shared memory's response for the host to populate. */
+ memset(mdesc->response, 0, sizeof(struct snp_guest_msg));
+
+ /* Encrypt the userspace provided payload in mdesc->secret_request. */
+ rc = enc_payload(mdesc, seqno, req);
+ if (rc)
+ return rc;
+
+ /*
+ * Write the fully encrypted request to the shared unencrypted
+ * request page.
+ */
+ memcpy(mdesc->request, &mdesc->secret_request, sizeof(mdesc->secret_request));
+
+ rc = __handle_guest_request(mdesc, req, rio);
+ if (rc) {
+ if (rc == -EIO &&
+ rio->exitinfo2 == SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN))
+ return rc;
+
+ pr_alert("Detected error from ASP request. rc: %d, exitinfo2: 0x%llx\n",
+ rc, rio->exitinfo2);
+
+ snp_disable_vmpck(mdesc);
+ return rc;
+ }
+
+ rc = verify_and_dec_payload(mdesc, req);
+ if (rc) {
+ pr_alert("Detected unexpected decode failure from ASP. rc: %d\n", rc);
+ snp_disable_vmpck(mdesc);
+ return rc;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(snp_send_guest_request);
diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
index d0f7233b1430..264b6523fe52 100644
--- a/drivers/virt/coco/sev-guest/sev-guest.c
+++ b/drivers/virt/coco/sev-guest/sev-guest.c
@@ -31,9 +31,6 @@
#define DEVICE_NAME "sev-guest"
-#define SNP_REQ_MAX_RETRY_DURATION (60*HZ)
-#define SNP_REQ_RETRY_DELAY (2*HZ)
-
#define SVSM_MAX_RETRIES 3
struct snp_guest_dev {
@@ -60,76 +57,6 @@ static int vmpck_id = -1;
module_param(vmpck_id, int, 0444);
MODULE_PARM_DESC(vmpck_id, "The VMPCK ID to use when communicating with the PSP.");
-/* Mutex to serialize the shared buffer access and command handling. */
-static DEFINE_MUTEX(snp_cmd_mutex);
-
-/*
- * If an error is received from the host or AMD Secure Processor (ASP) there
- * are two options. Either retry the exact same encrypted request or discontinue
- * using the VMPCK.
- *
- * This is because in the current encryption scheme GHCB v2 uses AES-GCM to
- * encrypt the requests. The IV for this scheme is the sequence number. GCM
- * cannot tolerate IV reuse.
- *
- * The ASP FW v1.51 only increments the sequence numbers on a successful
- * guest<->ASP back and forth and only accepts messages at its exact sequence
- * number.
- *
- * So if the sequence number were to be reused the encryption scheme is
- * vulnerable. If the sequence number were incremented for a fresh IV the ASP
- * will reject the request.
- */
-static void snp_disable_vmpck(struct snp_msg_desc *mdesc)
-{
- pr_alert("Disabling VMPCK%d communication key to prevent IV reuse.\n",
- mdesc->vmpck_id);
- memzero_explicit(mdesc->vmpck, VMPCK_KEY_LEN);
- mdesc->vmpck = NULL;
-}
-
-static inline u64 __snp_get_msg_seqno(struct snp_msg_desc *mdesc)
-{
- u64 count;
-
- lockdep_assert_held(&snp_cmd_mutex);
-
- /* Read the current message sequence counter from secrets pages */
- count = *mdesc->os_area_msg_seqno;
-
- return count + 1;
-}
-
-/* Return a non-zero on success */
-static u64 snp_get_msg_seqno(struct snp_msg_desc *mdesc)
-{
- u64 count = __snp_get_msg_seqno(mdesc);
-
- /*
- * The message sequence counter for the SNP guest request is a 64-bit
- * value but the version 2 of GHCB specification defines a 32-bit storage
- * for it. If the counter exceeds the 32-bit value then return zero.
- * The caller should check the return value, but if the caller happens to
- * not check the value and use it, then the firmware treats zero as an
- * invalid number and will fail the message request.
- */
- if (count >= UINT_MAX) {
- pr_err("request message sequence counter overflow\n");
- return 0;
- }
-
- return count;
-}
-
-static void snp_inc_msg_seqno(struct snp_msg_desc *mdesc)
-{
- /*
- * The counter is also incremented by the PSP, so increment it by 2
- * and save in secrets page.
- */
- *mdesc->os_area_msg_seqno += 2;
-}
-
static inline struct snp_guest_dev *to_snp_dev(struct file *file)
{
struct miscdevice *dev = file->private_data;
@@ -137,225 +64,6 @@ static inline struct snp_guest_dev *to_snp_dev(struct file *file)
return container_of(dev, struct snp_guest_dev, misc);
}
-static int verify_and_dec_payload(struct snp_msg_desc *mdesc, struct snp_guest_req *req)
-{
- struct snp_guest_msg *resp_msg = &mdesc->secret_response;
- struct snp_guest_msg *req_msg = &mdesc->secret_request;
- struct snp_guest_msg_hdr *req_msg_hdr = &req_msg->hdr;
- struct snp_guest_msg_hdr *resp_msg_hdr = &resp_msg->hdr;
- struct aesgcm_ctx *ctx = mdesc->ctx;
- u8 iv[GCM_AES_IV_SIZE] = {};
-
- pr_debug("response [seqno %lld type %d version %d sz %d]\n",
- resp_msg_hdr->msg_seqno, resp_msg_hdr->msg_type, resp_msg_hdr->msg_version,
- resp_msg_hdr->msg_sz);
-
- /* Copy response from shared memory to encrypted memory. */
- memcpy(resp_msg, mdesc->response, sizeof(*resp_msg));
-
- /* Verify that the sequence counter is incremented by 1 */
- if (unlikely(resp_msg_hdr->msg_seqno != (req_msg_hdr->msg_seqno + 1)))
- return -EBADMSG;
-
- /* Verify response message type and version number. */
- if (resp_msg_hdr->msg_type != (req_msg_hdr->msg_type + 1) ||
- resp_msg_hdr->msg_version != req_msg_hdr->msg_version)
- return -EBADMSG;
-
- /*
- * If the message size is greater than our buffer length then return
- * an error.
- */
- if (unlikely((resp_msg_hdr->msg_sz + ctx->authsize) > req->resp_sz))
- return -EBADMSG;
-
- /* Decrypt the payload */
- memcpy(iv, &resp_msg_hdr->msg_seqno, min(sizeof(iv), sizeof(resp_msg_hdr->msg_seqno)));
- if (!aesgcm_decrypt(ctx, req->resp_buf, resp_msg->payload, resp_msg_hdr->msg_sz,
- &resp_msg_hdr->algo, AAD_LEN, iv, resp_msg_hdr->authtag))
- return -EBADMSG;
-
- return 0;
-}
-
-static int enc_payload(struct snp_msg_desc *mdesc, u64 seqno, struct snp_guest_req *req)
-{
- struct snp_guest_msg *msg = &mdesc->secret_request;
- struct snp_guest_msg_hdr *hdr = &msg->hdr;
- struct aesgcm_ctx *ctx = mdesc->ctx;
- u8 iv[GCM_AES_IV_SIZE] = {};
-
- memset(msg, 0, sizeof(*msg));
-
- hdr->algo = SNP_AEAD_AES_256_GCM;
- hdr->hdr_version = MSG_HDR_VER;
- hdr->hdr_sz = sizeof(*hdr);
- hdr->msg_type = req->msg_type;
- hdr->msg_version = req->msg_version;
- hdr->msg_seqno = seqno;
- hdr->msg_vmpck = req->vmpck_id;
- hdr->msg_sz = req->req_sz;
-
- /* Verify the sequence number is non-zero */
- if (!hdr->msg_seqno)
- return -ENOSR;
-
- pr_debug("request [seqno %lld type %d version %d sz %d]\n",
- hdr->msg_seqno, hdr->msg_type, hdr->msg_version, hdr->msg_sz);
-
- if (WARN_ON((req->req_sz + ctx->authsize) > sizeof(msg->payload)))
- return -EBADMSG;
-
- memcpy(iv, &hdr->msg_seqno, min(sizeof(iv), sizeof(hdr->msg_seqno)));
- aesgcm_encrypt(ctx, msg->payload, req->req_buf, req->req_sz, &hdr->algo,
- AAD_LEN, iv, hdr->authtag);
-
- return 0;
-}
-
-static int __handle_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
- struct snp_guest_request_ioctl *rio)
-{
- unsigned long req_start = jiffies;
- unsigned int override_npages = 0;
- u64 override_err = 0;
- int rc;
-
-retry_request:
- /*
- * Call firmware to process the request. In this function the encrypted
- * message enters shared memory with the host. So after this call the
- * sequence number must be incremented or the VMPCK must be deleted to
- * prevent reuse of the IV.
- */
- rc = snp_issue_guest_request(req, &mdesc->input, rio);
- switch (rc) {
- case -ENOSPC:
- /*
- * If the extended guest request fails due to having too
- * small of a certificate data buffer, retry the same
- * guest request without the extended data request in
- * order to increment the sequence number and thus avoid
- * IV reuse.
- */
- override_npages = mdesc->input.data_npages;
- req->exit_code = SVM_VMGEXIT_GUEST_REQUEST;
-
- /*
- * Override the error to inform callers the given extended
- * request buffer size was too small and give the caller the
- * required buffer size.
- */
- override_err = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN);
-
- /*
- * If this call to the firmware succeeds, the sequence number can
- * be incremented allowing for continued use of the VMPCK. If
- * there is an error reflected in the return value, this value
- * is checked further down and the result will be the deletion
- * of the VMPCK and the error code being propagated back to the
- * user as an ioctl() return code.
- */
- goto retry_request;
-
- /*
- * The host may return SNP_GUEST_VMM_ERR_BUSY if the request has been
- * throttled. Retry in the driver to avoid returning and reusing the
- * message sequence number on a different message.
- */
- case -EAGAIN:
- if (jiffies - req_start > SNP_REQ_MAX_RETRY_DURATION) {
- rc = -ETIMEDOUT;
- break;
- }
- schedule_timeout_killable(SNP_REQ_RETRY_DELAY);
- goto retry_request;
- }
-
- /*
- * Increment the message sequence number. There is no harm in doing
- * this now because decryption uses the value stored in the response
- * structure and any failure will wipe the VMPCK, preventing further
- * use anyway.
- */
- snp_inc_msg_seqno(mdesc);
-
- if (override_err) {
- rio->exitinfo2 = override_err;
-
- /*
- * If an extended guest request was issued and the supplied certificate
- * buffer was not large enough, a standard guest request was issued to
- * prevent IV reuse. If the standard request was successful, return -EIO
- * back to the caller as would have originally been returned.
- */
- if (!rc && override_err == SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN))
- rc = -EIO;
- }
-
- if (override_npages)
- mdesc->input.data_npages = override_npages;
-
- return rc;
-}
-
-static int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
- struct snp_guest_request_ioctl *rio)
-{
- u64 seqno;
- int rc;
-
- guard(mutex)(&snp_cmd_mutex);
-
- /* Check if the VMPCK is not empty */
- if (!mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
- pr_err_ratelimited("VMPCK is disabled\n");
- return -ENOTTY;
- }
-
- /* Get message sequence and verify that its a non-zero */
- seqno = snp_get_msg_seqno(mdesc);
- if (!seqno)
- return -EIO;
-
- /* Clear shared memory's response for the host to populate. */
- memset(mdesc->response, 0, sizeof(struct snp_guest_msg));
-
- /* Encrypt the userspace provided payload in mdesc->secret_request. */
- rc = enc_payload(mdesc, seqno, req);
- if (rc)
- return rc;
-
- /*
- * Write the fully encrypted request to the shared unencrypted
- * request page.
- */
- memcpy(mdesc->request, &mdesc->secret_request,
- sizeof(mdesc->secret_request));
-
- rc = __handle_guest_request(mdesc, req, rio);
- if (rc) {
- if (rc == -EIO &&
- rio->exitinfo2 == SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN))
- return rc;
-
- pr_alert("Detected error from ASP request. rc: %d, exitinfo2: 0x%llx\n",
- rc, rio->exitinfo2);
-
- snp_disable_vmpck(mdesc);
- return rc;
- }
-
- rc = verify_and_dec_payload(mdesc, req);
- if (rc) {
- pr_alert("Detected unexpected decode failure from ASP. rc: %d\n", rc);
- snp_disable_vmpck(mdesc);
- return rc;
- }
-
- return 0;
-}
-
struct snp_req_resp {
sockptr_t req_data;
sockptr_t resp_data;
--
2.34.1
* [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (3 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 04/13] x86/sev: Relocate SNP guest messaging routines to common code Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 10:42 ` Borislav Petkov
2025-01-07 19:46 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests Nikunj A Dadhania
` (7 subsequent siblings)
12 siblings, 2 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Add support for Secure TSC in SNP-enabled guests. Secure TSC allows guests
to securely use RDTSC/RDTSCP instructions, ensuring that the parameters
used cannot be altered by the hypervisor once the guest is launched.
Secure TSC-enabled guests need to query TSC information from the AMD
Security Processor. This communication channel is encrypted between the AMD
Security Processor and the guest, with the hypervisor acting merely as a
conduit to deliver the guest messages to the AMD Security Processor. Each
message is protected with AEAD (AES-256 GCM).
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/include/asm/sev-common.h | 1 +
arch/x86/include/asm/sev.h | 21 ++++++
arch/x86/include/asm/svm.h | 6 +-
include/linux/cc_platform.h | 8 +++
arch/x86/coco/core.c | 4 ++
arch/x86/coco/sev/core.c | 107 ++++++++++++++++++++++++++++++
arch/x86/mm/mem_encrypt.c | 2 +
7 files changed, 147 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 50f5666938c0..6ef92432a5ce 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -206,6 +206,7 @@ struct snp_psc_desc {
#define GHCB_TERM_NO_SVSM 7 /* SVSM is not advertised in the secrets page */
#define GHCB_TERM_SVSM_VMPL0 8 /* SVSM is present but has set VMPL to 0 */
#define GHCB_TERM_SVSM_CAA 9 /* SVSM is present but CAA is not page aligned */
+#define GHCB_TERM_SECURE_TSC 10 /* Secure TSC initialization failed */
#define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0937ac7a96db..bdcdaac4df1c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -146,6 +146,9 @@ enum msg_type {
SNP_MSG_VMRK_REQ,
SNP_MSG_VMRK_RSP,
+ SNP_MSG_TSC_INFO_REQ = 17,
+ SNP_MSG_TSC_INFO_RSP,
+
SNP_MSG_TYPE_MAX
};
@@ -174,6 +177,21 @@ struct snp_guest_msg {
u8 payload[PAGE_SIZE - sizeof(struct snp_guest_msg_hdr)];
} __packed;
+#define SNP_TSC_INFO_REQ_SZ 128
+
+struct snp_tsc_info_req {
+ u8 rsvd[SNP_TSC_INFO_REQ_SZ];
+} __packed;
+
+struct snp_tsc_info_resp {
+ u32 status;
+ u32 rsvd1;
+ u64 tsc_scale;
+ u64 tsc_offset;
+ u32 tsc_factor;
+ u8 rsvd2[100];
+} __packed;
+
struct snp_guest_req {
void *req_buf;
size_t req_sz;
@@ -463,6 +481,8 @@ void snp_msg_free(struct snp_msg_desc *mdesc);
int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
struct snp_guest_request_ioctl *rio);
+void __init snp_secure_tsc_prepare(void);
+
#else /* !CONFIG_AMD_MEM_ENCRYPT */
#define snp_vmpl 0
@@ -503,6 +523,7 @@ static inline struct snp_msg_desc *snp_msg_alloc(void) { return NULL; }
static inline void snp_msg_free(struct snp_msg_desc *mdesc) { }
static inline int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
struct snp_guest_request_ioctl *rio) { return -ENODEV; }
+static inline void __init snp_secure_tsc_prepare(void) { }
#endif /* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 2b59b9951c90..92e18798f197 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -417,7 +417,9 @@ struct sev_es_save_area {
u8 reserved_0x298[80];
u32 pkru;
u32 tsc_aux;
- u8 reserved_0x2f0[24];
+ u64 tsc_scale;
+ u64 tsc_offset;
+ u8 reserved_0x300[8];
u64 rcx;
u64 rdx;
u64 rbx;
@@ -564,7 +566,7 @@ static inline void __unused_size_checks(void)
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x1c0);
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x248);
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x298);
- BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x2f0);
+ BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x300);
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x320);
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x380);
BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x3f0);
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index caa4b4430634..0bf7d33a1048 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -81,6 +81,14 @@ enum cc_attr {
*/
CC_ATTR_GUEST_SEV_SNP,
+ /**
+ * @CC_ATTR_GUEST_SNP_SECURE_TSC: SNP Secure TSC is active.
+ *
+ * The platform/OS is running as a guest/virtual machine and actively
+ * using AMD SEV-SNP Secure TSC feature.
+ */
+ CC_ATTR_GUEST_SNP_SECURE_TSC,
+
/**
* @CC_ATTR_HOST_SEV_SNP: AMD SNP enabled on the host.
*
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 0f81f70aca82..715c2c09582f 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -97,6 +97,10 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
case CC_ATTR_GUEST_SEV_SNP:
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
+ case CC_ATTR_GUEST_SNP_SECURE_TSC:
+ return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
+ (sev_status & MSR_AMD64_SNP_SECURE_TSC);
+
case CC_ATTR_HOST_SEV_SNP:
return cc_flags.host_sev_snp;
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index feb145df6bf7..00a0ac3baab7 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -96,6 +96,14 @@ static u64 sev_hv_features __ro_after_init;
/* Secrets page physical address from the CC blob */
static u64 secrets_pa __ro_after_init;
+/*
+ * For Secure TSC guests, the BSP fetches TSC_INFO using SNP guest messaging and
+ * initializes snp_tsc_scale and snp_tsc_offset. These values are replicated
+ * across the APs VMSA fields (TSC_SCALE and TSC_OFFSET).
+ */
+static u64 snp_tsc_scale __ro_after_init;
+static u64 snp_tsc_offset __ro_after_init;
+
/* #VC handler runtime per-CPU data */
struct sev_es_runtime_data {
struct ghcb ghcb_page;
@@ -1272,6 +1280,12 @@ static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip)
vmsa->vmpl = snp_vmpl;
vmsa->sev_features = sev_status >> 2;
+ /* Populate AP's TSC scale/offset to get accurate TSC values. */
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+ vmsa->tsc_scale = snp_tsc_scale;
+ vmsa->tsc_offset = snp_tsc_offset;
+ }
+
/* Switch the page over to a VMSA page now that it is initialized */
ret = snp_set_vmsa(vmsa, caa, apic_id, true);
if (ret) {
@@ -3121,3 +3135,96 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
return 0;
}
EXPORT_SYMBOL_GPL(snp_send_guest_request);
+
+static int __init snp_get_tsc_info(void)
+{
+ struct snp_guest_request_ioctl *rio;
+ struct snp_tsc_info_resp *tsc_resp;
+ struct snp_tsc_info_req *tsc_req;
+ struct snp_msg_desc *mdesc;
+ struct snp_guest_req *req;
+ int rc = -ENOMEM;
+
+ tsc_req = kzalloc(sizeof(*tsc_req), GFP_KERNEL);
+ if (!tsc_req)
+ return rc;
+
+ /*
+ * The intermediate response buffer is used while decrypting the
+ * response payload. Make sure that it has enough space to cover
+ * the authtag.
+ */
+ tsc_resp = kzalloc(sizeof(*tsc_resp) + AUTHTAG_LEN, GFP_KERNEL);
+ if (!tsc_resp)
+ goto e_free_tsc_req;
+
+ req = kzalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ goto e_free_tsc_resp;
+
+ rio = kzalloc(sizeof(*rio), GFP_KERNEL);
+ if (!rio)
+ goto e_free_req;
+
+ mdesc = snp_msg_alloc();
+ if (IS_ERR_OR_NULL(mdesc))
+ goto e_free_rio;
+
+ rc = snp_msg_init(mdesc, snp_vmpl);
+ if (rc)
+ goto e_free_mdesc;
+
+ req->msg_version = MSG_HDR_VER;
+ req->msg_type = SNP_MSG_TSC_INFO_REQ;
+ req->vmpck_id = snp_vmpl;
+ req->req_buf = tsc_req;
+ req->req_sz = sizeof(*tsc_req);
+ req->resp_buf = (void *)tsc_resp;
+ req->resp_sz = sizeof(*tsc_resp) + AUTHTAG_LEN;
+ req->exit_code = SVM_VMGEXIT_GUEST_REQUEST;
+
+ rc = snp_send_guest_request(mdesc, req, rio);
+ if (rc)
+ goto e_request;
+
+ pr_debug("%s: response status 0x%x scale 0x%llx offset 0x%llx factor 0x%x\n",
+ __func__, tsc_resp->status, tsc_resp->tsc_scale, tsc_resp->tsc_offset,
+ tsc_resp->tsc_factor);
+
+ if (!tsc_resp->status) {
+ snp_tsc_scale = tsc_resp->tsc_scale;
+ snp_tsc_offset = tsc_resp->tsc_offset;
+ } else {
+ pr_err("Failed to get TSC info, response status 0x%x\n", tsc_resp->status);
+ rc = -EIO;
+ }
+
+e_request:
+ /* The response buffer contains sensitive data, explicitly clear it. */
+ memzero_explicit(tsc_resp, sizeof(*tsc_resp) + AUTHTAG_LEN);
+e_free_mdesc:
+ snp_msg_free(mdesc);
+e_free_rio:
+ kfree(rio);
+e_free_req:
+ kfree(req);
+ e_free_tsc_resp:
+ kfree(tsc_resp);
+e_free_tsc_req:
+ kfree(tsc_req);
+
+ return rc;
+}
+
+void __init snp_secure_tsc_prepare(void)
+{
+ if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ return;
+
+ if (snp_get_tsc_info()) {
+ pr_alert("Unable to retrieve Secure TSC info from ASP\n");
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SECURE_TSC);
+ }
+
+ pr_debug("SecureTSC enabled");
+}
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 0a120d85d7bb..95bae74fdab2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -94,6 +94,8 @@ void __init mem_encrypt_init(void)
/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
swiotlb_update_mem_attributes();
+ snp_secure_tsc_prepare();
+
print_mem_encrypt_feature_info();
}
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (4 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 20:09 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 07/13] x86/sev: Prevent GUEST_TSC_FREQ MSR interception " Nikunj A Dadhania
` (6 subsequent siblings)
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Secure TSC enabled guests should not write to the MSR_IA32_TSC(10H)
register, as subsequent TSC value reads are undefined. For AMD platforms,
MSR_IA32_TSC is intercepted by the hypervisor. MSR_IA32_TSC read/write
accesses should not exit to the hypervisor for such guests.
Accesses to MSR_IA32_TSC need special handling in the #VC handler for the
guests with Secure TSC enabled. Writes to MSR_IA32_TSC should be ignored
and flagged once with a warning, and reads of MSR_IA32_TSC should return
the result of the RDTSC instruction.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/coco/sev/core.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 00a0ac3baab7..f49d3e97e170 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -1428,6 +1428,34 @@ static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
return ES_OK;
}
+/*
+ * TSC related accesses should not exit to the hypervisor when a guest is
+ * executing with Secure TSC enabled, so special handling is required for
+ * accesses of MSR_IA32_TSC.
+ */
+static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
+{
+ u64 tsc;
+
+ /*
+ * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
+ * to return undefined values, so ignore all writes.
+ *
+ * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
+ * the value returned by rdtsc_ordered().
+ */
+ if (write) {
+ WARN_ONCE(1, "TSC MSR writes are verboten!\n");
+ return ES_OK;
+ }
+
+ tsc = rdtsc_ordered();
+ regs->ax = lower_32_bits(tsc);
+ regs->dx = upper_32_bits(tsc);
+
+ return ES_OK;
+}
+
static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
struct pt_regs *regs = ctxt->regs;
@@ -1437,8 +1465,17 @@ static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
/* Is it a WRMSR? */
write = ctxt->insn.opcode.bytes[1] == 0x30;
- if (regs->cx == MSR_SVSM_CAA)
+ switch (regs->cx) {
+ case MSR_SVSM_CAA:
return __vc_handle_msr_caa(regs, write);
+ case MSR_IA32_TSC:
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return __vc_handle_secure_tsc_msrs(regs, write);
+ else
+ break;
+ default:
+ break;
+ }
ghcb_set_rcx(ghcb, regs->cx);
if (write) {
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 07/13] x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (5 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 08/13] x86/sev: Prevent RDTSC/RDTSCP " Nikunj A Dadhania
` (5 subsequent siblings)
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
The hypervisor should not be intercepting GUEST_TSC_FREQ MSR (0xc0010134)
when Secure TSC is enabled. A #VC exception will be generated if the
GUEST_TSC_FREQ MSR is being intercepted. If this should occur and Secure
TSC is enabled, terminate guest execution.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/coco/sev/core.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3f3e2bc99162..9a71880eec07 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -608,6 +608,7 @@
#define MSR_AMD_PERF_CTL 0xc0010062
#define MSR_AMD_PERF_STATUS 0xc0010063
#define MSR_AMD_PSTATE_DEF_BASE 0xc0010064
+#define MSR_AMD64_GUEST_TSC_FREQ 0xc0010134
#define MSR_AMD64_OSVW_ID_LENGTH 0xc0010140
#define MSR_AMD64_OSVW_STATUS 0xc0010141
#define MSR_AMD_PPIN_CTL 0xc00102f0
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index f49d3e97e170..dbf4531c6271 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -1431,12 +1431,19 @@ static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
/*
* TSC related accesses should not exit to the hypervisor when a guest is
* executing with Secure TSC enabled, so special handling is required for
- * accesses of MSR_IA32_TSC.
+ * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
*/
static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
{
u64 tsc;
+ /*
+ * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
+ * Terminate the SNP guest when the interception is enabled.
+ */
+ if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
+ return ES_VMM_ERROR;
+
/*
* Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
* to return undefined values, so ignore all writes.
@@ -1469,6 +1476,7 @@ static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
case MSR_SVSM_CAA:
return __vc_handle_msr_caa(regs, write);
case MSR_IA32_TSC:
+ case MSR_AMD64_GUEST_TSC_FREQ:
if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
return __vc_handle_secure_tsc_msrs(regs, write);
else
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 08/13] x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (6 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 07/13] x86/sev: Prevent GUEST_TSC_FREQ MSR interception " Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 09/13] x86/sev: Mark Secure TSC as reliable clocksource Nikunj A Dadhania
` (4 subsequent siblings)
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
The hypervisor should not be intercepting RDTSC/RDTSCP when Secure TSC is
enabled. A #VC exception will be generated if the RDTSC/RDTSCP instructions
are being intercepted. If this should occur and Secure TSC is enabled,
guest execution should be terminated as the guest cannot rely on the TSC
value provided by the hypervisor.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/coco/sev/shared.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/coco/sev/shared.c b/arch/x86/coco/sev/shared.c
index 96023bd978cc..2e4122f8aa6b 100644
--- a/arch/x86/coco/sev/shared.c
+++ b/arch/x86/coco/sev/shared.c
@@ -1141,6 +1141,16 @@ static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
enum es_result ret;
+ /*
+ * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
+ * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
+ * instructions are being intercepted. If this should occur and Secure
+ * TSC is enabled, guest execution should be terminated as the guest
+ * cannot rely on the TSC value provided by the hypervisor.
+ */
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return ES_VMM_ERROR;
+
ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
if (ret != ES_OK)
return ret;
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 09/13] x86/sev: Mark Secure TSC as reliable clocksource
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (7 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 08/13] x86/sev: Prevent RDTSC/RDTSCP " Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock Nikunj A Dadhania
` (3 subsequent siblings)
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
In an SNP guest environment with Secure TSC enabled, unlike other clock
sources (such as HPET, ACPI timer, APIC, etc.), the RDTSC instruction is
handled without causing a VM exit, resulting in minimal overhead and
jitter. Even when the host CPU's TSC is tampered with, a Secure TSC
enabled guest keeps ticking forward. Hence, mark the TSC as a reliable
clocksource for such guests, bypassing unstable calibration.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/mm/mem_encrypt_amd.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 774f9677458f..fa0bc52ef707 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -541,6 +541,10 @@ void __init sme_early_init(void)
* kernel mapped.
*/
snp_update_svsm_ca();
+
+ /* Mark the TSC as reliable when Secure TSC is enabled */
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}
void __init mem_encrypt_free_decrypted_mem(void)
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (8 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 09/13] x86/sev: Mark Secure TSC as reliable clocksource Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 15:16 ` Borislav Petkov
2025-01-06 12:46 ` [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests Nikunj A Dadhania
` (2 subsequent siblings)
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Although the kernel switches over to a stable TSC clocksource instead of
kvm-clock, TSC frequency calibration still relies on the kvm-clock based
frequency calibration. This is due to kvmclock_init() unconditionally
updating the x86_platform's CPU and TSC callbacks.
For Secure TSC enabled guests, use the GUEST_TSC_FREQ MSR to discover the
TSC frequency instead of relying on kvm-clock based frequency calibration.
Override both the CPU and TSC frequency calibration callbacks with
securetsc_get_tsc_khz(). Since the distinction between the CPU base
frequency and the TSC frequency does not apply in this case, the same
callback is used for both.
Additionally, warn users when kvm-clock is selected as the clocksource for
Secure TSC enabled guests. Users can change the clocksource to kvm-clock
using the sysfs interface while running on a Secure TSC enabled guest.
Switching to the hypervisor-controlled kvm-clock can lead to potential
security issues.
Taint the kernel and issue a warning to the user when the clocksource
switches to kvm-clock, ensuring they are aware of the change and its
implications.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/include/asm/sev.h | 2 ++
arch/x86/coco/sev/core.c | 21 +++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 11 +++++++++++
arch/x86/kernel/tsc.c | 4 ++++
4 files changed, 38 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index bdcdaac4df1c..5d9685f92e5c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -482,6 +482,7 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
struct snp_guest_request_ioctl *rio);
void __init snp_secure_tsc_prepare(void);
+void __init snp_secure_tsc_init(void);
#else /* !CONFIG_AMD_MEM_ENCRYPT */
@@ -524,6 +525,7 @@ static inline void snp_msg_free(struct snp_msg_desc *mdesc) { }
static inline int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
struct snp_guest_request_ioctl *rio) { return -ENODEV; }
static inline void __init snp_secure_tsc_prepare(void) { }
+static inline void __init snp_secure_tsc_init(void) { }
#endif /* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index dbf4531c6271..9c971637e56b 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -103,6 +103,7 @@ static u64 secrets_pa __ro_after_init;
*/
static u64 snp_tsc_scale __ro_after_init;
static u64 snp_tsc_offset __ro_after_init;
+static u64 snp_tsc_freq_khz __ro_after_init;
/* #VC handler runtime per-CPU data */
struct sev_es_runtime_data {
@@ -3273,3 +3274,23 @@ void __init snp_secure_tsc_prepare(void)
pr_debug("SecureTSC enabled");
}
+
+static unsigned long securetsc_get_tsc_khz(void)
+{
+ return snp_tsc_freq_khz;
+}
+
+void __init snp_secure_tsc_init(void)
+{
+ unsigned long long tsc_freq_mhz;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
+ snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
+
+ x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
+ x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
+}
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..960260a8d884 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -21,6 +21,7 @@
#include <asm/hypervisor.h>
#include <asm/x86_init.h>
#include <asm/kvmclock.h>
+#include <asm/sev.h>
static int kvmclock __initdata = 1;
static int kvmclock_vsyscall __initdata = 1;
@@ -150,6 +151,16 @@ bool kvm_check_and_clear_guest_paused(void)
static int kvm_cs_enable(struct clocksource *cs)
{
+ /*
+ * TSC clocksource should be used for a guest with Secure TSC enabled,
+ * taint the kernel and warn when the user changes the clocksource to
+ * kvm-clock.
+ */
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+ WARN_ONCE(1, "For Secure TSC guest, changing the clocksource is not allowed!\n");
+ }
+
vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK);
return 0;
}
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a85594644e13..34dec0b72ea8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -31,6 +31,7 @@
#include <asm/i8259.h>
#include <asm/topology.h>
#include <asm/uv/uv.h>
+#include <asm/sev.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1514,6 +1515,9 @@ void __init tsc_early_init(void)
/* Don't change UV TSC multi-chassis synchronization */
if (is_early_uv_system())
return;
+
+ snp_secure_tsc_init();
+
if (!determine_cpu_tsc_frequencies(true))
return;
tsc_enable_sched_clock();
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (9 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 17:51 ` Borislav Petkov
2025-01-06 12:46 ` [PATCH v16 12/13] x86/tsc: Switch to native sched clock Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 13/13] x86/sev: Allow Secure TSC feature for SNP guests Nikunj A Dadhania
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl, Alexey Makhalov, Juergen Gross,
Boris Ostrovsky
Hypervisor platform setup (x86_hyper_init::init_platform) routines register
their own PV clock sources (KVM, Hyper-V, and Xen) at different clock
ratings, resulting in a PV clocksource being selected even when a stable TSC
clocksource is available. Upgrade the clock rating of the TSC early and
regular clocksources to prefer TSC over PV clock sources when the TSC is
invariant, non-stop, and stable.
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/kernel/tsc.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 34dec0b72ea8..88d8bfceea04 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -274,10 +274,29 @@ bool using_native_sched_clock(void)
{
return static_call_query(pv_sched_clock) == native_sched_clock;
}
+
+/*
+ * Upgrade the clock rating for TSC early and regular clocksource when the
+ * underlying platform provides non-stop, invariant, and stable TSC. TSC
+ * early/regular clocksource will be preferred over other PV clock sources.
+ */
+static void __init upgrade_clock_rating(struct clocksource *tsc_early,
+ struct clocksource *tsc)
+{
+ if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
+ cpu_feature_enabled(X86_FEATURE_CONSTANT_TSC) &&
+ cpu_feature_enabled(X86_FEATURE_NONSTOP_TSC) &&
+ !tsc_unstable) {
+ tsc_early->rating = 449;
+ tsc->rating = 450;
+ }
+}
#else
u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
bool using_native_sched_clock(void) { return true; }
+
+static void __init upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
#endif
notrace u64 sched_clock(void)
@@ -1564,6 +1583,8 @@ void __init tsc_init(void)
if (tsc_clocksource_reliable || no_tsc_watchdog)
tsc_disable_clocksource_watchdog();
+ upgrade_clock_rating(&clocksource_tsc_early, &clocksource_tsc);
+
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
detect_art();
}
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (10 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
2025-01-07 19:37 ` Borislav Petkov
2025-01-06 12:46 ` [PATCH v16 13/13] x86/sev: Allow Secure TSC feature for SNP guests Nikunj A Dadhania
12 siblings, 1 reply; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl, Alexey Makhalov, Juergen Gross,
Boris Ostrovsky
Although the kernel switches over to the stable TSC clocksource instead of
the PV clocksource, the scheduler still keeps on using the PV clock as the
sched clock source. This is because KVM, Xen and VMware switch the
paravirt sched clock handler in their init routines. Hyper-V is the
only PV clock source that checks whether the platform provides an invariant
TSC and does not switch to the PV sched clock.
When switching back to stable TSC, restore the scheduler clock to
native_sched_clock().
As the clock selection happens in the stop machine context, schedule
delayed work to update the static_call().
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/kernel/tsc.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 88d8bfceea04..fe7a0b1b7cfd 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -291,12 +291,26 @@ static void __init upgrade_clock_rating(struct clocksource *tsc_early,
tsc->rating = 450;
}
}
+
+static void enable_native_sc_work(struct work_struct *work)
+{
+ pr_info("Using native sched clock\n");
+ paravirt_set_sched_clock(native_sched_clock);
+}
+static DECLARE_DELAYED_WORK(enable_native_sc, enable_native_sc_work);
+
+static void enable_native_sched_clock(void)
+{
+ if (!using_native_sched_clock())
+ schedule_delayed_work(&enable_native_sc, 0);
+}
#else
u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
bool using_native_sched_clock(void) { return true; }
static void __init upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
+static void enable_native_sched_clock(void) { }
#endif
notrace u64 sched_clock(void)
@@ -1176,6 +1190,10 @@ static void tsc_cs_tick_stable(struct clocksource *cs)
static int tsc_cs_enable(struct clocksource *cs)
{
vclocks_set_used(VDSO_CLOCKMODE_TSC);
+
+ /* Restore native_sched_clock() when switching to TSC */
+ enable_native_sched_clock();
+
return 0;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v16 13/13] x86/sev: Allow Secure TSC feature for SNP guests
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
` (11 preceding siblings ...)
2025-01-06 12:46 ` [PATCH v16 12/13] x86/tsc: Switch to native sched clock Nikunj A Dadhania
@ 2025-01-06 12:46 ` Nikunj A Dadhania
12 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-06 12:46 UTC (permalink / raw)
To: linux-kernel, thomas.lendacky, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj,
francescolavra.fl
Now that all the required plumbing is done for enabling the SNP Secure TSC
feature, add Secure TSC to the SNP features present list.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
arch/x86/boot/compressed/sev.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index cd44e120fe53..bb55934c1cee 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -401,7 +401,8 @@ void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
* by the guest kernel. As and when a new feature is implemented in the
* guest kernel, a corresponding bit should be added to the mask.
*/
-#define SNP_FEATURES_PRESENT MSR_AMD64_SNP_DEBUG_SWAP
+#define SNP_FEATURES_PRESENT (MSR_AMD64_SNP_DEBUG_SWAP | \
+ MSR_AMD64_SNP_SECURE_TSC)
u64 snp_get_unsupported_features(u64 status)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-06 12:46 ` [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests Nikunj A Dadhania
@ 2025-01-07 10:42 ` Borislav Petkov
2025-01-07 11:43 ` Nikunj A. Dadhania
2025-01-07 19:46 ` Tom Lendacky
1 sibling, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 10:42 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On Mon, Jan 06, 2025 at 06:16:25PM +0530, Nikunj A Dadhania wrote:
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index 0f81f70aca82..715c2c09582f 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -97,6 +97,10 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
> case CC_ATTR_GUEST_SEV_SNP:
> return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
>
> + case CC_ATTR_GUEST_SNP_SECURE_TSC:
> + return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
This is new here?
> + (sev_status & MSR_AMD64_SNP_SECURE_TSC);
> +
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 10:42 ` Borislav Petkov
@ 2025-01-07 11:43 ` Nikunj A. Dadhania
2025-01-07 12:37 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-07 11:43 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On 1/7/2025 4:12 PM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:25PM +0530, Nikunj A Dadhania wrote:
>> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
>> index 0f81f70aca82..715c2c09582f 100644
>> --- a/arch/x86/coco/core.c
>> +++ b/arch/x86/coco/core.c
>> @@ -97,6 +97,10 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
>> case CC_ATTR_GUEST_SEV_SNP:
>> return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
>>
>> + case CC_ATTR_GUEST_SNP_SECURE_TSC:
>> + return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
>
> This is new here?
Yes, this was suggested by Tom here [1]
>> + (sev_status & MSR_AMD64_SNP_SECURE_TSC);
>> +
>
Regards
Nikunj
1. https://lore.kernel.org/all/bad7406d-4a0b-d871-cc02-3ffb9e9185ba@amd.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 11:43 ` Nikunj A. Dadhania
@ 2025-01-07 12:37 ` Borislav Petkov
2025-01-07 18:53 ` Tom Lendacky
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 12:37 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On Tue, Jan 07, 2025 at 05:13:03PM +0530, Nikunj A. Dadhania wrote:
> >> + case CC_ATTR_GUEST_SNP_SECURE_TSC:
> >> + return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
> >
> > This is new here?
>
> Yes, this was suggested by Tom here [1]
Either of you care to explain why this is needed?
> >> + (sev_status & MSR_AMD64_SNP_SECURE_TSC);
I would strongly assume that whatever sets MSR_AMD64_SNP_SECURE_TSC will have
checked/set MSR_AMD64_SEV_SNP_ENABLED already.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock
2025-01-06 12:46 ` [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock Nikunj A Dadhania
@ 2025-01-07 15:16 ` Borislav Petkov
2025-01-08 10:45 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 15:16 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On Mon, Jan 06, 2025 at 06:16:30PM +0530, Nikunj A Dadhania wrote:
> static int kvm_cs_enable(struct clocksource *cs)
> {
> + /*
> + * TSC clocksource should be used for a guest with Secure TSC enabled,
> + * taint the kernel and warn when the user changes the clocksource to
> + * kvm-clock.
> + */
> + if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> + WARN_ONCE(1, "For Secure TSC guest, changing the clocksource is not allowed!\n");
So this thing is trying to state that changing the clocksource is not allowed
but it still allows it. Why not simply do this:
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 960260a8d884..d8fef3a65a35 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -151,14 +151,10 @@ bool kvm_check_and_clear_guest_paused(void)
static int kvm_cs_enable(struct clocksource *cs)
{
- /*
- * TSC clocksource should be used for a guest with Secure TSC enabled,
- * taint the kernel and warn when the user changes the clocksource to
- * kvm-clock.
- */
+ /* Only the TSC should be used in a Secure TSC guest. */
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
- add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
- WARN_ONCE(1, "For Secure TSC guest, changing the clocksource is not allowed!\n");
+ WARN_ONCE(1, "Secure TSC guest, changing the clocksource is not allowed!\n");
+ return 1;
}
vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK);
?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests
2025-01-06 12:46 ` [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests Nikunj A Dadhania
@ 2025-01-07 17:51 ` Borislav Petkov
0 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 17:51 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On Mon, Jan 06, 2025 at 06:16:31PM +0530, Nikunj A Dadhania wrote:
> Hypervisor platform setup (x86_hyper_init::init_platform) routines register
> their own PV clock sources (KVM, HyperV, and Xen) at different clock
> ratings, resulting in PV clocksource being selected even when a stable TSC
> clocksource is available. Upgrade the clock rating of the TSC early and
> regular clocksource to prefer TSC over PV clock sources when TSC is
> invariant, non-stop, and stable
>
> Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
> arch/x86/kernel/tsc.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
This needs to make it perfectly clear that it is about virt and not in
general:
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e98b7e585c1c..3741d097d925 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -276,14 +276,16 @@ bool using_native_sched_clock(void)
/*
* Upgrade the clock rating for TSC early and regular clocksource when the
- * underlying platform provides non-stop, invariant, and stable TSC. TSC
+ * underlying guest is provided a non-stop, invariant, and stable TSC. TSC
* early/regular clocksource will be preferred over other PV clock sources.
*/
-static void __init upgrade_clock_rating(struct clocksource *tsc_early,
- struct clocksource *tsc)
+static void __init virt_upgrade_clock_rating(struct clocksource *tsc_early,
+ struct clocksource *tsc)
{
- if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
- cpu_feature_enabled(X86_FEATURE_CONSTANT_TSC) &&
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return;
+
+ if (cpu_feature_enabled(X86_FEATURE_CONSTANT_TSC) &&
cpu_feature_enabled(X86_FEATURE_NONSTOP_TSC) &&
!tsc_unstable) {
tsc_early->rating = 449;
@@ -295,7 +297,7 @@ u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
bool using_native_sched_clock(void) { return true; }
-static void __init upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
+static void __init virt_upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
#endif
notrace u64 sched_clock(void)
@@ -1584,7 +1586,7 @@ void __init tsc_init(void)
if (tsc_clocksource_reliable || no_tsc_watchdog)
tsc_disable_clocksource_watchdog();
- upgrade_clock_rating(&clocksource_tsc_early, &clocksource_tsc);
+ virt_upgrade_clock_rating(&clocksource_tsc_early, &clocksource_tsc);
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
detect_art();
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper
2025-01-06 12:46 ` [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper Nikunj A Dadhania
@ 2025-01-07 18:38 ` Tom Lendacky
0 siblings, 0 replies; 49+ messages in thread
From: Tom Lendacky @ 2025-01-07 18:38 UTC (permalink / raw)
To: Nikunj A Dadhania, linux-kernel, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini,
francescolavra.fl
On 1/6/25 06:46, Nikunj A Dadhania wrote:
> Remove the is_vmpck_empty() helper function, which uses a local array
> allocation to check if the VMPCK is empty. Replace it with memchr_inv() to
> directly determine if the VMPCK is empty without additional memory
> allocation.
>
> Suggested-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
> drivers/virt/coco/sev-guest/sev-guest.c | 14 ++------------
> 1 file changed, 2 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
> index b699771be029..62328d0b2cb6 100644
> --- a/drivers/virt/coco/sev-guest/sev-guest.c
> +++ b/drivers/virt/coco/sev-guest/sev-guest.c
> @@ -63,16 +63,6 @@ MODULE_PARM_DESC(vmpck_id, "The VMPCK ID to use when communicating with the PSP.
> /* Mutex to serialize the shared buffer access and command handling. */
> static DEFINE_MUTEX(snp_cmd_mutex);
>
> -static bool is_vmpck_empty(struct snp_msg_desc *mdesc)
> -{
> - char zero_key[VMPCK_KEY_LEN] = {0};
> -
> - if (mdesc->vmpck)
> - return !memcmp(mdesc->vmpck, zero_key, VMPCK_KEY_LEN);
> -
> - return true;
> -}
I still like the helper, but just using memchr_inv() inside it instead,
e.g.:
return !mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN);
But either way works for me.
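For completeness, the retained helper would then look roughly like this (a
sketch only, just wrapping the memchr_inv() one-liner above in the original
signature):
static bool is_vmpck_empty(struct snp_msg_desc *mdesc)
{
	return !mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN);
}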
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> -
> /*
> * If an error is received from the host or AMD Secure Processor (ASP) there
> * are two options. Either retry the exact same encrypted request or discontinue
> @@ -335,7 +325,7 @@ static int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_r
> guard(mutex)(&snp_cmd_mutex);
>
> /* Check if the VMPCK is not empty */
> - if (is_vmpck_empty(mdesc)) {
> + if (!mdesc->vmpck || !memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
> pr_err_ratelimited("VMPCK is disabled\n");
> return -ENOTTY;
> }
> @@ -1024,7 +1014,7 @@ static int __init sev_guest_probe(struct platform_device *pdev)
> }
>
> /* Verify that VMPCK is not zero. */
> - if (is_vmpck_empty(mdesc)) {
> + if (!memchr_inv(mdesc->vmpck, 0, VMPCK_KEY_LEN)) {
> dev_err(dev, "Empty VMPCK%d communication key\n", vmpck_id);
> goto e_unmap;
> }
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
2025-01-06 12:46 ` [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL Nikunj A Dadhania
@ 2025-01-07 18:40 ` Tom Lendacky
0 siblings, 0 replies; 49+ messages in thread
From: Tom Lendacky @ 2025-01-07 18:40 UTC (permalink / raw)
To: Nikunj A Dadhania, linux-kernel, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini,
francescolavra.fl
On 1/6/25 06:46, Nikunj A Dadhania wrote:
> Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL in the sev-guest driver code.
> GFP_KERNEL_ACCOUNT is typically used for accounting untrusted userspace
> allocations. After auditing the sev-guest code, the following changes are
> necessary:
>
> * snp_init_crypto(): Use GFP_KERNEL as this is a trusted device probe
> path.
>
> Retain GFP_KERNEL_ACCOUNT in the following cases for robustness and
> specific path requirements:
>
> * alloc_shared_pages(): Although all allocations are limited, retain
> GFP_KERNEL_ACCOUNT for future robustness.
>
> * get_report() and get_ext_report(): These functions are on the unlocked
> ioctl path and should continue using GFP_KERNEL_ACCOUNT.
>
> Suggested-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
> drivers/virt/coco/sev-guest/sev-guest.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
> index 62328d0b2cb6..250ce92d816b 100644
> --- a/drivers/virt/coco/sev-guest/sev-guest.c
> +++ b/drivers/virt/coco/sev-guest/sev-guest.c
> @@ -141,7 +141,7 @@ static struct aesgcm_ctx *snp_init_crypto(u8 *key, size_t keylen)
> {
> struct aesgcm_ctx *ctx;
>
> - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> if (!ctx)
> return NULL;
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 12:37 ` Borislav Petkov
@ 2025-01-07 18:53 ` Tom Lendacky
2025-01-07 19:18 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Tom Lendacky @ 2025-01-07 18:53 UTC (permalink / raw)
To: Borislav Petkov, Nikunj A. Dadhania
Cc: linux-kernel, x86, kvm, mingo, tglx, dave.hansen, pgonda, seanjc,
pbonzini, francescolavra.fl
On 1/7/25 06:37, Borislav Petkov wrote:
> On Tue, Jan 07, 2025 at 05:13:03PM +0530, Nikunj A. Dadhania wrote:
>>>> + case CC_ATTR_GUEST_SNP_SECURE_TSC:
>>>> + return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
>>>
>>> This is new here?
>>
>> Yes, this was suggested by Tom here [1]
>
> Either of you care to explain why this is needed?
>
>>>> + (sev_status & MSR_AMD64_SNP_SECURE_TSC);
>
> I would strongly assume that whatever sets MSR_AMD64_SNP_SECURE_TSC will have
> checked/set MSR_AMD64_SEV_SNP_ENABLED already.
Yes, but from a readability point of view this makes it perfectly clear
that Secure TSC is only for SNP guests.
It's not on a fast path, so I don't think it hurts anything by having
the extra check. But if you prefer the previous check, I'm ok with that.
Thanks,
Tom
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 18:53 ` Tom Lendacky
@ 2025-01-07 19:18 ` Borislav Petkov
2025-01-08 7:47 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 19:18 UTC (permalink / raw)
To: Tom Lendacky
Cc: Nikunj A. Dadhania, linux-kernel, x86, kvm, mingo, tglx,
dave.hansen, pgonda, seanjc, pbonzini, francescolavra.fl
On Tue, Jan 07, 2025 at 12:53:26PM -0600, Tom Lendacky wrote:
> Yes, but from a readability point of view this makes it perfectly clear
> that Secure TSC is only for SNP guests.
That would mean that we need to check SNP with every SNP-specific feature
which would be just silly. And all SNP feature bits have "SNP" in the name,
so...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-06 12:46 ` [PATCH v16 12/13] x86/tsc: Switch to native sched clock Nikunj A Dadhania
@ 2025-01-07 19:37 ` Borislav Petkov
2025-01-08 5:20 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 19:37 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
> Although the kernel switches over to stable TSC clocksource instead of
> PV clocksource, the scheduler still keeps on using PV clocks as the
> sched clock source. This is because KVM, Xen and VMWare, switch the
> paravirt sched clock handler in their init routines. HyperV is the
> only PV clock source that checks if the platform provides an invariant
> TSC and does not switch to PV sched clock.
So this below looks like some desperate hackery. Are you doing this because
kvm_sched_clock_init() does
paravirt_set_sched_clock(kvm_sched_clock_read);
?
If so, why don't you simply prevent that on STSC guests?
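IOW, something along these lines (an illustrative sketch only; the exact check
and its placement in kvmclock_init()/kvm_sched_clock_init() are up to you):
	/* Keep native_sched_clock() on Secure TSC guests (sketch) */
	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		return;
	paravirt_set_sched_clock(kvm_sched_clock_read);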
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-06 12:46 ` [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests Nikunj A Dadhania
2025-01-07 10:42 ` Borislav Petkov
@ 2025-01-07 19:46 ` Tom Lendacky
2025-01-07 19:53 ` Borislav Petkov
1 sibling, 1 reply; 49+ messages in thread
From: Tom Lendacky @ 2025-01-07 19:46 UTC (permalink / raw)
To: Nikunj A Dadhania, linux-kernel, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini,
francescolavra.fl
On 1/6/25 06:46, Nikunj A Dadhania wrote:
> Add support for Secure TSC in SNP-enabled guests. Secure TSC allows guests
> to securely use RDTSC/RDTSCP instructions, ensuring that the parameters
> used cannot be altered by the hypervisor once the guest is launched.
>
> Secure TSC-enabled guests need to query TSC information from the AMD
> Security Processor. This communication channel is encrypted between the AMD
> Security Processor and the guest, with the hypervisor acting merely as a
> conduit to deliver the guest messages to the AMD Security Processor. Each
> message is protected with AEAD (AES-256 GCM).
>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Although if you have to re-spin, it might make sense to use the __free()
support to reduce the number of gotos and labels in snp_get_tsc_info().
You could also use kfree_sensitive() to automatically do the
memzero_explicit(), too.
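Roughly what that could look like (a sketch only, untested; assumes the
linux/cleanup.h __free() scoped-cleanup support and kfree_sensitive(), with
the allocation names taken from the current patch):
	/* Auto-freed on every return path, so most e_free_* labels go away */
	struct snp_tsc_info_req *tsc_req __free(kfree) =
			kzalloc(sizeof(*tsc_req), GFP_KERNEL);
	struct snp_guest_request_ioctl *rio __free(kfree) =
			kzalloc(sizeof(*rio), GFP_KERNEL);
	struct snp_guest_req *req __free(kfree) =
			kzalloc(sizeof(*req), GFP_KERNEL);
	if (!tsc_req || !rio || !req)
		return -ENOMEM;
and, for the response buffer:
	/*
	 * Zeroes the sensitive response before freeing it, replacing the
	 * explicit memzero_explicit() + kfree() pair.
	 */
	kfree_sensitive(tsc_resp);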
> ---
> arch/x86/include/asm/sev-common.h | 1 +
> arch/x86/include/asm/sev.h | 21 ++++++
> arch/x86/include/asm/svm.h | 6 +-
> include/linux/cc_platform.h | 8 +++
> arch/x86/coco/core.c | 4 ++
> arch/x86/coco/sev/core.c | 107 ++++++++++++++++++++++++++++++
> arch/x86/mm/mem_encrypt.c | 2 +
> 7 files changed, 147 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 50f5666938c0..6ef92432a5ce 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -206,6 +206,7 @@ struct snp_psc_desc {
> #define GHCB_TERM_NO_SVSM 7 /* SVSM is not advertised in the secrets page */
> #define GHCB_TERM_SVSM_VMPL0 8 /* SVSM is present but has set VMPL to 0 */
> #define GHCB_TERM_SVSM_CAA 9 /* SVSM is present but CAA is not page aligned */
> +#define GHCB_TERM_SECURE_TSC 10 /* Secure TSC initialization failed */
>
> #define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
>
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 0937ac7a96db..bdcdaac4df1c 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -146,6 +146,9 @@ enum msg_type {
> SNP_MSG_VMRK_REQ,
> SNP_MSG_VMRK_RSP,
>
> + SNP_MSG_TSC_INFO_REQ = 17,
> + SNP_MSG_TSC_INFO_RSP,
> +
> SNP_MSG_TYPE_MAX
> };
>
> @@ -174,6 +177,21 @@ struct snp_guest_msg {
> u8 payload[PAGE_SIZE - sizeof(struct snp_guest_msg_hdr)];
> } __packed;
>
> +#define SNP_TSC_INFO_REQ_SZ 128
> +
> +struct snp_tsc_info_req {
> + u8 rsvd[SNP_TSC_INFO_REQ_SZ];
> +} __packed;
> +
> +struct snp_tsc_info_resp {
> + u32 status;
> + u32 rsvd1;
> + u64 tsc_scale;
> + u64 tsc_offset;
> + u32 tsc_factor;
> + u8 rsvd2[100];
> +} __packed;
> +
> struct snp_guest_req {
> void *req_buf;
> size_t req_sz;
> @@ -463,6 +481,8 @@ void snp_msg_free(struct snp_msg_desc *mdesc);
> int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
> struct snp_guest_request_ioctl *rio);
>
> +void __init snp_secure_tsc_prepare(void);
> +
> #else /* !CONFIG_AMD_MEM_ENCRYPT */
>
> #define snp_vmpl 0
> @@ -503,6 +523,7 @@ static inline struct snp_msg_desc *snp_msg_alloc(void) { return NULL; }
> static inline void snp_msg_free(struct snp_msg_desc *mdesc) { }
> static inline int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req,
> struct snp_guest_request_ioctl *rio) { return -ENODEV; }
> +static inline void __init snp_secure_tsc_prepare(void) { }
>
> #endif /* CONFIG_AMD_MEM_ENCRYPT */
>
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 2b59b9951c90..92e18798f197 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -417,7 +417,9 @@ struct sev_es_save_area {
> u8 reserved_0x298[80];
> u32 pkru;
> u32 tsc_aux;
> - u8 reserved_0x2f0[24];
> + u64 tsc_scale;
> + u64 tsc_offset;
> + u8 reserved_0x300[8];
> u64 rcx;
> u64 rdx;
> u64 rbx;
> @@ -564,7 +566,7 @@ static inline void __unused_size_checks(void)
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x1c0);
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x248);
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x298);
> - BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x2f0);
> + BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x300);
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x320);
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x380);
> BUILD_BUG_RESERVED_OFFSET(sev_es_save_area, 0x3f0);
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index caa4b4430634..0bf7d33a1048 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -81,6 +81,14 @@ enum cc_attr {
> */
> CC_ATTR_GUEST_SEV_SNP,
>
> + /**
> + * @CC_ATTR_GUEST_SNP_SECURE_TSC: SNP Secure TSC is active.
> + *
> + * The platform/OS is running as a guest/virtual machine and actively
> + * using AMD SEV-SNP Secure TSC feature.
> + */
> + CC_ATTR_GUEST_SNP_SECURE_TSC,
> +
> /**
> * @CC_ATTR_HOST_SEV_SNP: AMD SNP enabled on the host.
> *
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index 0f81f70aca82..715c2c09582f 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -97,6 +97,10 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
> case CC_ATTR_GUEST_SEV_SNP:
> return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
>
> + case CC_ATTR_GUEST_SNP_SECURE_TSC:
> + return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
> + (sev_status & MSR_AMD64_SNP_SECURE_TSC);
> +
> case CC_ATTR_HOST_SEV_SNP:
> return cc_flags.host_sev_snp;
>
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index feb145df6bf7..00a0ac3baab7 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -96,6 +96,14 @@ static u64 sev_hv_features __ro_after_init;
> /* Secrets page physical address from the CC blob */
> static u64 secrets_pa __ro_after_init;
>
> +/*
> + * For Secure TSC guests, the BSP fetches TSC_INFO using SNP guest messaging and
> + * initializes snp_tsc_scale and snp_tsc_offset. These values are replicated
> + * across the APs VMSA fields (TSC_SCALE and TSC_OFFSET).
> + */
> +static u64 snp_tsc_scale __ro_after_init;
> +static u64 snp_tsc_offset __ro_after_init;
> +
> /* #VC handler runtime per-CPU data */
> struct sev_es_runtime_data {
> struct ghcb ghcb_page;
> @@ -1272,6 +1280,12 @@ static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip)
> vmsa->vmpl = snp_vmpl;
> vmsa->sev_features = sev_status >> 2;
>
> + /* Populate AP's TSC scale/offset to get accurate TSC values. */
> + if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> + vmsa->tsc_scale = snp_tsc_scale;
> + vmsa->tsc_offset = snp_tsc_offset;
> + }
> +
> /* Switch the page over to a VMSA page now that it is initialized */
> ret = snp_set_vmsa(vmsa, caa, apic_id, true);
> if (ret) {
> @@ -3121,3 +3135,96 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
> return 0;
> }
> EXPORT_SYMBOL_GPL(snp_send_guest_request);
> +
> +static int __init snp_get_tsc_info(void)
> +{
> + struct snp_guest_request_ioctl *rio;
> + struct snp_tsc_info_resp *tsc_resp;
> + struct snp_tsc_info_req *tsc_req;
> + struct snp_msg_desc *mdesc;
> + struct snp_guest_req *req;
> + int rc = -ENOMEM;
> +
> + tsc_req = kzalloc(sizeof(*tsc_req), GFP_KERNEL);
> + if (!tsc_req)
> + return rc;
> +
> + /*
> + * The intermediate response buffer is used while decrypting the
> + * response payload. Make sure that it has enough space to cover
> + * the authtag.
> + */
> + tsc_resp = kzalloc(sizeof(*tsc_resp) + AUTHTAG_LEN, GFP_KERNEL);
> + if (!tsc_resp)
> + goto e_free_tsc_req;
> +
> + req = kzalloc(sizeof(*req), GFP_KERNEL);
> + if (!req)
> + goto e_free_tsc_resp;
> +
> + rio = kzalloc(sizeof(*rio), GFP_KERNEL);
> + if (!rio)
> + goto e_free_req;
> +
> + mdesc = snp_msg_alloc();
> + if (IS_ERR_OR_NULL(mdesc))
> + goto e_free_rio;
> +
> + rc = snp_msg_init(mdesc, snp_vmpl);
> + if (rc)
> + goto e_free_mdesc;
> +
> + req->msg_version = MSG_HDR_VER;
> + req->msg_type = SNP_MSG_TSC_INFO_REQ;
> + req->vmpck_id = snp_vmpl;
> + req->req_buf = tsc_req;
> + req->req_sz = sizeof(*tsc_req);
> + req->resp_buf = (void *)tsc_resp;
> + req->resp_sz = sizeof(*tsc_resp) + AUTHTAG_LEN;
> + req->exit_code = SVM_VMGEXIT_GUEST_REQUEST;
> +
> + rc = snp_send_guest_request(mdesc, req, rio);
> + if (rc)
> + goto e_request;
> +
> + pr_debug("%s: response status 0x%x scale 0x%llx offset 0x%llx factor 0x%x\n",
> + __func__, tsc_resp->status, tsc_resp->tsc_scale, tsc_resp->tsc_offset,
> + tsc_resp->tsc_factor);
> +
> + if (!tsc_resp->status) {
> + snp_tsc_scale = tsc_resp->tsc_scale;
> + snp_tsc_offset = tsc_resp->tsc_offset;
> + } else {
> + pr_err("Failed to get TSC info, response status 0x%x\n", tsc_resp->status);
> + rc = -EIO;
> + }
> +
> +e_request:
> + /* The response buffer contains sensitive data, explicitly clear it. */
> + memzero_explicit(tsc_resp, sizeof(*tsc_resp) + AUTHTAG_LEN);
> +e_free_mdesc:
> + snp_msg_free(mdesc);
> +e_free_rio:
> + kfree(rio);
> +e_free_req:
> + kfree(req);
> + e_free_tsc_resp:
> + kfree(tsc_resp);
> +e_free_tsc_req:
> + kfree(tsc_req);
> +
> + return rc;
> +}
> +
> +void __init snp_secure_tsc_prepare(void)
> +{
> + if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> + return;
> +
> + if (snp_get_tsc_info()) {
> + pr_alert("Unable to retrieve Secure TSC info from ASP\n");
> + sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SECURE_TSC);
> + }
> +
> + pr_debug("SecureTSC enabled");
> +}
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 0a120d85d7bb..95bae74fdab2 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -94,6 +94,8 @@ void __init mem_encrypt_init(void)
> /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> swiotlb_update_mem_attributes();
>
> + snp_secure_tsc_prepare();
> +
> print_mem_encrypt_feature_info();
> }
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 19:46 ` Tom Lendacky
@ 2025-01-07 19:53 ` Borislav Petkov
0 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2025-01-07 19:53 UTC (permalink / raw)
To: Tom Lendacky
Cc: Nikunj A Dadhania, linux-kernel, x86, kvm, mingo, tglx,
dave.hansen, pgonda, seanjc, pbonzini, francescolavra.fl
On Tue, Jan 07, 2025 at 01:46:49PM -0600, Tom Lendacky wrote:
> Although if you have to re-spin, it might make sense to use the __free()
> support to reduce the number of gotos and labels in snp_get_tsc_info().
> You could also use kfree_sensitive() to automatically do the
> memzero_explicit(), too.
Let's keep that for future cleanups, pls, when there's time.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
2025-01-06 12:46 ` [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests Nikunj A Dadhania
@ 2025-01-07 20:09 ` Tom Lendacky
0 siblings, 0 replies; 49+ messages in thread
From: Tom Lendacky @ 2025-01-07 20:09 UTC (permalink / raw)
To: Nikunj A Dadhania, linux-kernel, bp, x86
Cc: kvm, mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini,
francescolavra.fl
On 1/6/25 06:46, Nikunj A Dadhania wrote:
> Secure TSC enabled guests should not write to the MSR_IA32_TSC(10H)
> register as the subsequent TSC value reads are undefined. For AMD platform,
> MSR_IA32_TSC is intercepted by the hypervisor. MSR_IA32_TSC read/write
> accesses should not exit to the hypervisor for such guests.
>
> Accesses to MSR_IA32_TSC needs special handling in the #VC handler for the
> guests with Secure TSC enabled. Writes to MSR_IA32_TSC should be ignored
> and flagged once with a warning, and reads of MSR_IA32_TSC should return
> the result of the RDTSC instruction.
>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
> arch/x86/coco/sev/core.c | 39 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index 00a0ac3baab7..f49d3e97e170 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -1428,6 +1428,34 @@ static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
> return ES_OK;
> }
>
> +/*
> + * TSC related accesses should not exit to the hypervisor when a guest is
> + * executing with Secure TSC enabled, so special handling is required for
> + * accesses of MSR_IA32_TSC.
> + */
> +static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
> +{
> + u64 tsc;
> +
> + /*
> + * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
> + * to return undefined values, so ignore all writes.
> + *
> + * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
> + * the value returned by rdtsc_ordered().
> + */
> + if (write) {
> + WARN_ONCE(1, "TSC MSR writes are verboten!\n");
> + return ES_OK;
> + }
> +
> + tsc = rdtsc_ordered();
> + regs->ax = lower_32_bits(tsc);
> + regs->dx = upper_32_bits(tsc);
> +
> + return ES_OK;
> +}
> +
> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> {
> struct pt_regs *regs = ctxt->regs;
> @@ -1437,8 +1465,17 @@ static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> /* Is it a WRMSR? */
> write = ctxt->insn.opcode.bytes[1] == 0x30;
>
> - if (regs->cx == MSR_SVSM_CAA)
> + switch (regs->cx) {
> + case MSR_SVSM_CAA:
> return __vc_handle_msr_caa(regs, write);
> + case MSR_IA32_TSC:
> + if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
> + return __vc_handle_secure_tsc_msrs(regs, write);
> + else
> + break;
> + default:
> + break;
> + }
>
> ghcb_set_rcx(ghcb, regs->cx);
> if (write) {
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-07 19:37 ` Borislav Petkov
@ 2025-01-08 5:20 ` Nikunj A. Dadhania
2025-01-08 8:22 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 5:20 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On 1/8/2025 1:07 AM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
>> Although the kernel switches over to stable TSC clocksource instead of
>> PV clocksource, the scheduler still keeps on using PV clocks as the
>> sched clock source. This is because KVM, Xen and VMWare switch the
>> paravirt sched clock handler in their init routines. HyperV is the
>> only PV clock source that checks if the platform provides an invariant
>> TSC and does not switch to PV sched clock.
>
> So this below looks like some desperate hackery. Are you doing this because
> kvm_sched_clock_init() does
>
> paravirt_set_sched_clock(kvm_sched_clock_read);
>
> ?
All guests (including STSC) running on KVM or other hypervisors like Xen/VMWare
use the regular TSC when the guest is booted with the TscInvariant bit set. But
the guest doesn't switch the sched clock to use TSC, which it should, as Sean
pointed out here[1]:
No, I'm saying that the guest should prefer the raw TSC over kvmclock if the
TSC is stable, irrespective of SNP or TDX. This is effectively already done
for the timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC
clocksource if invariant TSC is exposed")), but the scheduler still uses
kvmclock thanks to the kvm_sched_clock_init() code.
> If so, why don't you simply prevent that on STSC guests?
This is for all guests running on different hypervisors with a stable TSC. I
did try to limit this to KVM here [2]; here is Sean's comment on that
implementation:
> Although the kernel switches over to stable TSC clocksource instead of
> kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
> This is due to kvm_sched_clock_init() updating the pv_sched_clock()
> unconditionally.
All PV clocks are affected by this, no? This seems like something that
should be handled in common code, which is the point I was trying to make in
v11.
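For context, the mismatch can be seen from inside such a guest (illustrative
output; the exact messages differ across kernel versions): the clocksource is
already TSC, while the boot log still shows kvmclock being installed as
sched_clock:

  $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  tsc
  $ dmesg | grep -E 'kvm-clock|Switched to clocksource'
  kvm-clock: using sched offset of ... cycles
  clocksource: Switched to clocksource tsc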
Regards
Nikunj
1. https://lore.kernel.org/kvm/ZurCbP7MesWXQbqZ@google.com/
2. https://lore.kernel.org/kvm/20241009092850.197575-17-nikunj@amd.com/
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-07 19:18 ` Borislav Petkov
@ 2025-01-08 7:47 ` Nikunj A. Dadhania
2025-01-08 8:05 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 7:47 UTC (permalink / raw)
To: Borislav Petkov, Tom Lendacky
Cc: linux-kernel, x86, kvm, mingo, tglx, dave.hansen, pgonda, seanjc,
pbonzini, francescolavra.fl
On 1/8/2025 12:48 AM, Borislav Petkov wrote:
> On Tue, Jan 07, 2025 at 12:53:26PM -0600, Tom Lendacky wrote:
>> Yes, but from a readability point of view this makes it perfectly clear
>> that Secure TSC is only for SNP guests.
>
> That would mean that we need to check SNP with every SNP-specific feature
> which would be just silly. And all SNP feature bits have "SNP" in the name
> so...
>
Right, here is the updated diff:
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 715c2c09582f..d6647953590b 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -98,8 +98,7 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
case CC_ATTR_GUEST_SNP_SECURE_TSC:
- return (sev_status & MSR_AMD64_SEV_SNP_ENABLED) &&
- (sev_status & MSR_AMD64_SNP_SECURE_TSC);
+ return sev_status & MSR_AMD64_SNP_SECURE_TSC;
case CC_ATTR_HOST_SEV_SNP:
return cc_flags.host_sev_snp;
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 00a0ac3baab7..763cfeb65b2f 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -3218,7 +3218,8 @@ static int __init snp_get_tsc_info(void)
void __init snp_secure_tsc_prepare(void)
{
- if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) ||
+ !cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
return;
if (snp_get_tsc_info()) {
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-08 7:47 ` Nikunj A. Dadhania
@ 2025-01-08 8:05 ` Borislav Petkov
2025-01-08 8:37 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-08 8:05 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: Tom Lendacky, linux-kernel, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On Wed, Jan 08, 2025 at 01:17:11PM +0530, Nikunj A. Dadhania wrote:
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index 00a0ac3baab7..763cfeb65b2f 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -3218,7 +3218,8 @@ static int __init snp_get_tsc_info(void)
>
> void __init snp_secure_tsc_prepare(void)
> {
> - if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) ||
> + !cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
So how does moving the CC_ATTR_GUEST_SEV_SNP check here make any sense?
I simply zapped the MSR_AMD64_SEV_SNP_ENABLED check above locally.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 5:20 ` Nikunj A. Dadhania
@ 2025-01-08 8:22 ` Borislav Petkov
2025-01-08 8:34 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-08 8:22 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> All guests (including STSC) running on KVM or other hypervisors like Xen/VMWare
> use the regular TSC when the guest is booted with the TscInvariant bit set. But
> the guest doesn't switch the sched clock to use TSC, which it should, as Sean
> pointed out here[1]:
Well, that paravirt_set_sched_clock() thing is there for a reason and all the
different guest types which call it perhaps have their reasons.
Now, if you think it would make sense to use the TSC in every guest type if
the TSC is this and that, then I'd need to see a patch or even a patchset
which explains why it is ok to do so on every guest type and then tested on
every guest type and then each patch would convert a single guest type and
properly explain why.
And your patch doesn't have that at all. I know Sean thinks it is a good idea
and perhaps it is but without proper justification and acks from the other
guest type maintainers, I simply won't take such a patch(-set).
And even if it is good, you need to solve it another way - not with this
delayed-work hackery.
IOW, this switching the paravirt clock crap to use TSC when the TSC is this
and that and the HV tells you so, should be a wholly separate patchset with
reviews and acks and the usual shebang.
If you want to take care only of STSC now, I'd take a patch which is known
good and tested properly. And that should happen very soon because the merge
window is closing in. Otherwise, there's always another merge window and
rushing things doesn't do any good anyway, as I keep preaching...
Your call.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 8:22 ` Borislav Petkov
@ 2025-01-08 8:34 ` Nikunj A. Dadhania
2025-01-08 10:20 ` Nikunj A. Dadhania
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 8:34 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On 1/8/2025 1:52 PM, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> And your patch doesn't have that at all. I know Sean thinks it is a good idea
> and perhaps it is but without proper justification and acks from the other
> guest type maintainers, I simply won't take such a patch(-set).
I haven't heard anything from other guest type maintainers yet, and they are
on CC from v12 onward.
> And even if it is good, you need to solve it another way - not with this
> delayed-work hackery.
>
> IOW, this switching the paravirt clock crap to use TSC when the TSC is this
> and that and the HV tells you so, should be a wholly separate patchset with
> reviews and acks and the usual shebang.
I agree, we should handle this as a separate patch series.
> If you want to take care only of STSC now, I'd take a patch which is known
> good and tested properly. And that should happen very soon because the merge
> window is closing in.
In that case, let me limit this only to STSC for now; I will send an updated patch.
Regards,
Nikunj
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-08 8:05 ` Borislav Petkov
@ 2025-01-08 8:37 ` Nikunj A. Dadhania
2025-01-08 8:43 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 8:37 UTC (permalink / raw)
To: Borislav Petkov
Cc: Tom Lendacky, linux-kernel, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On 1/8/2025 1:35 PM, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 01:17:11PM +0530, Nikunj A. Dadhania wrote:
>> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
>> index 00a0ac3baab7..763cfeb65b2f 100644
>> --- a/arch/x86/coco/sev/core.c
>> +++ b/arch/x86/coco/sev/core.c
>> @@ -3218,7 +3218,8 @@ static int __init snp_get_tsc_info(void)
>>
>> void __init snp_secure_tsc_prepare(void)
>> {
>> - if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
>> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) ||
>> + !cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
>
> So how does moving the CC_ATTR_GUEST_SEV_SNP check here make any sense?
From the comment that you gave here[1], I understood that this check has
to be pushed into snp_secure_tsc_prepare().
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 996ca27f0b72..95bae74fdab2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -94,9 +94,7 @@ void __init mem_encrypt_init(void)
/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
swiotlb_update_mem_attributes();
- /* Initialize SNP Secure TSC */
- if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
- snp_secure_tsc_prepare();
+ snp_secure_tsc_prepare();
print_mem_encrypt_feature_info();
}
> I simply zapped the MSR_AMD64_SEV_SNP_ENABLED check above locally.
For SEV and SEV-ES the SecureTSC bit should not be set, so I think we should be
fine without the MSR_AMD64_SEV_SNP_ENABLED check.
Regards
Nikunj
1. https://lore.kernel.org/kvm/20241111103434.GAZzHduouKi4LBwbg8@fat_crate.local/
* Re: [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests
2025-01-08 8:37 ` Nikunj A. Dadhania
@ 2025-01-08 8:43 ` Borislav Petkov
0 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2025-01-08 8:43 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: Tom Lendacky, linux-kernel, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On Wed, Jan 08, 2025 at 02:07:07PM +0530, Nikunj A. Dadhania wrote:
> >> void __init snp_secure_tsc_prepare(void)
> >> {
> >> - if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> >> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) ||
> >> + !cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> >
> > So how does moving the CC_ATTR_GUEST_SEV_SNP check here make any sense?
>
> In the comment that you gave here[1], I understood that this check has
> to be pushed to snp_secure_tsc_prepare().
The check should be pushed there but you should not check
CC_ATTR_GUEST_SEV_SNP *and* CC_ATTR_GUEST_SNP_SECURE_TSC if all you wanna do
is check if you're running a STSC guest. That was the whole point.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 8:34 ` Nikunj A. Dadhania
@ 2025-01-08 10:20 ` Nikunj A. Dadhania
2025-01-08 14:00 ` Sean Christopherson
0 siblings, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 10:20 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
>
>> If you want to take care only of STSC now, I'd take a patch which is known
>> good and tested properly. And that should happen very soon because the merge
>> window is closing in.
>
> In that case, let me limit this only to STSC for now; I will send an updated patch.
From: Nikunj A Dadhania <nikunj@amd.com>
Date: Wed, 8 Jan 2025 14:18:04 +0530
Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
TSC guests
Although the kernel switches over to a stable TSC clocksource instead of
kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
This is due to kvm_sched_clock_init() updating the pv_sched_clock()
unconditionally. Do not override the PV sched clock when Secure TSC is
enabled.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
arch/x86/kernel/kvmclock.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d8fef3a65a35..82c4743a5e7a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -324,8 +324,10 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
- flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
- kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+ if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+ kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+ }
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
--
2.34.1
* Re: [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock
2025-01-07 15:16 ` Borislav Petkov
@ 2025-01-08 10:45 ` Nikunj A. Dadhania
0 siblings, 0 replies; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-08 10:45 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, seanjc, pbonzini, francescolavra.fl
On 1/7/2025 8:46 PM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:30PM +0530, Nikunj A Dadhania wrote:
>> static int kvm_cs_enable(struct clocksource *cs)
>> {
>> + /*
>> + * TSC clocksource should be used for a guest with Secure TSC enabled,
>> + * taint the kernel and warn when the user changes the clocksource to
>> + * kvm-clock.
>> + */
>> + if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
>> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>> + WARN_ONCE(1, "For Secure TSC guest, changing the clocksource is not allowed!\n");
>
> So this thing is trying to state that changing the clocksource is not allowed
> but it still allows it. Why not simply do this:
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 960260a8d884..d8fef3a65a35 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -151,14 +151,10 @@ bool kvm_check_and_clear_guest_paused(void)
>
> static int kvm_cs_enable(struct clocksource *cs)
> {
> - /*
> - * TSC clocksource should be used for a guest with Secure TSC enabled,
> - * taint the kernel and warn when the user changes the clocksource to
> - * kvm-clock.
> - */
> + /* Only the TSC should be used in a Secure TSC guest. */
> if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> - add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> - WARN_ONCE(1, "For Secure TSC guest, changing the clocksource is not allowed!\n");
> + WARN_ONCE(1, "Secure TSC guest, changing the clocksource is not allowed!\n");
> + return 1;
> }
>
> vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK);
>
> ?
Works as expected:
$ echo 'kvm-clock' > /sys/devices/system/clocksource/clocksource0/current_clocksource
[ 30.333603] ------------[ cut here ]------------
[ 30.334802] Secure TSC guest, changing the clocksource is not allowed!
[ 30.336460] WARNING: CPU: 0 PID: 19 at arch/x86/kernel/kvmclock.c:156 kvm_cs_enable+0x57/0x70
Regards
Nikunj
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 10:20 ` Nikunj A. Dadhania
@ 2025-01-08 14:00 ` Sean Christopherson
2025-01-08 15:34 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2025-01-08 14:00 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: Borislav Petkov, linux-kernel, thomas.lendacky, x86, kvm, mingo,
tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Wed, Jan 08, 2025, Nikunj A. Dadhania wrote:
>
> On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
> >
> >> If you want to take care only of STSC now, I'd take a patch which is known
> >> good and tested properly. And that should happen very soon because the merge
> >> window is closing in.
> >
> > In that case, let me limit this only to STSC for now, i will send updated patch.
>
> From: Nikunj A Dadhania <nikunj@amd.com>
> Date: Wed, 8 Jan 2025 14:18:04 +0530
> Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
> TSC guests
>
> Although the kernel switches over to a stable TSC clocksource instead of
> kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
> This is due to kvm_sched_clock_init() updating the pv_sched_clock()
> unconditionally. Do not override the PV sched clock when Secure TSC is
> enabled.
>
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
> arch/x86/kernel/kvmclock.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d8fef3a65a35..82c4743a5e7a 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -324,8 +324,10 @@ void __init kvmclock_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
> pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>
> - flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> - kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> + if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> + flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> + }
This still misses my point. Ditto for the "x86/tsc: Switch Secure TSC guests away
from kvm-clock".
I object to singling out kvmclock. It's weird and misleading, because handling
only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
reality the only reason kvmclock is getting singled out is (presumably) because
it's what Nikunj and the other folks enabling KVM SNP test on.
What I care most about is having a sane, consistent policy throughout the kernel.
E.g. so that a user/reader walks away with an understanding that PV clocks are a
theoretical host attack vector and so should be avoided when Secure TSC is
available.
Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris that
it needs buy-in from other maintainers (including Paolo), because it's entirely
possible (likely, even) that there's an angle to scheduling I'm not considering.
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 14:00 ` Sean Christopherson
@ 2025-01-08 15:34 ` Borislav Petkov
2025-01-08 17:02 ` Sean Christopherson
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-08 15:34 UTC (permalink / raw)
To: Sean Christopherson
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> This still misses my point.
Oh I got it, no worries.
> Ditto for the "x86/tsc: Switch Secure TSC guests away from kvm-clock".
>
> I object to singling out kvmclock. It's weird and misleading, because handling
> only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
> reality the only reason kvmclock is getting singled out is (presumably) because
> it's what Nikunj and the other folks enabling KVM SNP test on.
Presumably.
> What I care most about is having a sane, consistent policy throughout the kernel.
> E.g. so that a user/reader walks away with an understanding PV clocks are a
> theoretical host attack vector and so should be avoided when Secure TSC is
> available.
Yap, agreed.
> Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris that
> it needs buy-in from other maintainers (including Paolo), because it's entirely
> possible (likely, even) that there's an angle to scheduling I'm not considering.
That's exactly why I wanted to have this taken care of only for the STSC side
of things now and temporarily. So that we can finally land those STSC patches
- they've been pending for waaay too long.
And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
kill some of those old clocksources which probably don't matter anymore.
But your call how/when you wanna do this.
If you want the cleanup first, I'll take only a subset of the STSC set so that
I can unload some of that set upstream.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 15:34 ` Borislav Petkov
@ 2025-01-08 17:02 ` Sean Christopherson
2025-01-08 19:53 ` Borislav Petkov
2025-01-09 6:32 ` Nikunj A. Dadhania
0 siblings, 2 replies; 49+ messages in thread
From: Sean Christopherson @ 2025-01-08 17:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Wed, Jan 08, 2025, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> > Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> > TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris that
> > it needs buy-in from other maintainers (including Paolo), because it's entirely
> > possible (likely, even) that there's an angle to scheduling I'm not considering.
>
> That's exactly why I wanted to have this taken care of only for the STSC side
> of things now and temporarily. So that we can finally land those STSC patches
> - they've been pending for waaay too long.
>
> And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
> kill some of those old clocksources which probably don't matter anymore.
>
> But your call how/when you wanna do this.
I'm okay starting with just TDX and SNP guests, but I don't want to special case
SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.
For TDX guests, the TSC is _always_ "secure". So similar to singling out kvmclock,
handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
state. Which is why I originally suggested[*] fixing the sched_clock mess
generically; doing so would avoid the need to special case SNP or TDX in code
that doesn't/shouldn't care about SNP or TDX.
[*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com
> If you want the cleanup first, I'll take only a subset of the STSC set so that
> I can unload some of that set upstream.
My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
only the snp_secure_tsc_init() related changes (which is how that patch should
be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
nothing to do with kvmclock).
And then figure out how to wrangle clocksource and sched_clock in a way that is
sane and consistent.
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 17:02 ` Sean Christopherson
@ 2025-01-08 19:53 ` Borislav Petkov
2025-01-09 6:32 ` Nikunj A. Dadhania
1 sibling, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2025-01-08 19:53 UTC (permalink / raw)
To: Sean Christopherson
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Wed, Jan 08, 2025 at 09:02:34AM -0800, Sean Christopherson wrote:
> I'm okay starting with just TDX and SNP guests, but I don't want to special case
> SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.
>
> For TDX guests, the TSC is _always_ "secure".
Ah, good to know. I was wondering what the situation in TDX land is wrt TSC...
> So similar to singling out kvmclock,
> handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> state. Which is why I originally suggested[*] fixing the sched_clock mess
> generically; doing so would avoid the need to special case SNP or TDX in code
> that doesn't/shouldn't care about SNP or TDX.
Right.
> My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
> and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
> only the snp_secure_tsc_init() related changes (which is how that patch should
> be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
> nothing to do with kvmclock).
>
> And then figure out how to wrangle clocksource and sched_clock in a way that is
> sane and consistent.
Makes sense to me, I'll carve out the bits.
Nikunj, I'd appreciate it if you gathered whatever bits are left wrt kvmclock
and took care of things as Sean suggests above.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-08 17:02 ` Sean Christopherson
2025-01-08 19:53 ` Borislav Petkov
@ 2025-01-09 6:32 ` Nikunj A. Dadhania
2025-01-15 21:37 ` Sean Christopherson
1 sibling, 1 reply; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-09 6:32 UTC (permalink / raw)
To: Sean Christopherson, Borislav Petkov
Cc: linux-kernel, thomas.lendacky, x86, kvm, mingo, tglx, dave.hansen,
pgonda, pbonzini, francescolavra.fl, Alexey Makhalov,
Juergen Gross, Boris Ostrovsky
On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Borislav Petkov wrote:
>> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> For TDX guests, the TSC is _always_ "secure". So similar to singling out kvmclock,
> handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> state. Which is why I originally suggested[*] fixing the sched_clock mess
> generically; doing so would avoid the need to special case SNP or TDX in code
> that doesn't/shouldn't care about SNP or TDX.
That is what I have attempted in this patch[1]: irrespective of SNP/TDX, whenever
TSC is picked as the clocksource, sched_clock will start using TSC instead of any
PV sched clock. This does not have any special case for STSC/SNP/TDX.
> [*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com
Regards
Nikunj
1. https://lore.kernel.org/kvm/20250106124633.1418972-13-nikunj@amd.com/
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-09 6:32 ` Nikunj A. Dadhania
@ 2025-01-15 21:37 ` Sean Christopherson
2025-01-16 16:25 ` Borislav Petkov
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Sean Christopherson @ 2025-01-15 21:37 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: Borislav Petkov, linux-kernel, thomas.lendacky, x86, kvm, mingo,
tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Thu, Jan 09, 2025, Nikunj A. Dadhania wrote:
> On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> > On Wed, Jan 08, 2025, Borislav Petkov wrote:
> >> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
>
> > For TDX guests, the TSC is _always_ "secure". So similar to singling out kvmclock,
> > handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> > state. Which is why I originally suggested[*] fixing the sched_clock mess
> > generically; doing so would avoid the need to special case SNP or TDX in code
> > that doesn't/shouldn't care about SNP or TDX.
>
> That is what I have attempted in this patch[1] where irrespective of SNP/TDX, whenever
> TSC is picked up as a clock source, sched-clock will start using TSC instead of any
> PV sched clock. This does not have any special case for STSC/SNP/TDX.
*sigh*
This is a complete and utter trainwreck.
Paolo had an idea to handle this without a deferred callback by having kvmclock
detect if kvmclock is selected as the clocksource, OR if the TSC is unreliable,
i.e. if the kernel may switch to kvmclock due to the TSC being marked unreliable.
Unfortunately, that idea doesn't work because the ordering is all wrong. The
*early* TSC initialization in setup_arch() happens after kvmclock_init() (via
init_hypervisor_platform())
init_hypervisor_platform();
tsc_early_init();
and even if we mucked with that, __clocksource_select() very deliberately doesn't
change the clocksource until the kernel is "finished" booting, in order to avoid
thrashing the clocksource.
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
{
struct clocksource *cs;
if (!finished_booting || list_empty(&clocksource_list)) <===
return NULL;
...
}
/*
* clocksource_done_booting - Called near the end of core bootup
*
* Hack to avoid lots of clocksource churn at boot time.
* We use fs_initcall because we want this to start before
* device_initcall but after subsys_initcall.
*/
static int __init clocksource_done_booting(void)
{
mutex_lock(&clocksource_mutex);
curr_clocksource = clocksource_default_clock();
finished_booting = 1;
/*
* Run the watchdog first to eliminate unstable clock sources
*/
__clocksource_watchdog_kthread();
clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
}
fs_initcall(clocksource_done_booting);
I fiddled with a variety of ideas to try and let kvmclock tell sched_clock that
it prefers the TSC, e.g. if TSC_RELIABLE is set, but after seeing this comment
in native_sched_clock():
/*
* Fall back to jiffies if there's no TSC available:
* ( But note that we still use it if the TSC is marked
* unstable. We do this because unlike Time Of Day,
* the scheduler clock tolerates small errors and it's
* very important for it to be as fast as the platform
* can achieve it. )
*/
My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
nonstop, and not marked unstable via command line. I.e. use the same criteria as
tweaking the clocksource rating. As above, sched_clock is more tolerant of slop
than clocksource, so it's a bit ridiculous to care whether the TSC or kvmclock
(or something else entirely) is used for the clocksource.
If we wanted to go with a more conservative approach, e.g. to minimize the risk
of breaking existing setups, we could also condition the change on the TSC being
reliable and having a known frequency. I.e. require SNP's Secure TSC, or require
the hypervisor to enumerate the TSC frequency via CPUID. I don't see a ton of
value in that approach though, and long-term it would lead to some truly weird
code due to holding sched_clock to a higher standard than clocksource.
But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
.calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
the TSC frequency using the information provided by KVM, i.e. the untrusted host.
The "obvious" solution is to leave the calibration functions as-is if the TSC has
a known, reliable frequency, but even _that_ is riddled with problems, because
as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
runs *after* init_hypervisor_platform(). SNP Secure TSC fudges around this by
overriding the calibration routines, but that's a bit gross and easy to fix if we
also fix TDX. And fixing TDX by running native_calibrate_tsc() would give the
same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.
All in all, I'm thinking something like this (across multiple patches):
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/kvmclock.c | 55 ++++++++++++++++++++++++++------------
arch/x86/kernel/setup.c | 7 +++++
arch/x86/kernel/tsc.c | 32 +++++++++++++++++-----
4 files changed, 72 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..e13d6c3f2298 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -28,6 +28,7 @@ static inline cycles_t get_cycles(void)
}
#define get_cycles get_cycles
+extern void tsc_early_detect(void);
extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..fa6bf71cc511 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -317,33 +317,54 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
- flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
- kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+ /*
+ * If the TSC counts at a constant frequency across P/T states, counts
+ * in deep C-states, and the TSC hasn't been marked unstable, prefer
+ * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+ * that TSC is chosen as the clocksource. Note, the TSC unstable check
+ * exists purely to honor the TSC being marked unstable via command
+ * line, any runtime detection of an unstable will happen after this.
+ */
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ !check_tsc_unstable()) {
+ kvm_clock.rating = 299;
+ pr_warn("kvm-clock: Using native sched_clock\n");
+ } else {
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+ kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+ }
+
+ /*
+ * If the TSC frequency is already known, e.g. via CPUID, rely on that
+ * information. For "normal" VMs, the hypervisor controls kvmclock and
+ * CPUID, i.e. the frequency is coming from the same place. For CoCo
+ * VMs, the TSC frequency may be provided by trusted firmware, in which
+ * case it's highly desirable to use that information, not kvmclock's.
+ * Note, TSC_KNOWN_FREQ must be read before presetting loops-per-jiffy,
+ * (see kvm_get_tsc_khz()).
+ */
+ if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) ||
+ !boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+ pr_warn("kvm-clock: Using native calibration\n");
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+ }
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.calibrate_cpu = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
+ /*
+ * Save/restore "sched" clock state even if kvmclock isn't being used
+ * for sched_clock, as kvmclock is still used for wallclock and relies
+ * on these hooks to re-enable kvmclock after suspend+resume.
+ */
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
kvm_get_preset_lpj();
- /*
- * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
- * with P/T states and does not stop in deep C-states.
- *
- * Invariant TSC exposed by host means kvmclock is not necessary:
- * can use TSC as clocksource.
- *
- */
- if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
- boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
- !check_tsc_unstable())
- kvm_clock.rating = 299;
-
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..2b6800426349 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -907,6 +907,13 @@ void __init setup_arch(char **cmdline_p)
reserve_ibft_region();
x86_init.resources.dmi_setup();
+ /*
+ * Detect, but do not fully initialize, TSC info before initializing
+ * the hypervisor platform, so that hypervisor code can make informed
+ * decisions about using a paravirt clock vs. TSC.
+ */
+ tsc_early_detect();
+
/*
* VMware detection requires dmi to be available, so this
* needs to be done after dmi_setup(), for the boot CPU.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0864b314c26a..9baffb425386 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
unsigned int crystal_khz;
- if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ /*
+ * Ignore the vendor when running as a VM, if the hypervisor provides
+ * garbage CPUID information then the vendor is also suspect.
+ */
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
+ !boot_cpu_has(X86_FEATURE_HYPERVISOR))
return 0;
if (boot_cpu_data.cpuid_level < 0x15)
@@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
return 0;
/*
- * For Atom SoCs TSC is the only reliable clocksource.
- * Mark TSC reliable so no watchdog on it.
+ * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
+ * VM, any watchdog is going to be less reliable than the TSC as the
+ * watchdog source will be emulated in software. In both cases, mark
+ * the TSC reliable so that no watchdog runs on it.
*/
- if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
+ boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
#ifdef CONFIG_X86_LOCAL_APIC
@@ -1509,6 +1517,20 @@ static void __init tsc_enable_sched_clock(void)
static_branch_enable(&__use_tsc);
}
+void __init tsc_early_detect(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+
+ snp_secure_tsc_init();
+
+ /*
+ * Run through native TSC calibration to set TSC_KNOWN_FREQ and/or
+ * TSC_RELIABLE as appropriate.
+ */
+ native_calibrate_tsc();
+}
+
void __init tsc_early_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
@@ -1517,8 +1539,6 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
- snp_secure_tsc_init();
-
if (!determine_cpu_tsc_frequencies(true))
return;
tsc_enable_sched_clock();
base-commit: 0563ee35ae2c9cfb0c6a7b2c0ddf7d9372bb8a98
--
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-15 21:37 ` Sean Christopherson
@ 2025-01-16 16:25 ` Borislav Petkov
2025-01-16 16:56 ` Sean Christopherson
2025-01-21 3:59 ` Nikunj A. Dadhania
2025-01-28 5:41 ` Nikunj A Dadhania
2 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-16 16:25 UTC (permalink / raw)
To: Sean Christopherson
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.
So how do we deal with the case here where a malicious HV could set those bits
and still tweak the TSC?
IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
in a guest.
If anything, trusting the TSC in a normal guest should at least issue
a warning saying that we do use the TSC but there's no 100% guarantee that it
is trustworthy...
> But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.
Yeah, I guess we don't want that. Or at least we should warn about it.
> + /*
> + * If the TSC counts at a constant frequency across P/T states, counts
> + * in deep C-states, and the TSC hasn't been marked unstable, prefer
> + * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
> + * that TSC is chosen as the clocksource. Note, the TSC unstable check
> + * exists purely to honor the TSC being marked unstable via command
> + * line, any runtime detection of an unstable will happen after this.
> + */
> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> + !check_tsc_unstable()) {
> + kvm_clock.rating = 299;
> + pr_warn("kvm-clock: Using native sched_clock\n");
The warn is in the right direction but probably should say TSC still cannot be
trusted 100%.
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
> unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
> unsigned int crystal_khz;
>
> - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> + /*
> + * Ignore the vendor when running as a VM, if the hypervisor provides
> + * garbage CPUID information then the vendor is also suspect.
> + */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> + !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return 0;
>
> if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
> return 0;
>
> /*
> - * For Atom SoCs TSC is the only reliable clocksource.
> - * Mark TSC reliable so no watchdog on it.
> + * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
> + * VM, any watchdog is going to be less reliable than the TSC as the
> + * watchdog source will be emulated in software. In both cases, mark
> + * the TSC reliable so that no watchdog runs on it.
> */
> - if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> + if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> + boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>
> #ifdef CONFIG_X86_LOCAL_APIC
It looks all wrong if a function called native_* sprinkles a bunch of "am
I a guest" checks. I guess we should split it into VM and native variants.
But yeah, the general direction is ok once we agree on what we do when
exactly.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-16 16:25 ` Borislav Petkov
@ 2025-01-16 16:56 ` Sean Christopherson
2025-01-17 20:28 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2025-01-16 16:56 UTC (permalink / raw)
To: Borislav Petkov
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Thu, Jan 16, 2025, Borislav Petkov wrote:
> On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> > My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
> > nonstop, and not marked unstable via command line.
>
> So how do we deal with the case here where a malicious HV could set those bits
> and still tweak the TSC?
You don't. If the hypervisor is malicious, it's game over for non-CoCo guests.
The clocksource being untrusted is completely uninteresting because the host has
full access to VM state.
It's probably game over for SEV and SEV-ES too, e.g. thanks to remapping attacks,
lack of robust attestation, and a variety of other avenues a truly malicious
hypervisor can exploit to attack the guest.
It's only with SNP and TDX that the clocksource becomes at all interesting.
> IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
> in a guest.
>
> If anything, trusting the TSC in a normal guest should at least issue
> a warning saying that we do use the TSC but there's no 100% guarantee that it
> is trustworthy...
Why? For non-TDX/SecureTSC guests, *all* clocksources are host controlled.
> > But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> > .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> > frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> > the TSC frequency using the information provided by KVM, i.e. the untrusted host.
>
> Yeah, I guess we don't want that. Or at least we should warn about it.
CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
would ideally assert that the kernel doesn't switch to some other calibration
method too. Not sure where to hook into that though, without bleeding TDX and
SNP details everywhere.
> > + /*
> > + * If the TSC counts at a constant frequency across P/T states, counts
> > + * in deep C-states, and the TSC hasn't been marked unstable, prefer
> > + * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
> > + * that TSC is chosen as the clocksource. Note, the TSC unstable check
> > + * exists purely to honor the TSC being marked unstable via command
> > + * line, any runtime detection of an unstable will happen after this.
> > + */
> > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > + !check_tsc_unstable()) {
> > + kvm_clock.rating = 299;
> > + pr_warn("kvm-clock: Using native sched_clock\n");
>
> The warn is in the right direction but probably should say TSC still cannot be
> trusted 100%.
Heh, I didn't mean to include the printks, they were purely for my own debugging.
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 0864b314c26a..9baffb425386 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
> > unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
> > unsigned int crystal_khz;
> >
> > - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> > + /*
> > + * Ignore the vendor when running as a VM, if the hypervisor provides
> > + * garbage CPUID information then the vendor is also suspect.
> > + */
> > + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> > + !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > return 0;
> >
> > if (boot_cpu_data.cpuid_level < 0x15)
> > @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
> > return 0;
> >
> > /*
> > - * For Atom SoCs TSC is the only reliable clocksource.
> > - * Mark TSC reliable so no watchdog on it.
> > + * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
> > + * VM, any watchdog is going to be less reliable than the TSC as the
> > + * watchdog source will be emulated in software. In both cases, mark
> > + * the TSC reliable so that no watchdog runs on it.
> > */
> > - if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> > + if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> > + boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> > setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> >
> > #ifdef CONFIG_X86_LOCAL_APIC
>
> It looks all wrong if a function called native_* sprinkles a bunch of "am
> I a guest" checks. I guess we should split it into VM and native variants.
I agree the naming is weird, but outside of the vendor checks, the VM code is
identical to the "native" code, so I don't know that it's worth splitting into
multiple functions.
What if we simply rename it to calibrate_tsc_from_cpuid()?
> But yeah, the general direction is ok once we agree on what we do when
> exactly.
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-16 16:56 ` Sean Christopherson
@ 2025-01-17 20:28 ` Borislav Petkov
2025-01-17 20:59 ` Sean Christopherson
0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2025-01-17 20:28 UTC (permalink / raw)
To: Sean Christopherson
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> It's only with SNP and TDX that the clocksource becomes at all interesting.
So basically you're saying, let's just go ahead and trust the TSC when the HV
sets a bunch of CPUID bits.
But we really really trust it when the guest type is SNP+STSC or TDX since
there the HV is out of the picture and the only one who can flub it there is
the OEM.
> CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> would ideally assert that the kernel doesn't switch to some other calibration
> method too. Not sure where to hook into that though, without bleeding TDX and
> SNP details everywhere.
We could use the platform calibrate* function pointers and assign TDX- or
SNP-specific ones and perhaps even define new such function ptrs. That's what
the platform stuff is for... needs staring, ofc.
> I agree the naming is weird, but outside of the vendor checks, the VM code is
> identical to the "native" code, so I don't know that it's worth splitting into
> multiple functions.
>
> What if we simply rename it to calibrate_tsc_from_cpuid()?
This is all wrong layering with all those different guest types having their
own ->calibrate_tsc:
arch/x86/kernel/cpu/acrn.c:32: x86_platform.calibrate_tsc = acrn_get_tsc_khz;
arch/x86/kernel/cpu/mshyperv.c:424: x86_platform.calibrate_tsc = hv_get_tsc_khz;
arch/x86/kernel/cpu/vmware.c:419: x86_platform.calibrate_tsc = vmware_get_tsc_khz;
arch/x86/kernel/jailhouse.c:213: x86_platform.calibrate_tsc = jailhouse_get_tsc;
arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz;
arch/x86/kernel/tsc.c:944: tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/tsc.c:1458: tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/x86_init.c:148: .calibrate_tsc = native_calibrate_tsc,
arch/x86/xen/time.c:569: x86_platform.calibrate_tsc = xen_tsc_khz;
What you want sounds like a redesign to me considering how you want to keep
the KVM guest code and baremetal pretty close... Hmmm, needs staring...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-17 20:28 ` Borislav Petkov
@ 2025-01-17 20:59 ` Sean Christopherson
2025-01-21 11:32 ` Borislav Petkov
0 siblings, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2025-01-17 20:59 UTC (permalink / raw)
To: Borislav Petkov
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Fri, Jan 17, 2025, Borislav Petkov wrote:
> On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> > It's only with SNP and TDX that the clocksource becomes at all interesting.
>
> So basically you're saying, let's just go ahead and trust the TSC when the HV
> sets a bunch of CPUID bits.
Sort of. It's not a trust thing though. The Xen, KVM, and VMware PV clocks are
all based on TSC, i.e. we already "trust" the hypervisor to not muck with TSC.
The purpose of such PV clocks is to account, in software, for things that aren't
handled in hardware, e.g. to provide a constant counter on hardware without a
constant TSC.
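To make "based on TSC" concrete, here is a minimal sketch of the pvclock read
path being described (structure layout per the pvclock ABI; simplified as a
standalone function, ignoring the version/seqlock dance and the flags):

    #include <stdint.h>

    /* Simplified pvclock_vcpu_time_info, as shared by the hypervisor. */
    struct pvclock_vcpu_time_info {
            uint32_t version;
            uint32_t pad0;
            uint64_t tsc_timestamp;      /* TSC value at the last host update */
            uint64_t system_time;        /* nanoseconds at that same instant */
            uint32_t tsc_to_system_mul;  /* TSC -> ns scale, 32.32 fixed point */
            int8_t   tsc_shift;          /* pre-scale shift applied to the delta */
            uint8_t  flags;
            uint8_t  pad[2];
    };

    /*
     * Guest-side read: the guest still executes RDTSC itself and then scales
     * the result with parameters the hypervisor keeps up to date, which is why
     * the PV clock is only as trustworthy as whoever writes this structure.
     */
    static uint64_t pvclock_read_ns(const struct pvclock_vcpu_time_info *info,
                                    uint64_t tsc)
    {
            uint64_t delta = tsc - info->tsc_timestamp;

            if (info->tsc_shift >= 0)
                    delta <<= info->tsc_shift;
            else
                    delta >>= -info->tsc_shift;

            /* 64x32 multiply, keep the integer part of the 32.32 product. */
            return info->system_time +
                   (uint64_t)(((__uint128_t)delta * info->tsc_to_system_mul) >> 32);
    }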
The proposal here, and what kvmclock already does for clocksource, is to use the
raw TSC when the hypervisor sets bits that effectively say that the massaging of
TSC done by the PV clock isn't needed.
> But we really really trust it when the guest type is SNP+STSC or TDX since
> there the HV is out of the picture and the only one who can flub it there is
> the OEM.
Yep. This one _is_ about trust. Specifically, we trust the raw TSC more than
any clock that is provided by the HV.
> > CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> > would ideally assert that the kernel doesn't switch to some other calibration
> > method too. Not sure where to hook into that though, without bleeding TDX and
> > SNP details everywhere.
>
> We could use the platform calibrate* function pointers and assign TDX- or
> SNP-specific ones and perhaps even define new such function ptrs. That's what
> the platform stuff is for... needs staring, ofc.
>
> > I agree the naming is weird, but outside of the vendor checks, the VM code is
> > identical to the "native" code, so I don't know that it's worth splitting into
> > multiple functions.
> >
> > What if we simply rename it to calibrate_tsc_from_cpuid()?
>
> This is all wrong layering with all those different guest types having their
> own ->calibrate_tsc:
>
> arch/x86/kernel/cpu/acrn.c:32: x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> arch/x86/kernel/cpu/mshyperv.c:424: x86_platform.calibrate_tsc = hv_get_tsc_khz;
> arch/x86/kernel/cpu/vmware.c:419: x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> arch/x86/kernel/jailhouse.c:213: x86_platform.calibrate_tsc = jailhouse_get_tsc;
> arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> arch/x86/kernel/tsc.c:944: tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/tsc.c:1458: tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/x86_init.c:148: .calibrate_tsc = native_calibrate_tsc,
> arch/x86/xen/time.c:569: x86_platform.calibrate_tsc = xen_tsc_khz;
>
> What you want sounds like a redesign to me considering how you want to keep
> the KVM guest code and baremetal pretty close... Hmmm, needs staring...
It's not KVM guest code though. The CPUID stuff is Intel's architecturally
defined behavior. There are oodles and oodles of features that are transparently
emulated by the hypervisor according to hardware specifications. Generally
speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
etc.
What I am proposing is that, for TDX especially, instead of relying on the hypervisor
to use a paravirtual channel for communicating the TSC frequency, we rely on the
hardware-defined way of getting the frequency, because CPUID is emulated by the
trusted entity, i.e. the OEM.
Hmm, though I suppose I'm arguing against myself in that case. If the hypervisor
provides the frequency and there are no trust issues, why would we care if the
kernel gets the frequency via CPUID or the PV channel. It's really only TDX that
matters. And we could handle TDX by overriding .calibrate_tsc() in tsc_init(),
same as Secure TSC.
That said, I do think it makes sense to either override the vendor and F/M/S
checks in native_calibrate_tsc(), or, even better, drop the initial vendor check
entirely, because both Intel and AMD have a rich history of implementing each
other's CPUID leaves. I.e. I see no reason to ignore CPUID 0x15 just because
the CPU isn't Intel.
As for the Goldmont F/M/S check, that one is a virtualization-specific thing.
The argument is that when running as a guest, any non-TSC clocksource is going
to be emulated by the hypervisor, and therefore is going to be less reliable than
TSC. I.e. putting a watchdog on TSC does more harm than good, because what ends
up happening is the TSC gets marked unreliable because the *watchdog* is unreliable.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-15 21:37 ` Sean Christopherson
2025-01-16 16:25 ` Borislav Petkov
@ 2025-01-21 3:59 ` Nikunj A. Dadhania
2025-01-28 5:41 ` Nikunj A Dadhania
2 siblings, 0 replies; 49+ messages in thread
From: Nikunj A. Dadhania @ 2025-01-21 3:59 UTC (permalink / raw)
To: Sean Christopherson
Cc: Borislav Petkov, linux-kernel, thomas.lendacky, x86, kvm, mingo,
tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On 1/16/2025 3:07 AM, Sean Christopherson wrote:
> My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked stable via command line. I.e. use the same criteria as
> tweaking the clocksource rating. As above, sched_clock is more tolerant of slop
> than clocksource, so it's a bit ridiculous to care whether the TSC or kvmclock
> (or something else entirely) is used for the clocksource.
>
> If we wanted to go with a more conservative approach, e.g. to minimize the risk
> of breaking existing setups, we could also condition the change on the TSC being
> reliable and having a known frequency. I.e. require SNP's Secure TSC, or require
> the hypervisor to enumerate the TSC frequency via CPUID. I don't see a ton of
> value in that approach though, and long-term it would lead to some truly weird
> code due to holding sched_clock to a higher standard than clocksource.
>
> But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.
>
> The "obvious" solution is to leave the calibration functions as-is if the TSC has
> a known, reliable frequency, but even _that_ is riddled with problems, because
> as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
> runs *after* init_hypervisor_platform(). SNP Secure TSC fudges around this by
> overriding the calibration routines, but that's a bit gross and easy to fix if we
> also fix TDX. And fixing TDX by running native_calibrate_tsc() would give the
> same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.
One change I wasn't sure about was non-Intel guests using CPUID 0x15H/0x16H,
which you have already answered in the subsequent emails. At present, AMD platforms
do not implement either of them and will bail out at the below check:
/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
if (ebx_numerator == 0 || eax_denominator == 0)
return 0;
> All in all, I'm thinking something like this (across multiple patches):
Tested on AMD Milan and changes are working as intended:
For SecureTSC guests with TSC_INVARIANT set:
* Raw TSC is used as clocksource and sched-clock
* Calibration is done using the GUEST_TSC_FREQ MSR (sketch below)
For non-SecureTSC guests with TSC_INVARIANT set:
* Raw TSC is used as clocksource and sched-clock
* Calibration is done using kvm-clock
For non-SecureTSC guests without TSC_INVARIANT:
* kvm-clock(based on TSC) is used as clocksource and sched-clock
* Calibration is done using kvm-clock
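FWIW, the GUEST_TSC_FREQ based calibration mentioned above boils down to roughly
this (sketch based on the series, so the exact helper names may differ; the MSR
reports the guest TSC frequency in MHz):

	static unsigned long snp_tsc_freq_khz;

	/* Cache the guest TSC frequency reported by the GUEST_TSC_FREQ MSR. */
	static void __init securetsc_cache_tsc_freq(void)
	{
		u64 freq_mhz;

		rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, freq_mhz);
		snp_tsc_freq_khz = (unsigned long)freq_mhz * 1000;
	}

	/* Used as the x86_platform.calibrate_tsc/calibrate_cpu callback. */
	static unsigned long securetsc_get_tsc_khz(void)
	{
		return snp_tsc_freq_khz;
	}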
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
> unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
> unsigned int crystal_khz;
>
> - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> + /*
> + * Ignore the vendor when running as a VM, if the hypervisor provides
> + * garbage CPUID information then the vendor is also suspect.
> + */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> + !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return 0;
>
> if (boot_cpu_data.cpuid_level < 0x15)
Regards
Nikunj
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-17 20:59 ` Sean Christopherson
@ 2025-01-21 11:32 ` Borislav Petkov
0 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2025-01-21 11:32 UTC (permalink / raw)
To: Sean Christopherson
Cc: Nikunj A. Dadhania, linux-kernel, thomas.lendacky, x86, kvm,
mingo, tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
On Fri, Jan 17, 2025 at 12:59:37PM -0800, Sean Christopherson wrote:
> The proposal here, and what kvmclock already does for clocksource, is to use the
> raw TSC when the hypervisor sets bits that effectively say that the massaging of
> TSC done by the PV clock isn't needed.
>
> > But we really really trust it when the guest type is SNP+STSC or TDX since
> > there the HV is out of the picture and the only one who can flub it there is
> > the OEM.
>
> Yep. This one _is_ about trust. Specifically, we trust the raw TSC more than
> any clock that is provided by the HV.
Ok, makes sense.
> It's not KVM guest code though. The CPUID stuff is Intel's architecturally
> defined behavior. There are oodles and oodles of features that are transparently
> emulated by the hypervisor according to hardware specifications. Generally
> speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
> etc.
Uuuh, this is calling for confusion.
native_* has always been stuff the kernel calls when it runs, well, natively,
on baremetal. And not some PV or whatever-enlightened ops.
Now, if you want to emulate those transparently that's fine but this is what
native means in arch/x86/. Unless I've missed some memo but I doubt it. And
I asked around :)
> What I am proposing is that, for TDX especially, instead of relying on the
> hypervisor to use a paravirtual channel for communicating the TSC frequency,
> we rely on the hardware-defined way of getting the frequency, because CPUID
> is emulated by the trusted entity, i.e. the OEM.
I wouldn't trust the OEM with a single bit but that's a different story. :-P
> Hmm, though I suppose I'm arguing against myself in that case. If the
> hypervisor provides the frequency and there are no trust issues, why would
> we care if the kernel gets the frequency via CPUID or the PV channel. It's
> really only TDX that matters. And we could handle TDX by overriding
> .calibrate_tsc() in tsc_init(), same as Secure TSC.
I guess it all boils down to the trust level you want to have with the TSC.
I don't know whether there's even a HV which tries to lie to the guest with
the TSC and I guess we'll find out eventually but for now we could treat the
two things similarly. The two things being:
- TDX and STSC SNP guests - full TSC trust
- HV with the required CPUID bits set - almost full, we assume HV is
benevolent. Could change if we notice something.
> That said, I do think it makes sense to either override the vendor and F/M/S
> checks in native_calibrate_tsc(), or, even better, drop the initial vendor check
> entirely,
I have no clue why that thing is even there:
aa297292d708 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")
I guess back then it meant that only Intel sported those new CPUID leaves but
those things do change.
> because both Intel and AMD have a rich history of implementing each other's
> CPUID leaves. I.e. I see no reason to ignore CPUID 0x15 just because the
> CPU isn't Intel.
>
> As for the Goldmont F/M/S check, that one is a virtualization-specific
> thing. The argument is that when running as a guest, any non-TSC
> clocksource is going to be emulated by the hypervisor, and therefore is
> going to be less reliable than TSC. I.e. putting a watchdog on TSC does
> more harm than good, because what ends up happening is the TSC gets marked
> unreliable because the *watchdog* is unreliable.
So I think the best thing to do is to carve out the meat of that function
which is valid for both baremetal and virtualized and then have separate
helpers which call the common thing. That way you don't have to test
on *everything* when having to change it in the future for whatever reason, and
both camps can be relatively free when implementing their
idiosyncrasies.
And then it should fit in the current scheme with platform function ptrs.
I know you want to use native_calibrate_tsc() but then you have to sprinkle
"am I a guest" checks and this reminds me of the "if XEN" fiasco back then.
Also, a really nasty lying HV won't simply set X86_FEATURE_HYPERVISOR...
So I'd prefer if we fit a guest which runs on a relatively honest HV :) into
the current scheme as the kernel running simply on yet another platform, even
at the price of some small code duplication.
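I.e. something along these lines, just to show the shape (made-up, simplified
sketch; the real CPUID 0x15/0x16 handling has more fallbacks than this):

	/* Common CPUID 0x15 math, shared by the baremetal and guest paths. */
	static unsigned long calibrate_tsc_from_cpuid(void)
	{
		unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;

		if (boot_cpu_data.cpuid_level < 0x15)
			return 0;

		cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
		if (!ebx_numerator || !eax_denominator || !ecx_hz)
			return 0;

		return ecx_hz / 1000 * ebx_numerator / eax_denominator;
	}

	/* Baremetal keeps the Intel vendor check (and whatever else is Intel-only). */
	unsigned long native_calibrate_tsc(void)
	{
		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
			return 0;

		return calibrate_tsc_from_cpuid();
	}

	/* Guests skip the vendor check and mark the TSC reliable, i.e. no watchdog. */
	static unsigned long vm_calibrate_tsc(void)
	{
		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

		return calibrate_tsc_from_cpuid();
	}

and then the guest side simply assigns x86_platform.calibrate_tsc =
vm_calibrate_tsc; from init_hypervisor_platform() or wherever it fits.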
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
2025-01-15 21:37 ` Sean Christopherson
2025-01-16 16:25 ` Borislav Petkov
2025-01-21 3:59 ` Nikunj A. Dadhania
@ 2025-01-28 5:41 ` Nikunj A Dadhania
2 siblings, 0 replies; 49+ messages in thread
From: Nikunj A Dadhania @ 2025-01-28 5:41 UTC (permalink / raw)
To: Sean Christopherson
Cc: Borislav Petkov, linux-kernel, thomas.lendacky, x86, kvm, mingo,
tglx, dave.hansen, pgonda, pbonzini, francescolavra.fl,
Alexey Makhalov, Juergen Gross, Boris Ostrovsky
Sean Christopherson <seanjc@google.com> writes:
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
> unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
> unsigned int crystal_khz;
>
> - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> + /*
> + * Ignore the vendor when running as a VM, if the hypervisor provides
> + * garbage CPUID information then the vendor is also suspect.
> + */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> + !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return 0;
>
> if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
> return 0;
>
> /*
> - * For Atom SoCs TSC is the only reliable clocksource.
> - * Mark TSC reliable so no watchdog on it.
> + * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
> + * VM, any watchdog is going to be less reliable than the TSC as the
> + * watchdog source will be emulated in software. In both cases, mark
> + * the TSC reliable so that no watchdog runs on it.
> */
> - if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> + if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> + boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
One more point: with the above change, the TSC will not be marked reliable
for AMD guests, since the change only takes effect for CPUs supporting
CPUID 15H/16H. However, the watchdog should be disabled for AMD guests
as well.
Would it make sense to move this before the cpuid_level check?
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e7abcc4d02c3..2769d1598c0d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -672,6 +672,14 @@ unsigned long native_calibrate_tsc(void)
!boot_cpu_has(X86_FEATURE_HYPERVISOR))
return 0;
+ /*
+ * In a VM, any watchdog is going to be less reliable than the TSC as
+ * the watchdog source will be emulated in software. Mark the TSC
+ * reliable so that no watchdog runs on it.
+ */
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
return 0;
@@ -719,13 +727,10 @@ unsigned long native_calibrate_tsc(void)
return 0;
/*
- * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
- * VM, any watchdog is going to be less reliable than the TSC as the
- * watchdog source will be emulated in software. In both cases, mark
- * the TSC reliable so that no watchdog runs on it.
+ * For Atom SoCs TSC is the only reliable clocksource.
+ * Mark TSC reliable so no watchdog on it.
*/
- if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
- boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+ if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
#ifdef CONFIG_X86_LOCAL_APIC
Regards
Nikunj
^ permalink raw reply related [flat|nested] 49+ messages in thread
end of thread, other threads: [~2025-01-28 5:41 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-06 12:46 [PATCH v16 00/13] Add Secure TSC support for SNP guests Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 01/13] virt: sev-guest: Remove is_vmpck_empty() helper Nikunj A Dadhania
2025-01-07 18:38 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 02/13] virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL Nikunj A Dadhania
2025-01-07 18:40 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 03/13] x86/sev: Carve out and export SNP guest messaging init routines Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 04/13] x86/sev: Relocate SNP guest messaging routines to common code Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 05/13] x86/sev: Add Secure TSC support for SNP guests Nikunj A Dadhania
2025-01-07 10:42 ` Borislav Petkov
2025-01-07 11:43 ` Nikunj A. Dadhania
2025-01-07 12:37 ` Borislav Petkov
2025-01-07 18:53 ` Tom Lendacky
2025-01-07 19:18 ` Borislav Petkov
2025-01-08 7:47 ` Nikunj A. Dadhania
2025-01-08 8:05 ` Borislav Petkov
2025-01-08 8:37 ` Nikunj A. Dadhania
2025-01-08 8:43 ` Borislav Petkov
2025-01-07 19:46 ` Tom Lendacky
2025-01-07 19:53 ` Borislav Petkov
2025-01-06 12:46 ` [PATCH v16 06/13] x86/sev: Change TSC MSR behavior for Secure TSC enabled guests Nikunj A Dadhania
2025-01-07 20:09 ` Tom Lendacky
2025-01-06 12:46 ` [PATCH v16 07/13] x86/sev: Prevent GUEST_TSC_FREQ MSR interception " Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 08/13] x86/sev: Prevent RDTSC/RDTSCP " Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 09/13] x86/sev: Mark Secure TSC as reliable clocksource Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 10/13] x86/tsc: Switch Secure TSC guests away from kvm-clock Nikunj A Dadhania
2025-01-07 15:16 ` Borislav Petkov
2025-01-08 10:45 ` Nikunj A. Dadhania
2025-01-06 12:46 ` [PATCH v16 11/13] x86/tsc: Upgrade TSC clocksource rating for guests Nikunj A Dadhania
2025-01-07 17:51 ` Borislav Petkov
2025-01-06 12:46 ` [PATCH v16 12/13] x86/tsc: Switch to native sched clock Nikunj A Dadhania
2025-01-07 19:37 ` Borislav Petkov
2025-01-08 5:20 ` Nikunj A. Dadhania
2025-01-08 8:22 ` Borislav Petkov
2025-01-08 8:34 ` Nikunj A. Dadhania
2025-01-08 10:20 ` Nikunj A. Dadhania
2025-01-08 14:00 ` Sean Christopherson
2025-01-08 15:34 ` Borislav Petkov
2025-01-08 17:02 ` Sean Christopherson
2025-01-08 19:53 ` Borislav Petkov
2025-01-09 6:32 ` Nikunj A. Dadhania
2025-01-15 21:37 ` Sean Christopherson
2025-01-16 16:25 ` Borislav Petkov
2025-01-16 16:56 ` Sean Christopherson
2025-01-17 20:28 ` Borislav Petkov
2025-01-17 20:59 ` Sean Christopherson
2025-01-21 11:32 ` Borislav Petkov
2025-01-21 3:59 ` Nikunj A. Dadhania
2025-01-28 5:41 ` Nikunj A Dadhania
2025-01-06 12:46 ` [PATCH v16 13/13] x86/sev: Allow Secure TSC feature for SNP guests Nikunj A Dadhania