Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [RFC PATCH 05/15] x86/virt/tdx: Move tdx_tdr_pa() up in the file
From: Edgecombe, Rick P @ 2026-05-28 21:32 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-6-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Move the tdx_tdr_pa() in preparation for upcoming changes to use them
> during TDX bringup.
> 
> No functional change intended.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply

* Re: [RFC PATCH 06/15] x86/virt/tdx: Initialize Quoting extension during bringup
From: Edgecombe, Rick P @ 2026-05-28 21:35 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-7-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Initialize the Quoting extension and fetch its metadata during TDX
> bringup.
> 
> Because Quoting is an optional TDX feature, do not let its
> initialization failures cause TDX bringup to fail.
> 
> This patch
> 

Don't say "this patch" in tip logs. The patch is a temporary format, and some
x86 maintainers hate the term in logs.

>  does not include the opt-in portion of the initialization.
> It mainly lays the groundwork for TDX Quoting support. Opt-in will be
> added in a follow-up patch once the feature can be properly used by the
> system.

This could be imperative mood.

> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>


^ permalink raw reply

* Re: [RFC PATCH 07/15] x86/virt/tdx: Prepare Quote buffer during extension bringup
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-8-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> The host uses a Quote buffer to communicate with the TDX module when
> generating Quotes.
> 

Can this be put in common terms. This is going to mean nothing to someone
reading this that doesn't already know the feature.

>  Because the Quote buffer is shared with TDX guests,

Why capitalize "Quote"?

> prepare the required metadata during Quoting extension bringup.

What does prepare the required metadata mean?

How does it being shared with TDX guest suggest this? Just that TDX guests will
need them? Is the reason just that only one is needed, so do it during global
init? 

> 
> This mostly involves determining the physical addresses of the Quote
> buffer pages and arranging them in the HPA_LINKED_LIST format defined by
> the Intel TDX Module ABI specification.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 85 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fb84fb6d952b..9d04293394d7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -32,6 +32,7 @@
>  #include <linux/idr.h>
>  #include <linux/kvm_types.h>
>  #include <linux/bitfield.h>
> +#include <linux/vmalloc.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -61,6 +62,13 @@ static LIST_HEAD(tdx_memlist);
>  static struct tdx_sys_info tdx_sysinfo __ro_after_init;
>  static bool tdx_module_initialized __ro_after_init;
>  
> +static struct quote_data {
> +	void *buf;
> +	u64 buf_len;
> +	u64 *hpa_list;
> +	phys_addr_t hpa_list_pa;
> +} quote_data;

Hmm, I think this should separate the type and variable declaration. It's not a
common pattern. I don't think there is an official rule.

> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1205,9 +1213,78 @@ static inline u64 tdx_tdr_pa(struct tdx_td *td)
>  	return page_to_phys(td->tdr_page);
>  }
>  
> +#define HPAS_PER_PAGE			(PAGE_SIZE / sizeof(u64))
> +
> +static int tdx_quote_create_buf(unsigned int nr_pages, struct quote_data *qdata)
> +{
> +	unsigned long pfn;
> +	u64 qlist_npages;
> +	int err, i, j;
> +	u64 *qlist;
> +	void *qbuf;
> +
> +	if (!nr_pages)
> +		return -EINVAL;
> +
> +	/* The last entry of a linked list page points to the next page	*/
> +	qlist_npages = (u64)DIV_ROUND_UP(nr_pages, HPAS_PER_PAGE - 1);
> +
> +	qlist = vmalloc_array(qlist_npages, PAGE_SIZE);
> +	if (!qlist) {
> +		err = -ENOMEM;
> +		goto out_err;

Just return ENOMEM here. vfree() doesn't do any work if passed NULL, but it's
weird flow.

> +	}
> +
> +	/*
> +	 * Make sure unfilled entries are always -1, which means NULL in TDX.

Huh?

> +	 * Only the last page needs to be filled. All the other pages will be
> +	 * fully populated.
> +	 */
> +	memset((u8 *)qlist + (qlist_npages - 1) * PAGE_SIZE, 0xff, PAGE_SIZE);

What are the entries? And what is a -1 in u8? Or is it supposed to be u64?
Please make this a lot clearer.

> +
> +	qbuf = vcalloc(nr_pages, PAGE_SIZE);
> +	if (!qbuf) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	/* Populate HPA_LINKED_LIST as per TDX ABI spec */
> +	for (i = 0, j = 0; j < nr_pages; i++) {
> +		if ((i % HPAS_PER_PAGE) == HPAS_PER_PAGE - 1) {
> +			/*
> +			 * The last entry always points to the next page. The
> +			 * address of the following entry must be on next page's
> +			 * boundary.
> +			 */

Can you maybe just explain this format that you are building in like one
sentence at the beginning of the function? "The quote buffer is passed to the
tdx module in a format that like... (some common terms that have no TDX
jargon)."

> +			pfn = vmalloc_to_pfn(&qlist[i + 1]);
> +			qlist[i] = PFN_PHYS(pfn);
> +			continue;
> +		}
> +
> +		pfn = vmalloc_to_pfn((u8 *)qbuf + j * PAGE_SIZE);
> +		qlist[i] = PFN_PHYS(pfn);
> +		j++;
> +	}
> +
> +	qdata->buf = qbuf;
> +	qdata->buf_len = (u64)nr_pages * PAGE_SIZE;
> +	qdata->hpa_list = qlist;
> +
> +	pfn = vmalloc_to_pfn(qlist);

Do we need a vmalloc_to_pa() helper? Maybe put it in terms of tdx format. Like
vmalloc_pfn_to_tdxpa() and keep it here? The tdx update stuff does this a bunch
too.

> +	qdata->hpa_list_pa = PFN_PHYS(pfn);
> +
> +	return 0;
> +
> +out_err:
> +	vfree(qlist);
> +
> +	return err;

It only returns -ENOMEM, so do we need the err var?

> +}
> +
>  static void tdx_quote_init(void)
>  {
>  	struct tdx_module_args args = {};
> +	unsigned int nr_quote_pages;
>  	u64 r;
>  
>  	do {
> @@ -1218,7 +1295,13 @@ static void tdx_quote_init(void)
>  		return;
>  
>  	/* Quoting metadata is valid only after initialization */
> -	get_tdx_sys_info_quote(&tdx_sysinfo.quote);
> +	if (get_tdx_sys_info_quote(&tdx_sysinfo.quote))
> +		return;

How come this patch gets error handling? Why is it needed now when it wasn't
before?

> +
> +	nr_quote_pages = PAGE_ALIGN(tdx_sysinfo.quote.max_quote_size) /
> +			 PAGE_SIZE;
> +	if (tdx_quote_create_buf(nr_quote_pages, &quote_data))
> +		pr_err("Failed to create quote buffer\n");

Err... what happens in ENOMEM scenario? NULL pointer later?

>  }
>  
>  /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */


^ permalink raw reply

* Re: [RFC PATCH 09/15] x86/virt/tdx: Add interface to generate a Quote
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-10-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> +void *tdx_quote_generate(struct tdx_td *td, void *in_data, u32 in_data_len,
> +			 u32 *quote_len)
> +{
> +	void *quote_dup = NULL;
> +	u64 r, out_len;
> +
> +	if (!tdx_quote_enabled())
> +		return NULL;
> +
> +	/* TDH.QUOTE.GET expects the input data to fit in a page */
> +	if (in_data_len > PAGE_SIZE)
> +		return NULL;

Do we really need this check? We can't trust the caller to pass the right size?

> +
> +	mutex_lock(&tdx_quote_lock);
> +
> +	/*
> +	 * Use the first page of the quote buffer for input data. The buffer
> +	 * must be at least one page in size. @in_data may not be page-aligned,
> +	 * but TDH.QUOTE.GET expects page-aligned addresses.
> +	 */
> +	memcpy(quote_data.buf, in_data, (size_t)in_data_len);
> +
> +	r = tdx_quote_get(td, quote_data.hpa_list[0], (u64)in_data_len,
> +			  quote_data.hpa_list_pa, quote_data.buf_len, &out_len);
> +	if (r || !out_len || out_len > quote_data.buf_len)


How do these various error conditions happen?

> +		goto out;
> +
> +	/*
> +	 * The quote buffer is a shared resource, so use it only for the
> +	 * SEAMCALL and copy the data out as soon as possible.
> +	 */
> +	quote_dup = kvmemdup(quote_data.buf, out_len, GFP_KERNEL);

So at init time we allocate a vmalloc for the quote and pre-populate the
hpa_list. Then we use it every time and copy the contents to a new vmalloc.
Would it really be that hard to keep the hpa list allocation around, do a
vmalloc here and update the pfn list. Then do get quote on that and pass back
the vmalloc we just allocated? Just feels like global reuse way has extra pieces
in it. Compared to the whole quoting operation, this vmalloc_to_pfn() loop is
probably not very expensive.

> +
> +out:
> +	mutex_unlock(&tdx_quote_lock);
> +
> +	*quote_len = (u32)out_len;
> +
> +	return quote_dup;
> +}
> +EXPORT_SYMBOL_FOR_KVM(tdx_quote_generate);
> +


^ permalink raw reply

* Re: [PATCH v5 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-05-28 23:52 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgGfyb7zvZ1u1j7YLomD+JdAxnVW36gtvNG9gxgZ80vMyQ@mail.gmail.com>

Hello Ackerley,

On 5/28/2026 9:45 AM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
> Thank you Ashish!
> 
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> When SEV-SNP is enabled, all writes to memory are checked to ensure
>> integrity of SNP guest memory. This imposes performance overhead on the
>> whole system.
>>
>> RMPOPT is a new instruction that minimizes the performance overhead of
>> RMP checks on the hypervisor and on non-SNP guests by allowing RMP
>> checks to be skipped for 1GB regions of memory that are known not to
>> contain any SEV-SNP guest memory.
>>
>> Add support for performing RMP optimizations asynchronously using a
>> dedicated workqueue.
>>
>> Enable RMPOPT optimizations globally for all system RAM up to 2TB at
> 
> This should also be updated to say "Enable RMPOPT optimizations for up
> to 2TB worth of system RAM at..."
> 
> The current phrasing sounds like only addresses [0, 2TB) are allowed to
> be optimized, but actually any address [start, start + 2TB) can be
> optimized?

Yes, i will update it.

> 
>> RMP initialization time. RMP checks can initially be skipped for 1GB
>> memory ranges that do not contain SEV-SNP guest memory (excluding
>> preassigned pages such as the RMP table and firmware pages). As SNP
>> guests are launched, RMPUPDATE will disable the corresponding RMPOPT
>> optimizations.
>>
>> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> ---
>>  arch/x86/virt/svm/sev.c | 167 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 164 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
>> index 82f9dc7a57c3..8876cac052d5 100644
>> --- a/arch/x86/virt/svm/sev.c
>> +++ b/arch/x86/virt/svm/sev.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/iommu.h>
>>  #include <linux/amd-iommu.h>
>>  #include <linux/nospec.h>
>> +#include <linux/workqueue.h>
>>
>>  #include <asm/sev.h>
>>  #include <asm/processor.h>
>> @@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
>>  static u64 probed_rmp_base, probed_rmp_size;
>>
>>  static cpumask_t rmpopt_cpumask;
>> -static phys_addr_t rmpopt_pa_start;
>> +static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
>> +
>> +enum rmpopt_function {
>> +	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
>> +	RMPOPT_FUNC_REPORT_STATUS
>> +};
>> +
>> +#define RMPOPT_WORK_TIMEOUT	10000
>> +
>> +static struct workqueue_struct *rmpopt_wq;
>> +static struct delayed_work rmpopt_delayed_work;
>> +static DEFINE_MUTEX(rmpopt_wq_mutex);
>>
>>  static LIST_HEAD(snp_leaked_pages_list);
>>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>> @@ -564,12 +576,21 @@ EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>>
>>  static void rmpopt_cleanup(void)
>>  {
>> +	guard(mutex)(&rmpopt_wq_mutex);
>> +
>> +	if (!rmpopt_wq)
>> +		return;
>> +
>> +	cancel_delayed_work_sync(&rmpopt_delayed_work);
>> +	destroy_workqueue(rmpopt_wq);
>> +
>>  	cpus_read_lock();
>>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
>>  	cpus_read_unlock();
>>
>>  	cpumask_clear(&rmpopt_cpumask);
>> -	rmpopt_pa_start = 0;
>> +	rmpopt_pa_start = rmpopt_pa_end = 0;
>> +	rmpopt_wq = NULL;
>>  }
>>
>>  void snp_shutdown(void)
>> @@ -587,6 +608,105 @@ void snp_shutdown(void)
>>  }
>>  EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
>>
>> +static inline bool __rmpopt(u64 rax, u64 rcx)
> 
> Perhaps use pa_start instead of rax and op_type for rcx?
> 

I used these parameters to align with the RMPOPT specifications (rax and rcx)
which i think makes more sense.

>> +{
>> +	bool optimized;
>> +
>> +	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
>> +		     : "=@ccc" (optimized)
>> +		     : "a" (rax), "c" (rcx)
>> +		     : "memory", "cc");
>> +
>> +	return optimized;
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
> 
> And pa_start and op_type here too.
> 
>> +	__rmpopt(rax, rcx);
>> +}
>> +
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> +	rmpopt((u64)val);
>> +}
>> +
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> +	bool current_cpu_cleared = false;
>> +	phys_addr_t pa;
>> +	int this_cpu;
>> +
>> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> +		rmpopt_pa_start, rmpopt_pa_end);
>> +
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. This potentially allows the
> 
> I like the leader and follower comments below, thanks! With this
> leader/follower setup, will the followers definitely see the cached scan
> results, or might the followers still potentially not benefit from the
> caching? If it's still only "potentially", why?

I am verifying with the H/W architects if this is always going to be true or not,
will the followers always benefit from the scan results cached by the leader (first CPU)
or there is a possibility that the followers cannot see/access/get the cached results
and instead do full RMP scanning ?

> 
>> +	 * followers to use the "cached" scan results to avoid repeating
>> +	 * full scans.
>> +	 */
>> +
>> +	/*
>> +	 * Pin the worker to the current CPU for the leader loop so that
>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>> +	 * the CPU that was cleared from the cpumask.  The workqueue is
>> +	 * WQ_UNBOUND, so without pinning, the scheduler could migrate
>> +	 * the worker between the cpumask manipulation and the leader
>> +	 * loop, causing the leader to run on a different CPU while
>> +	 * this_cpu's core is skipped entirely.
>> +	 *
>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>> +	 * migration while still allowing preemption.
>> +	 *
>> +	 * Note: rmpopt_cpumask is modified here without holding
>> +	 * rmpopt_wq_mutex.  This is safe because the delayed_work
>> +	 * mechanism guarantees single-threaded execution of this
>> +	 * handler, and rmpopt_cleanup() calls cancel_delayed_work_sync()
>> +	 * to ensure handler completion before tearing down the cpumask.
>> +	 */
>> +	migrate_disable();
>> +	this_cpu = smp_processor_id();
>> +	if (cpumask_test_cpu(this_cpu, &rmpopt_cpumask)) {
>> +		cpumask_clear_cpu(this_cpu, &rmpopt_cpumask);
>> +		current_cpu_cleared = true;
>> +	}
>> +
> 
> Instead of reusing the global rmpopt_cpumask, why not make a copy of
> rmpopt_cpumask for this function? Then this function won't have to
> figure out current_cpu_cleared or restore rmpopt_cpumask at the end.
> 
> I'm thinking to also drop the test and clear, this function can just
> always clear, like
> 
>   cpumask_clear_cpu(smp_processor_id(), followers_cpumask);
> 
> and later
> 
>   on_each_cpu_mask(&followers_cpumask, ...);

That's surely a much cleaner approach. Instead of modifying global
rmpopt_cpumask and using a local copy:

  cpumask_var_t follower_mask;
  alloc_cpumask_var(&follower_mask, GFP_KERNEL);
  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();
  cpumask_clear_cpu(this_cpu, follower_mask);  // modify local only

  // leader loop on this_cpu...
  migrate_enable();

  // follower loop with follower_mask...
  on_each_cpu_mask(follower_mask, rmpopt_smp, ...);

  free_cpumask_var(follower_mask);

This eliminates:
- current_cpu_cleared variable
- The restore at the end
  
Additionally, the global rmpopt_cpumask is never modified, so no concurrency concerns with debugfs or other readers.

> 
> Actually, if for whatever reason cpumask_test_cpu(this_cpu,
> &rmpopt_cpumask) above returns false, would that mean somehow some cpu
> exists that wasn't enabled right when rmpopt was initialized? 

The work handler can always execute on a cpu which is not in the rmpopt_cpumask, so i believe the
cpumask_test_cpu() needs to be there.

The leader loop must only run on a CPU that has RMPOPT_BASE MSR programmed. If the WQ_UNBOUND scheduler puts the
handler on a CPU not in rmpopt_cpumask, that CPU's core never had RMPOPT enabled -> RMPOPT instruction causes #UD.

  So the leader should be conditional:

  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();

  if (cpumask_test_cpu(this_cpu, follower_mask)) {
      cpumask_clear_cpu(this_cpu, follower_mask);

      /* Leader: prime the RMPOPT cache on this CPU */
      for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
          rmpopt(pa);
  }

  migrate_enable();

  /* Followers: run RMPOPT on remaining cores */
  cpus_read_lock();
  for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
      on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
      cond_resched();
  }
  cpus_read_unlock();

If the current CPU isn't in rmpopt_cpumask, the leader is skipped and all cores run as followers — they lose the caching
optimization from a leader priming pass, but correctness is maintained.

Alternatively, i could pick the first CPU from rmpopt_cpumask as the explicit leader instead of relying on whichever CPU the
scheduler chose.

 	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /* Fast path: leader runs locally, no IPIs */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU does not have RMPOPT_BASE MSR programmed.
                 * Pick an explicit leader from the cpumask to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

> If yes, what happens if we call rmpopt() on a cpu where it wasn't initialized?

That will cause a #UD exception, if RMPOPT instruction is issued on 
a CPU where RMPOPT is not enabled (RMPOPT_BASE.RMPOPT_EN==0), so
it is essential to issue RMPOPT instruction only on the cpumask (covers both
primary and secondary threads) which was setup initially when rmpopt was
initialized and on which the RMPOPT_BASE MSR was setup and RMPOPT enabled.

I believe, there are actually three cases to be considered here: 

  1. Current CPU is in rmpopt_cpumask -> primary thread, run leader locally, remove from followers
  2. Current CPU's sibling is in rmpopt_cpumask -> sibling thread, RMPOPT_BASE per-core is programmed, run leader locally, 
     remove the sibling's primary from the follower mask
  3. Neither -> new/unknown CPU, RMPOPT_BASE never programmed on this core, fall back to explicit leader via IPI.

So this seems to the *correct* implementation of the RMPOPT loop: 

   	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /*
                 * Current CPU is a primary thread in rmpopt_cpumask.
                 * Run leader locally and remove from follower mask.
                 */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
                                      follower_mask)) {
                /*
                 * Current CPU is a sibling thread whose primary is in
                 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
                 * is safe to run the leader locally.  Remove the sibling's
                 * primary from the follower mask as this core is already
                 * covered by the leader.
                 */
                cpumask_andnot(follower_mask, follower_mask,
                               topology_sibling_cpumask(this_cpu));

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU's core does not have RMPOPT_BASE MSR
                 * programmed.  Pick an explicit leader from the cpumask
                 * to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

Thanks,
Ashish

> 
>> +	/* Leader: prime the RMPOPT cache on this CPU */
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> +		rmpopt(pa);
>> +
>> +	migrate_enable();
>> +
>> +	/* Followers: run RMPOPT on all other cores */
>> +	cpus_read_lock();
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> +				 (void *)pa, true);
>> +
>> +		 /* Give a chance for other threads to run */
>> +		cond_resched();
>> +	}
>> +	cpus_read_unlock();
>> +
>> +	if (current_cpu_cleared)
>> +		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
>> +}
>> +
>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-29  0:26 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Dave Hansen, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <3334a64a-9a5e-4ad5-94f3-01fef788df2e@amd.com>

On Thu, May 28, 2026 at 02:55:44PM -0500, Kalra, Ashish wrote:
> Hello Dave,
> 
> On 5/28/2026 2:50 PM, Dave Hansen wrote:
> > On 5/28/26 12:37, Kalra, Ashish wrote:
> >> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
> >>
> >>   for_each_cpu(cpu, &rmpopt_cpumask)
> >>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> > 
> > I'm glad we're on the same page finally. I just hope we can get to this
> > point more quickly next time. I started off with exactly this
> > suggestion, but someone chimed in to the thread and said it was "slower":
> > 
> >> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
> > 
> 
> Yes, actually i should have made it explicitly clear that we need to do it in
> parallel especially for issuing the RMPOPT instruction itself, as that is
> in a performance critical path (and for that we are using on_each_cpu_mask()).

So which is it? Do we need the wrmsrq_on_cpus() helper or not?

I'm confused.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-29  0:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260529002613.GEahjdJRsX2uNz0GnH@fat_crate.local>


On 5/28/2026 7:26 PM, Borislav Petkov wrote:
> On Thu, May 28, 2026 at 02:55:44PM -0500, Kalra, Ashish wrote:
>> Hello Dave,
>>
>> On 5/28/2026 2:50 PM, Dave Hansen wrote:
>>> On 5/28/26 12:37, Kalra, Ashish wrote:
>>>> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
>>>>
>>>>   for_each_cpu(cpu, &rmpopt_cpumask)
>>>>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>>>
>>> I'm glad we're on the same page finally. I just hope we can get to this
>>> point more quickly next time. I started off with exactly this
>>> suggestion, but someone chimed in to the thread and said it was "slower":
>>>
>>>> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
>>>
>>
>> Yes, actually i should have made it explicitly clear that we need to do it in
>> parallel especially for issuing the RMPOPT instruction itself, as that is
>> in a performance critical path (and for that we are using on_each_cpu_mask()).
> 
> So which is it? Do we need the wrmsrq_on_cpus() helper or not?
> 
> I'm confused.
> 

No, we don't need it, i will drop this helper function patch.

Thanks,
Ashish

^ permalink raw reply

* Re: [PATCH kernel] crypto/ccp/tsm: Enable the root port after the endpoint
From: Herbert Xu @ 2026-05-29  6:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linux-crypto, linux-kernel, Tom Lendacky, David S. Miller,
	Dan Williams, x86, linux-coco, Pratik R . Sampat, Xu Yilun,
	Aneesh Kumar K . V
In-Reply-To: <20260521074301.2369293-1-aik@amd.com>

On Thu, May 21, 2026 at 05:43:01PM +1000, Alexey Kardashevskiy wrote:
> The PCIe r7.0, chapter "6.33.8 Other IDE Rules" mandates if selective IDE
> is enabled for config requersts, a stream must be enabled on the endpoint
> before enabling it on the rootport:
> 
> ===
> For Selective IDE, the Stream must not be used until it has been enabled in
> both Partner Ports. For cases where one of the Partner Ports is a Root Port
> and Selective IDE for Configuration Requests is enabled, the other
> Partner Port must be enabled prior to the Root Port. For other scenarios,
> the mechanisms to satisfy this requirement are implementation-specific.
> ===
> 
> Do what the spec says.
> 
> Fixes: 4be423572da1 ("crypto/ccp: Implement SEV-TIO PCIe IDE (phase1)")
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> ---
>  drivers/crypto/ccp/sev-dev-tsm.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Juergen Gross @ 2026-05-29  9:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <6a7843b6-3c8a-4151-b3cf-3df59c93ce9f@suse.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 1823 bytes --]

On 28.05.26 17:50, Jürgen Groß wrote:
> On 28.05.26 15:21, Sean Christopherson wrote:
>> On Thu, May 28, 2026, Jürgen Groß wrote:
>>> On 28.05.26 15:09, Sean Christopherson wrote:
>>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>>> Please disregard this series, there is one complication sashiko made me
>>>>> aware of.
>>>>
>>>> Sashiko beat me to the punch. :-)
>>>>
>>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad 
>>>> WRMSR(MCi_CTL/STATUS)")
>>>> for a real world example of how things can and will go wrong.
>>>
>>> Yeah, with Sashiko's pointer it was easy to spot.
>>>
>>> Question now is whether the already existing cases of -errno passed as return
>>> value are wrong or on purpose.
>>
>> What are the existing cases?
>>
>>> If the latter, there should be a comment for
>>> that, otherwise they need to be fixed..
>>>
>>> Disentangling the MSR emulation return values from the "normal" ones ("return
>>> to guest"/"return to user mode") will be quite interesting with the overloaded
>>> semantics of "1".
>>
>> LOL, "interesting".
> 
> What do you think about the following idea:
> 
> Lets pass struct msr_info * down to all functions which get their return
> value passed up. Then extend msr_info with a bool "return_to_guest" (valid
> only if !host_initiated), which should be set instead of passing "1" up to
> the caller (probably using an inline helper). Then the return value could
> be 0 or -errno, and after MSR emulation the return_to_guest indicator can
> be tested if needed.

In the end I think it is less error prone to define a struct kvm_msr_ret_t
used as return type, consisting of an int and a bool with the same semantics
as above. Helpers will still be a good idea IMHO.

I'll have a try how it looks like.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* [PATCH v4 00/47] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner

Well, the number of patches in the series is going in the wrong direction,
but I'm much happier with this version, which eschews the x86_platform
overrides entirely in favor of a fixed sequence for selecting the TSC/CPU
frequency "routine".

Given that previous versions had fatal NULL pointer deref bugs that affected
VMware and Xen, this series needs testing and acks from those maintainers.

The primary goal of this series to fix flaws with SNP and TDX guests where a
PV clock provided by the untrusted hypervisor is used instead of the secure
TSC that is controlled by trusted firmware.

The secondary goal is modernize running under KVM.  Currently, KVM guests will
use TSC for clocksource, but not sched_clock.  And Linux-as-a-KVM-guest doesn't
support paravirt enumeration of the TSC/APIC frequencies, even though QEMU
provides that information by default.

The tertiary goal is to clean up the PV clock code to deduplicate logic across
hypervisors, and to hopefully make it all easier to maintain going forward.

v4 also adds a quaternary goal of cleaning up the TSC calibration code, which
was made stupidly hard to follow by hypervisor code mixing in with the native
calibration routines, instead of being implemented as a pure alternative.

Lots more background on the SNP/TDX motiviation:
https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com

As before, I deliberately omitted jailhouse-dev@googlegroups.com from the To/Cc,
as those emails bounced on v1, AFAICT nothing has changed.

Note, I deliberately didn't collect a few reviews as the patches changed quite
a bit from what was reviewed in v3.

v4:
 - Use x86_init_noop() to skip save/restore on VMware and Xen instead of
   nullifying x86_platform.{save,restore}_sched_clock_state. [Sashiko]
 - Use '0' to indicate "failure" when getting the CPU frequency from CPUID, to
   avoid using an out-param and thus make it all but impossible to
   unintentionally clobber the global cpu_khz (which v3 did). [Sashiko]
 - Rename cpuid_get_cpu_freq() => __cpu_khz_from_cpuid() to capture its
   relationship with cpu_khz_from_cpuid().
 - Compute lapic_timer_period in units of ticks, not Khz. [Sashiko]
 - Kill off x86_platform_ops.calibrate_{cpu,tsc}(), and instead use dedicated
   hooks for hypervisor code, and direct calls for TDX and SNP. [David, loosely]
 - Drop SNP's secure TSC override of _CPU_ calibration, as there's zero
   evidence it's justified or a net positive.
 - Collect reviews/acks. [David, Wei]
 - Decouple getting TSC/APIC frequencies from KVM PV CPUID from kvmclock. [David]
 - Fix an amusing number of Opportunistically misspellings. [David]
 - Set kvm_sched_clock_offset _before_ registering kvmclock as sched_clock,
   and add a comment to guard against future goofs. [Sashiko]
 - Keep "setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)" in Hyper-V's handling
   of HV_ACCESS_TSC_INVARIANT, as it's technically possible to have a VM
   with HV_ACCESS_TSC_INVARIANT but not HV_ACCESS_FREQUENCY_MSRS.  Though as
   a _very_ nice side effect of using dedicated sequencing for selecting the
   TSC frequency source, this would have naturally happened anyways. [Sashiko]

v3:
 - https://lore.kernel.org/all/20260515191942.1892718-1-seanjc@google.com
 - Collect reviews. [Michael, Thomas]
 - Use Hyper-V reference counter / refcounter instead of Hyper-V timer. [Michael]
 - Use the paravirt CPUID interface first proposed by VMware for KVM's
   "official" mechanism for communicating frequency to KVM-aware guests,
   instead of abusing Intel's CPUID leafs. [David]
 - Deal with paravirt code being moved into asm/timers.h and
   arch/x86/kernel/tsc.c.

v2:
 - https://lore.kernel.org/all/Z8YWttWDtvkyCtdJ@google.com
 - Add struct to hold the TSC CPUID output. [Boris]
 - Don't pointlessly inline the TSC CPUID helpers. [Boris]
 - Fix a variable goof in a helper, hopefully for real this time. [Dan]
 - Collect reviews. [Nikunj]
 - Override the sched_clock save/restore hooks if and only if a PV clock
   is successfully registered.
 - During resome, restore clocksources before reading persistent time.
 - Clean up more warts created by kvmclock.
 - Fix more bugs in kvmclock's suspend/resume handling.
 - Try to harden kvmclock against future bugs.

v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com

David Woodhouse (3):
  KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC
    and Bus)
  x86/kvm: Obtain TSC frequency from PV CPUID if present
  x86/xen: Obtain TSC frequency from CPUID if present

Sean Christopherson (44):
  x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
  x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
  x86/sev: Mark TSC as reliable when configuring Secure TSC
  x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
  x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
  x86/sev: Shove SNP's secure/trusted TSC frequency directly into
    "calibration"
  x86/tdx: Force TSC frequency with CPUID-based info provided by the
    TDX-Module
  x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU
    frequencies
  x86/acrn: Mark TSC frequency as known when using ACRN for calibration
  x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
  x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
  x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() =>
    native_calibrate_cpu_late()
  x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
  x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
  x86/kvm: Mark TSC as reliable when it's constant and nonstop
  x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
  x86/tsc: Add standalone helper for getting CPU frequency from CPUID
  x86/kvm: Get CPU base frequency from CPUID when it's available
  clocksource: hyper-v: Register sched_clock save/restore iff it's
    necessary
  clocksource: hyper-v: Drop wrappers to sched_clock save/restore
    helpers
  clocksource: hyper-v: Don't save/restore TSC offset when using HV
    sched_clock
  x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
  x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
  x86/paravirt: Remove unnecessary PARAVIRT=n stub for
    paravirt_set_sched_clock()
  x86/paravirt: Move handling of unstable PV clocks into
    paravirt_set_sched_clock()
  x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
  x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
  x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
  x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
  x86/paravirt: Pass sched_clock save/restore helpers during
    registration
  x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
  x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
  x86/pvclock: Mark setup helpers and related various as
    __init/__ro_after_init
  x86/pvclock: WARN if pvclock's valid_flags are overwritten
  x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
    kvmclock_init()
  timekeeping: Resume clocksources before reading persistent clock
  x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
    sched_clock
  x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
  x86/paravirt: Mark __paravirt_set_sched_clock() as __init
  x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
  x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
    TSC
  x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
  x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for
    documentation
  x86/paravirt: Move using_native_sched_clock() stub into timer.h

 Documentation/virt/kvm/x86/cpuid.rst |  12 ++
 arch/x86/coco/sev/core.c             |  21 +--
 arch/x86/coco/tdx/tdx.c              |  19 ++-
 arch/x86/include/asm/acrn.h          |   5 -
 arch/x86/include/asm/kvm_para.h      |  12 +-
 arch/x86/include/asm/sev.h           |   4 +-
 arch/x86/include/asm/tdx.h           |   2 +
 arch/x86/include/asm/timer.h         |  15 +-
 arch/x86/include/asm/tsc.h           |  11 +-
 arch/x86/include/asm/x86_init.h      |   8 +-
 arch/x86/include/uapi/asm/kvm_para.h |  11 ++
 arch/x86/kernel/cpu/acrn.c           |  10 +-
 arch/x86/kernel/cpu/mshyperv.c       |  65 +-------
 arch/x86/kernel/cpu/vmware.c         |  13 +-
 arch/x86/kernel/jailhouse.c          |   7 +-
 arch/x86/kernel/kvm.c                | 108 +++++++++++--
 arch/x86/kernel/kvmclock.c           | 208 ++++++++++++++++---------
 arch/x86/kernel/pvclock.c            |   9 +-
 arch/x86/kernel/tsc.c                | 218 +++++++++++++++++----------
 arch/x86/kernel/x86_init.c           |   2 -
 arch/x86/mm/mem_encrypt_amd.c        |   3 -
 arch/x86/xen/time.c                  |  25 ++-
 drivers/clocksource/hyperv_timer.c   |  38 +++--
 include/clocksource/hyperv_timer.h   |   2 -
 kernel/time/timekeeping.c            |   9 +-
 25 files changed, 533 insertions(+), 304 deletions(-)


base-commit: 4678d11f294de0fd295a265e02955b5d1a4a2684
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply

* [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
frequency.  In practice, this is likely one big nop, as re-calibration is
used only for SMP=n kernels, and only for hardware that is 20+ years old,
i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/tsc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index c5110eb554bc..08cf6625d484 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -946,7 +946,8 @@ void recalibrate_cpu_khz(void)
 		return;
 
 	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
+	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ))
+		tsc_khz = x86_platform.calibrate_tsc();
 	if (tsc_khz == 0)
 		tsc_khz = cpu_khz;
 	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Extract retrieval of TSC frequency information from CPUID into standalone
helpers so that TDX guest support can reuse the logic.  Provide a version
that includes the multiplier math as TDX does NOT want to use
native_calibrate_tsc()'s fallback logic that derives the TSC frequency
based on CPUID.0x16, when the core crystal frequency isn't known.

Opportunistically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
in favor of the kernel's preferred style.

No functional change intended.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/tsc.h |  8 +++++
 arch/x86/kernel/tsc.c      | 67 +++++++++++++++++++++++++-------------
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 4f7f09f50552..6cf26e62e9a6 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -83,6 +83,14 @@ static inline cycles_t get_cycles(void)
 }
 #define get_cycles get_cycles
 
+struct cpuid_tsc_info {
+	unsigned int denominator;
+	unsigned int numerator;
+	unsigned int crystal_khz;
+	unsigned int tsc_khz;
+};
+extern int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+
 extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 08cf6625d484..f7f561722efa 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -658,46 +658,67 @@ static unsigned long quick_pit_calibrate(void)
 	return delta;
 }
 
+static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+{
+	unsigned int ecx_hz, edx;
+
+	memset(info, 0, sizeof(*info));
+
+	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+		return -ENOENT;
+
+	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
+
+	if (!info->denominator || !info->numerator)
+		return -ENOENT;
+
+	/*
+	 * Note, some CPUs provide the multiplier information, but not the core
+	 * crystal frequency.  The multiplier information is still useful for
+	 * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
+	 */
+	info->crystal_khz = ecx_hz / 1000;
+	return 0;
+}
+
+int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
+{
+	if (cpuid_get_tsc_info(info) || !info->crystal_khz)
+		return -ENOENT;
+
+	info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
+	return 0;
+}
+
 /**
  * native_calibrate_tsc - determine TSC frequency
  * Determine TSC frequency via CPUID, else return 0.
  */
 unsigned long native_calibrate_tsc(void)
 {
-	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
-	unsigned int crystal_khz;
+	struct cpuid_tsc_info info;
 
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return 0;
 
-	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+	if (cpuid_get_tsc_info(&info))
 		return 0;
 
-	eax_denominator = ebx_numerator = ecx_hz = edx = 0;
-
-	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
-	cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
-
-	if (ebx_numerator == 0 || eax_denominator == 0)
-		return 0;
-
-	crystal_khz = ecx_hz / 1000;
-
 	/*
 	 * Denverton SoCs don't report crystal clock, and also don't support
 	 * CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
 	 * crystal clock.
 	 */
-	if (crystal_khz == 0 &&
-			boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
-		crystal_khz = 25000;
+	if (!info.crystal_khz && boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+		info.crystal_khz = 25000;
 
 	/*
 	 * TSC frequency reported directly by CPUID is a "hardware reported"
 	 * frequency and is the most accurate one so far we have. This
 	 * is considered a known frequency.
 	 */
-	if (crystal_khz != 0)
+	if (info.crystal_khz)
 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 
 	/*
@@ -705,15 +726,15 @@ unsigned long native_calibrate_tsc(void)
 	 * clock, but we can easily calculate it to a high degree of accuracy
 	 * by considering the crystal ratio and the CPU speed.
 	 */
-	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
 		unsigned int eax_base_mhz, ebx, ecx, edx;
 
 		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
-		crystal_khz = eax_base_mhz * 1000 *
-			eax_denominator / ebx_numerator;
+		info.crystal_khz = eax_base_mhz * 1000 *
+			info.denominator / info.numerator;
 	}
 
-	if (crystal_khz == 0)
+	if (!info.crystal_khz)
 		return 0;
 
 	/*
@@ -730,10 +751,10 @@ unsigned long native_calibrate_tsc(void)
 	 * lapic_timer_period here to avoid having to calibrate the APIC
 	 * timer later.
 	 */
-	lapic_timer_period = crystal_khz * 1000 / HZ;
+	lapic_timer_period = info.crystal_khz * 1000 / HZ;
 #endif
 
-	return crystal_khz * ebx_numerator / eax_denominator;
+	return info.crystal_khz * info.numerator / info.denominator;
 }
 
 static unsigned long cpu_khz_from_cpuid(void)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 03/47] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Move the code to mark the TSC as reliable from sme_early_init() to
snp_secure_tsc_init().  The only reader of TSC_RELIABLE is the aptly
named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
after snp_secure_tsc_init().

This will allow consolidating the handling of TSC_KNOWN_FREQ and
TSC_RELIABLE when overriding the TSC calibration routine.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c      | 2 ++
 arch/x86/mm/mem_encrypt_amd.c | 3 ---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ecd77d3217f3..ed0ac52a765e 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2037,6 +2037,8 @@ void __init snp_secure_tsc_init(void)
 	secrets = (__force struct snp_secrets_page *)mem;
 
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
 	rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
 
 	/* Extract the GUEST TSC MHZ from BIT[17:0], rest is reserved space */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 2f8c32173972..6c3af974c7c2 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -535,9 +535,6 @@ void __init sme_early_init(void)
 		 */
 		x86_init.resources.dmi_setup = snp_dmi_setup;
 	}
-
-	if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
-		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 }
 
 void __init mem_encrypt_free_decrypted_mem(void)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 04/47] x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Don't override the kernel's CPU frequency calibration routine when
registering SNP's Secure TSC calibration routine.  SNP (the architecture)
provides zero guarantees that the CPU runs at the same frequency as the
TSC.  The justification for clobbering the CPU routine was:

  Since the difference between CPU base and TSC frequency does not apply
  in this case, the same callback is being used.

but that's simply not true.  E.g. if APERF/MPERF is exposed to the VM, then
the CPU frequency absolutely does matter.

While relying on heuristics and/or the untrusted hypervisor to provide the
CPU frequency isn't ideal, it's at least not outright wrong.

Fixes: 73bbf3b0fbba ("x86/tsc: Init the TSC for Secure TSC guests")
Cc: Nikunj A Dadhania <nikunj@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ed0ac52a765e..665de1aea0ee 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2046,7 +2046,6 @@ void __init snp_secure_tsc_init(void)
 
 	snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
 
-	x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
 	x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
 
 	early_memunmap(mem, PAGE_SIZE);
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 05/47] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Move the check on having a Secure TSC to the common tsc_early_init() so
that it's obvious that having a Secure TSC is conditional, and to prepare
for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
logic looks especially weird).

No functional change intended.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c | 3 ---
 arch/x86/kernel/tsc.c    | 3 ++-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 665de1aea0ee..403dcea86452 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2025,9 +2025,6 @@ void __init snp_secure_tsc_init(void)
 	unsigned long tsc_freq_mhz;
 	void *mem;
 
-	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
-		return;
-
 	mem = early_memremap_encrypted(sev_secrets_pa, PAGE_SIZE);
 	if (!mem) {
 		pr_err("Unable to get TSC_FACTOR: failed to map the SNP secrets page.\n");
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f7f561722efa..833eed5c048a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1543,7 +1543,8 @@ void __init tsc_early_init(void)
 	if (is_early_uv_system())
 		return;
 
-	snp_secure_tsc_init();
+	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+		snp_secure_tsc_init();
 
 	if (!determine_cpu_tsc_frequencies(true))
 		return;
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 06/47] x86/sev: Shove SNP's secure/trusted TSC frequency directly into "calibration"
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

As a first step towards dropping .calibrate_{cpu,tsc}() and explicitly
defining precedence/priority for "calibration" routines, pass the secure
TSC frequency obtained from SNP firmware directly to
determine_cpu_tsc_frequencies() instead of overriding the .calibrate_tsc()
hook.

Unlike the native calibration routines, all of the paravirtual overrides,
including SNP and TDX, are constant in the sense that the frequency
provided by the hypervisor or trusted firmware is fixed, known, and always
available during early boot.  More importantly, for CoCo (SNP and TDX) VMs,
it's imperative that the kernel uses the frequency provided by the trusted
firmware, not by the untrusted hypervisor.  Enforcing the priority between
sources by carefully ordering seemingly unrelated init calls, so that the
trusted override "wins", is brittle and all but impossible to follow.

While it's rather weird, deliberately prioritize tsc_early_khz over all
else to maintain existing behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c   | 14 ++++----------
 arch/x86/include/asm/sev.h |  4 ++--
 arch/x86/kernel/tsc.c      | 19 ++++++++++++-------
 3 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 403dcea86452..bc5ae9ef74da 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -99,7 +99,6 @@ static const char * const sev_status_feat_names[] = {
  */
 static u64 snp_tsc_scale __ro_after_init;
 static u64 snp_tsc_offset __ro_after_init;
-static unsigned long snp_tsc_freq_khz __ro_after_init;
 
 DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
 DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
@@ -2014,15 +2013,10 @@ void __init snp_secure_tsc_prepare(void)
 	pr_debug("SecureTSC enabled");
 }
 
-static unsigned long securetsc_get_tsc_khz(void)
-{
-	return snp_tsc_freq_khz;
-}
-
-void __init snp_secure_tsc_init(void)
+unsigned int __init snp_secure_tsc_init(void)
 {
+	unsigned long snp_tsc_freq_khz, tsc_freq_mhz;
 	struct snp_secrets_page *secrets;
-	unsigned long tsc_freq_mhz;
 	void *mem;
 
 	mem = early_memremap_encrypted(sev_secrets_pa, PAGE_SIZE);
@@ -2043,7 +2037,7 @@ void __init snp_secure_tsc_init(void)
 
 	snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
 
-	x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
-
 	early_memunmap(mem, PAGE_SIZE);
+
+	return snp_tsc_freq_khz;
 }
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..05ebf0b73ef4 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -530,7 +530,7 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
 int snp_svsm_vtpm_send_command(u8 *buffer);
 
 void __init snp_secure_tsc_prepare(void);
-void __init snp_secure_tsc_init(void);
+unsigned int snp_secure_tsc_init(void);
 enum es_result savic_register_gpa(u64 gpa);
 enum es_result savic_unregister_gpa(u64 *gpa);
 u64 savic_ghcb_msr_read(u32 reg);
@@ -637,7 +637,7 @@ static inline int snp_send_guest_request(struct snp_msg_desc *mdesc,
 					 struct snp_guest_req *req) { return -ENODEV; }
 static inline int snp_svsm_vtpm_send_command(u8 *buffer) { return -ENODEV; }
 static inline void __init snp_secure_tsc_prepare(void) { }
-static inline void __init snp_secure_tsc_init(void) { }
+static inline unsigned int __init snp_secure_tsc_init(void) { return 0; }
 static inline void sev_evict_cache(void *va, int npages) {}
 static inline enum es_result savic_register_gpa(u64 gpa) { return ES_UNSUPPORTED; }
 static inline enum es_result savic_unregister_gpa(u64 *gpa) { return ES_UNSUPPORTED; }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 833eed5c048a..2b8f94c3fcc7 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1474,15 +1474,16 @@ static int __init init_tsc_clocksource(void)
  */
 device_initcall(init_tsc_clocksource);
 
-static bool __init determine_cpu_tsc_frequencies(bool early)
+static bool __init determine_cpu_tsc_frequencies(bool early,
+						 unsigned int known_tsc_khz)
 {
 	/* Make sure that cpu and tsc are not already calibrated */
 	WARN_ON(cpu_khz || tsc_khz);
 
 	if (early) {
 		cpu_khz = x86_platform.calibrate_cpu();
-		if (tsc_early_khz)
-			tsc_khz = tsc_early_khz;
+		if (known_tsc_khz)
+			tsc_khz = known_tsc_khz;
 		else
 			tsc_khz = x86_platform.calibrate_tsc();
 	} else {
@@ -1537,16 +1538,20 @@ static void __init tsc_enable_sched_clock(void)
 
 void __init tsc_early_init(void)
 {
+	unsigned int known_tsc_khz = 0;
+
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
 	/* Don't change UV TSC multi-chassis synchronization */
 	if (is_early_uv_system())
 		return;
 
-	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
-		snp_secure_tsc_init();
+	if (tsc_early_khz)
+		known_tsc_khz = tsc_early_khz;
+	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+		known_tsc_khz = snp_secure_tsc_init();
 
-	if (!determine_cpu_tsc_frequencies(true))
+	if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
 		return;
 	tsc_enable_sched_clock();
 }
@@ -1567,7 +1572,7 @@ void __init tsc_init(void)
 
 	if (!tsc_khz) {
 		/* We failed to determine frequencies earlier, try again */
-		if (!determine_cpu_tsc_frequencies(false)) {
+		if (!determine_cpu_tsc_frequencies(false, 0)) {
 			mark_tsc_unstable("could not calculate TSC khz");
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 07/47] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

When running as a TDX guest, explicitly set the TSC frequency to a known
value, using CPUID-based information, instead of potentially relying on a
hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15 is always
emulated by the TDX-Module, i.e. the information from CPUID is more
trustworthy than the information provided by the hypervisor.

To maintain backwards compatibility with TDX guest kernels that use native
calibration, and because it's the least awful option, retain
native_calibrate_tsc()'s stuffing of the local APIC bus period using the
core crystal frequency.  While it's entirely possible for the hypervisor
to emulate the APIC timer at a different frequency than the core crystal
frequency, the commonly accepted interpretation of Intel's SDM is that APIC
timer runs at the core crystal frequency when that latter is enumerated via
CPUID:

  The APIC timer frequency will be the processor’s bus clock or core
  crystal clock frequency (when TSC/core crystal clock ratio is enumerated
  in CPUID leaf 0x15).

If the hypervisor is malicious and deliberately runs the APIC timer at the
wrong frequency, nothing would stop the hypervisor from modifying the
frequency at any time, i.e. attempting to manually calibrate the frequency
out of paranoia would be futile.

Deliberately leave CPU frequency calibration as is, since the TDX-Module
doesn't provide any guarantees with respect to CPUID.0x16.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/tdx/tdx.c    | 20 +++++++++++++++++---
 arch/x86/include/asm/tdx.h |  2 ++
 arch/x86/kernel/tsc.c      |  3 +++
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 29b6f1ed59ec..5d7976359220 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/io.h>
 #include <linux/kexec.h>
+#include <asm/apic.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -1123,9 +1124,6 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
-	/* TSC is the only reliable clock in TDX guest */
-	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-
 	cc_vendor = CC_VENDOR_INTEL;
 
 	/* Configure the TD */
@@ -1195,3 +1193,19 @@ void __init tdx_early_init(void)
 
 	tdx_announce();
 }
+
+unsigned int __init tdx_tsc_init(void)
+{
+	struct cpuid_tsc_info info;
+
+	if (WARN_ON_ONCE(cpuid_get_tsc_freq(&info)))
+		return 0;
+
+	lapic_timer_period = info.crystal_khz * 1000 / HZ;
+
+	/* TSC is the only reliable clock in TDX guest */
+	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+	return info.tsc_khz;
+}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e5a9cf656c07..1d841d464aa4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,7 @@ struct ve_info {
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
+unsigned int __init tdx_tsc_init(void);
 
 void tdx_get_ve_info(struct ve_info *ve);
 
@@ -88,6 +89,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
 #else
 
 static inline void tdx_early_init(void) { };
+static inline unsigned int tdx_tsc_init(void) { return 0; }
 static inline void tdx_halt(void) { };
 
 static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 2b8f94c3fcc7..2603f136e29b 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -34,6 +34,7 @@
 #include <asm/topology.h>
 #include <asm/uv/uv.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -1550,6 +1551,8 @@ void __init tsc_early_init(void)
 		known_tsc_khz = tsc_early_khz;
 	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
 		known_tsc_khz = snp_secure_tsc_init();
+	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+		known_tsc_khz = tdx_tsc_init();
 
 	if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
 		return;
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 08/47] x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
instead of overriding seemingly generic platform hooks, and explicitly
priotize hypervisor-provided frequencies over native methods, but do NOT
clobber the frequency obtained from trusted firmware.  While shuffling the
hooks around is arguably "six of one, half dozen of the other", scoping
them to x86_hyper_init makes their purpose more obvious, and allows for
explicitly defining the priority of sources (as is done here).

Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/acrn.h     |  5 -----
 arch/x86/include/asm/x86_init.h |  4 ++++
 arch/x86/kernel/cpu/acrn.c      | 10 +++++++---
 arch/x86/kernel/cpu/mshyperv.c  |  6 +++---
 arch/x86/kernel/cpu/vmware.c    |  8 ++++----
 arch/x86/kernel/jailhouse.c     |  6 +++---
 arch/x86/kernel/kvmclock.c      |  6 +++---
 arch/x86/kernel/tsc.c           | 23 +++++++++++++++++++----
 arch/x86/xen/time.c             |  4 ++--
 9 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/acrn.h b/arch/x86/include/asm/acrn.h
index db42b477c41d..a892179c61c6 100644
--- a/arch/x86/include/asm/acrn.h
+++ b/arch/x86/include/asm/acrn.h
@@ -32,11 +32,6 @@ static inline u32 acrn_cpuid_base(void)
 	return 0;
 }
 
-static inline unsigned long acrn_get_tsc_khz(void)
-{
-	return cpuid_eax(ACRN_CPUID_TIMING_INFO);
-}
-
 /*
  * Hypercalls for ACRN
  *
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 6c8a6ead84f6..a4f8a4aa601d 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -120,6 +120,8 @@ struct x86_init_pci {
  * @msi_ext_dest_id:		MSI supports 15-bit APIC IDs
  * @init_mem_mapping:		setup early mappings during init_mem_mapping()
  * @init_after_bootmem:		guest init after boot allocator is finished
+ * @get_tsc_khz:		get the TSC frequency (returns 0 if frequency is unknown)
+ * @get_cpu_khz:		get the CPU frequency (returns 0 if frequency is unknown)
  */
 struct x86_hyper_init {
 	void (*init_platform)(void);
@@ -128,6 +130,8 @@ struct x86_hyper_init {
 	bool (*msi_ext_dest_id)(void);
 	void (*init_mem_mapping)(void);
 	void (*init_after_bootmem)(void);
+	unsigned int (*get_tsc_khz)(void);
+	unsigned int (*get_cpu_khz)(void);
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index dc119af83524..ad8f2da8003b 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -24,13 +24,15 @@ static u32 __init acrn_detect(void)
 	return acrn_cpuid_base();
 }
 
+static unsigned int __init acrn_get_tsc_khz(void)
+{
+	return cpuid_eax(ACRN_CPUID_TIMING_INFO);
+}
+
 static void __init acrn_init_platform(void)
 {
 	/* Install system interrupt handler for ACRN hypervisor callback */
 	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
-
-	x86_platform.calibrate_tsc = acrn_get_tsc_khz;
-	x86_platform.calibrate_cpu = acrn_get_tsc_khz;
 }
 
 static bool acrn_x2apic_available(void)
@@ -78,4 +80,6 @@ const __initconst struct hypervisor_x86 x86_hyper_acrn = {
 	.type			= X86_HYPER_ACRN,
 	.init.init_platform     = acrn_init_platform,
 	.init.x2apic_available  = acrn_x2apic_available,
+	.init.get_tsc_khz	= acrn_get_tsc_khz,
+	.init.get_cpu_khz	= acrn_get_tsc_khz,
 };
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 185d4f677ec0..733e12d5a7dd 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -395,7 +395,7 @@ static int hv_nmi_unknown(unsigned int val, struct pt_regs *regs)
 }
 #endif
 
-static unsigned long hv_get_tsc_khz(void)
+static unsigned int __init hv_get_tsc_khz(void)
 {
 	unsigned long freq;
 
@@ -573,8 +573,8 @@ static void __init ms_hyperv_init_platform(void)
 
 	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
-		x86_platform.calibrate_tsc = hv_get_tsc_khz;
-		x86_platform.calibrate_cpu = hv_get_tsc_khz;
+		x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
+		x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	}
 
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 34b73573b108..7c8cf4885e82 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -64,7 +64,7 @@ struct vmware_steal_time {
 	u64 reserved[7];
 };
 
-static unsigned long vmware_tsc_khz __ro_after_init;
+static unsigned long vmware_tsc_khz __initdata;
 static u8 vmware_hypercall_mode     __ro_after_init;
 
 unsigned long vmware_hypercall_slow(unsigned long cmd,
@@ -137,7 +137,7 @@ static inline int __vmware_platform(void)
 	return eax != UINT_MAX && ebx == VMWARE_HYPERVISOR_MAGIC;
 }
 
-static unsigned long vmware_get_tsc_khz(void)
+static unsigned int __init vmware_get_tsc_khz(void)
 {
 	return vmware_tsc_khz;
 }
@@ -419,8 +419,8 @@ static void __init vmware_platform_setup(void)
 		}
 
 		vmware_tsc_khz = tsc_khz;
-		x86_platform.calibrate_tsc = vmware_get_tsc_khz;
-		x86_platform.calibrate_cpu = vmware_get_tsc_khz;
+		x86_init.hyper.get_tsc_khz = vmware_get_tsc_khz;
+		x86_init.hyper.get_cpu_khz = vmware_get_tsc_khz;
 
 #ifdef CONFIG_X86_LOCAL_APIC
 		/* Skip lapic calibration since we know the bus frequency. */
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index f58ce9220e0f..4034e08c5f11 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -68,7 +68,7 @@ static void __init jailhouse_timer_init(void)
 	lapic_timer_period = setup_data.v1.apic_khz * (1000 / HZ);
 }
 
-static unsigned long jailhouse_get_tsc(void)
+static unsigned int __init jailhouse_get_tsc(void)
 {
 	return precalibrated_tsc_khz;
 }
@@ -210,8 +210,6 @@ static void __init jailhouse_init_platform(void)
 	x86_init.mpparse.parse_smp_cfg		= jailhouse_parse_smp_config;
 	x86_init.pci.arch_init			= jailhouse_pci_arch_init;
 
-	x86_platform.calibrate_cpu		= jailhouse_get_tsc;
-	x86_platform.calibrate_tsc		= jailhouse_get_tsc;
 	x86_platform.get_wallclock		= jailhouse_get_wallclock;
 	x86_platform.legacy.rtc			= 0;
 	x86_platform.legacy.warm_reset		= 0;
@@ -293,5 +291,7 @@ const struct hypervisor_x86 x86_hyper_jailhouse __refconst = {
 	.detect			= jailhouse_detect,
 	.init.init_platform	= jailhouse_init_platform,
 	.init.x2apic_available	= jailhouse_x2apic_available,
+	.init.get_tsc_khz	= jailhouse_get_tsc,
+	.init.get_cpu_khz	= jailhouse_get_tsc,
 	.ignore_nopv		= true,
 };
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b5991d53fc0e..ec888eef74aa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -115,7 +115,7 @@ static inline void kvm_sched_clock_init(bool stable)
  * poll of guests can be running and trouble each other. So we preset
  * lpj here
  */
-static unsigned long kvm_get_tsc_khz(void)
+static unsigned int __init kvm_get_tsc_khz(void)
 {
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	return pvclock_tsc_khz(this_cpu_pvti());
@@ -321,8 +321,8 @@ void __init kvmclock_init(void)
 	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
 
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
+	x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 2603f136e29b..362596612442 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1476,13 +1476,17 @@ static int __init init_tsc_clocksource(void)
 device_initcall(init_tsc_clocksource);
 
 static bool __init determine_cpu_tsc_frequencies(bool early,
+						 unsigned int known_cpu_khz,
 						 unsigned int known_tsc_khz)
 {
 	/* Make sure that cpu and tsc are not already calibrated */
 	WARN_ON(cpu_khz || tsc_khz);
 
 	if (early) {
-		cpu_khz = x86_platform.calibrate_cpu();
+		if (known_cpu_khz)
+			cpu_khz = known_cpu_khz;
+		else
+			cpu_khz = x86_platform.calibrate_cpu();
 		if (known_tsc_khz)
 			tsc_khz = known_tsc_khz;
 		else
@@ -1539,7 +1543,7 @@ static void __init tsc_enable_sched_clock(void)
 
 void __init tsc_early_init(void)
 {
-	unsigned int known_tsc_khz = 0;
+	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;
 
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
@@ -1547,6 +1551,9 @@ void __init tsc_early_init(void)
 	if (is_early_uv_system())
 		return;
 
+	if (x86_init.hyper.get_cpu_khz)
+		known_cpu_khz = x86_init.hyper.get_cpu_khz();
+
 	if (tsc_early_khz)
 		known_tsc_khz = tsc_early_khz;
 	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
@@ -1554,7 +1561,15 @@ void __init tsc_early_init(void)
 	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
 		known_tsc_khz = tdx_tsc_init();
 
-	if (!determine_cpu_tsc_frequencies(true, known_tsc_khz))
+	/*
+	 * If the TSC frequency is still unknown, i.e. not provided by the user
+	 * or by trusted firmware, try to get it from the hypervisor (which is
+	 * untrusted when running as a CoCo guest).
+	 */
+	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
+		known_tsc_khz = x86_init.hyper.get_tsc_khz();
+
+	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
 		return;
 	tsc_enable_sched_clock();
 }
@@ -1575,7 +1590,7 @@ void __init tsc_init(void)
 
 	if (!tsc_khz) {
 		/* We failed to determine frequencies earlier, try again */
-		if (!determine_cpu_tsc_frequencies(false, 0)) {
+		if (!determine_cpu_tsc_frequencies(false, 0, 0)) {
 			mark_tsc_unstable("could not calculate TSC khz");
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d62c14334b35..1adb44fdddb2 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -38,7 +38,7 @@
 static u64 xen_sched_clock_offset __read_mostly;
 
 /* Get the TSC speed from Xen */
-static unsigned long xen_tsc_khz(void)
+static unsigned int __init xen_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *info =
 		&HYPERVISOR_shared_info->vcpu_info[0].time;
@@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
 	static_call_update(pv_steal_clock, xen_steal_clock);
 	paravirt_set_sched_clock(xen_sched_clock);
 
-	x86_platform.calibrate_tsc = xen_tsc_khz;
+	x86_init.hyper.get_tsc_khz = xen_tsc_khz;
 	x86_platform.get_wallclock = xen_get_wallclock;
 }
 
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 09/47] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Mark the TSC frequency as known when using ACRN's PV CPUID information.
Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
the TSC freq is explicitly provided by the hypervisor.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/acrn.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index ad8f2da8003b..0303fe6a2efa 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -33,6 +33,8 @@ static void __init acrn_init_platform(void)
 {
 	/* Install system interrupt handler for ACRN hypervisor callback */
 	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
+
+	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 }
 
 static bool acrn_x2apic_available(void)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Now that all paravirt code that explicitly specifies the TSC frequency
also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.

Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
line parameter"), one of the goals of the param is to allow the refined
calibration work "to do meaningful error checking".

Note, preferring the user-provided TSC frequency over the frequency from
the hypervisor or trusted firmware, while simultaneously not treating the
user-provided frequency as gospel, is obviously incongruous.  Sweep the
problem under the rug for now to avoid opening a big can of worms that
likely doesn't have a great answer.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c       | 1 -
 arch/x86/coco/tdx/tdx.c        | 1 -
 arch/x86/kernel/cpu/acrn.c     | 2 --
 arch/x86/kernel/cpu/mshyperv.c | 1 -
 arch/x86/kernel/cpu/vmware.c   | 2 --
 arch/x86/kernel/jailhouse.c    | 1 -
 arch/x86/kernel/kvmclock.c     | 1 -
 arch/x86/kernel/tsc.c          | 9 +++++++++
 arch/x86/xen/time.c            | 1 -
 9 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index bc5ae9ef74da..72313b36b6f5 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2027,7 +2027,6 @@ unsigned int __init snp_secure_tsc_init(void)
 
 	secrets = (__force struct snp_secrets_page *)mem;
 
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 	rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 5d7976359220..ab463c2b2dab 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -1205,7 +1205,6 @@ unsigned int __init tdx_tsc_init(void)
 
 	/* TSC is the only reliable clock in TDX guest */
 	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 
 	return info.tsc_khz;
 }
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 0303fe6a2efa..ad8f2da8003b 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -33,8 +33,6 @@ static void __init acrn_init_platform(void)
 {
 	/* Install system interrupt handler for ACRN hypervisor callback */
 	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
-
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 }
 
 static bool acrn_x2apic_available(void)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 733e12d5a7dd..f8653fc05a40 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -575,7 +575,6 @@ static void __init ms_hyperv_init_platform(void)
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
 		x86_init.hyper.get_tsc_khz = hv_get_tsc_khz;
 		x86_init.hyper.get_cpu_khz = hv_get_tsc_khz;
-		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	}
 
 	if (ms_hyperv.priv_high & HV_ISOLATION) {
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 7c8cf4885e82..2d0624c66799 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -390,8 +390,6 @@ static void __init vmware_set_capabilities(void)
 {
 	setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
 	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-	if (vmware_tsc_khz)
-		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
 		setup_force_cpu_cap(X86_FEATURE_VMCALL);
 	else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index 4034e08c5f11..e4d7d9e2cd69 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -255,7 +255,6 @@ static void __init jailhouse_init_platform(void)
 	pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
 
 	precalibrated_tsc_khz = setup_data.v1.tsc_khz;
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 
 	pci_probe = 0;
 
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec888eef74aa..69752b170e0a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -117,7 +117,6 @@ static inline void kvm_sched_clock_init(bool stable)
  */
 static unsigned int __init kvm_get_tsc_khz(void)
 {
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	return pvclock_tsc_khz(this_cpu_pvti());
 }
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 362596612442..8cef918486db 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1569,6 +1569,15 @@ void __init tsc_early_init(void)
 	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
 		known_tsc_khz = x86_init.hyper.get_tsc_khz();
 
+	/*
+	 * Mark the TSC frequency as known if it was obtained from a hypervisor
+	 * or trusted firmware.  Don't mark the frequency as known if the user
+	 * specified the frequency, as the user-provided frequency is intended
+	 * as a "starting point", not a known, guaranteed frequency.
+	 */
+	if (known_tsc_khz && !tsc_early_khz)
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
 	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
 		return;
 	tsc_enable_sched_clock();
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1adb44fdddb2..487ad838c441 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -43,7 +43,6 @@ static unsigned int __init xen_tsc_khz(void)
 	struct pvclock_vcpu_time_info *info =
 		&HYPERVISOR_shared_info->vcpu_info[0].time;
 
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	return pvclock_tsc_khz(info);
 }
 
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 11/47] x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Now that getting the CPU and/or TSC frequencies from the hypervisor uses
dedicated hooks, drop x86_platform_ops.calibrate_{cpu,tsc}() and instead
directly invoke the correct helper at each phase of (re)calibration.  In
addition to eliminating unnecessary code, this makes it a bit more obvious
when the "late" path invokes pit_hpet_ptimer_calibrate_cpu() instead of
x86_platform_ops.calibrate_cpu().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/tsc.h      |  2 --
 arch/x86/include/asm/x86_init.h |  4 ----
 arch/x86/kernel/tsc.c           | 28 ++++++++++++----------------
 arch/x86/kernel/x86_init.c      |  2 --
 4 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 6cf26e62e9a6..4a224f99c3b9 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -97,8 +97,6 @@ extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern void mark_tsc_async_resets(char *reason);
-extern unsigned long native_calibrate_cpu_early(void);
-extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
 
 extern int tsc_clocksource_reliable;
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index a4f8a4aa601d..ada17827ea51 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -292,8 +292,6 @@ struct x86_hyper_runtime {
 
 /**
  * struct x86_platform_ops - platform specific runtime functions
- * @calibrate_cpu:		calibrate CPU
- * @calibrate_tsc:		calibrate TSC, if different from CPU
  * @get_wallclock:		get time from HW clock like RTC etc.
  * @set_wallclock:		set time back to HW clock
  * @iommu_shutdown:		set by an IOMMU driver for shutdown if necessary
@@ -317,8 +315,6 @@ struct x86_hyper_runtime {
  * @guest:			guest incarnations callbacks
  */
 struct x86_platform_ops {
-	unsigned long (*calibrate_cpu)(void);
-	unsigned long (*calibrate_tsc)(void);
 	void (*get_wallclock)(struct timespec64 *ts);
 	int (*set_wallclock)(const struct timespec64 *ts);
 	void (*iommu_shutdown)(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8cef918486db..5b4b6e43c94c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -696,7 +696,7 @@ int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
  * native_calibrate_tsc - determine TSC frequency
  * Determine TSC frequency via CPUID, else return 0.
  */
-unsigned long native_calibrate_tsc(void)
+static unsigned long native_calibrate_tsc(void)
 {
 	struct cpuid_tsc_info info;
 
@@ -931,7 +931,7 @@ static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
 /**
  * native_calibrate_cpu_early - can calibrate the cpu early in boot
  */
-unsigned long native_calibrate_cpu_early(void)
+static unsigned long native_calibrate_cpu_early(void)
 {
 	unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
 
@@ -945,7 +945,7 @@ unsigned long native_calibrate_cpu_early(void)
 	return fast_calibrate;
 }
 
-
+#ifndef CONFIG_SMP
 /**
  * native_calibrate_cpu - calibrate the cpu
  */
@@ -958,6 +958,7 @@ static unsigned long native_calibrate_cpu(void)
 
 	return tsc_freq;
 }
+#endif
 
 void recalibrate_cpu_khz(void)
 {
@@ -967,9 +968,9 @@ void recalibrate_cpu_khz(void)
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
 
-	cpu_khz = x86_platform.calibrate_cpu();
+	cpu_khz = native_calibrate_cpu();
 	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ))
-		tsc_khz = x86_platform.calibrate_tsc();
+		tsc_khz = native_calibrate_tsc();
 	if (tsc_khz == 0)
 		tsc_khz = cpu_khz;
 	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
@@ -1483,17 +1484,19 @@ static bool __init determine_cpu_tsc_frequencies(bool early,
 	WARN_ON(cpu_khz || tsc_khz);
 
 	if (early) {
+		/*
+		 * Early CPU calibration can only use methods that are available
+		 * early in boot (obviously).
+		 */
 		if (known_cpu_khz)
 			cpu_khz = known_cpu_khz;
 		else
-			cpu_khz = x86_platform.calibrate_cpu();
+			cpu_khz = native_calibrate_cpu_early();
 		if (known_tsc_khz)
 			tsc_khz = known_tsc_khz;
 		else
-			tsc_khz = x86_platform.calibrate_tsc();
+			tsc_khz = native_calibrate_tsc();
 	} else {
-		/* We should not be here with non-native cpu calibration */
-		WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
 		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
 	}
 
@@ -1590,13 +1593,6 @@ void __init tsc_init(void)
 		return;
 	}
 
-	/*
-	 * native_calibrate_cpu_early can only calibrate using methods that are
-	 * available early in boot.
-	 */
-	if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
-		x86_platform.calibrate_cpu = native_calibrate_cpu;
-
 	if (!tsc_khz) {
 		/* We failed to determine frequencies earlier, try again */
 		if (!determine_cpu_tsc_frequencies(false, 0, 0)) {
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index ebefb77c37bb..c674cbbd466d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -144,8 +144,6 @@ static void enc_kexec_finish_noop(void) {}
 static bool is_private_mmio_noop(u64 addr) {return false; }
 
 struct x86_platform_ops x86_platform __ro_after_init = {
-	.calibrate_cpu			= native_calibrate_cpu_early,
-	.calibrate_tsc			= native_calibrate_tsc,
 	.get_wallclock			= mach_get_cmos_time,
 	.set_wallclock			= mach_set_cmos_time,
 	.iommu_shutdown			= iommu_shutdown_noop,
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 12/47] x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() => native_calibrate_cpu_late()
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Rename the late CPU calibration routine so that its relationship to the
early routine is more obvious and intuitive.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/tsc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 5b4b6e43c94c..534462c81c78 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -779,7 +779,7 @@ static unsigned long cpu_khz_from_cpuid(void)
  * calibrate cpu using pit, hpet, and ptimer methods. They are available
  * later in boot after acpi is initialized.
  */
-static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu_late(void)
 {
 	u64 tsc1, tsc2, delta, ref1, ref2;
 	unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
@@ -954,7 +954,7 @@ static unsigned long native_calibrate_cpu(void)
 	unsigned long tsc_freq = native_calibrate_cpu_early();
 
 	if (!tsc_freq)
-		tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+		tsc_freq = native_calibrate_cpu_late();
 
 	return tsc_freq;
 }
@@ -1497,7 +1497,7 @@ static bool __init determine_cpu_tsc_frequencies(bool early,
 		else
 			tsc_khz = native_calibrate_tsc();
 	} else {
-		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+		cpu_khz = native_calibrate_cpu_late();
 	}
 
 	/*
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 13/47] x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Fold the guts of native_calibrate_cpu() into its sole remaining caller,
recalibrate_cpu_khz() to eliminate the extra SMP=n #ifdef, and so that it's
more obvious that directly invoking the early vs. late calibration routines
in determine_cpu_tsc_frequencies() is intentional.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/tsc.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 534462c81c78..3e911f0f7364 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -945,21 +945,6 @@ static unsigned long native_calibrate_cpu_early(void)
 	return fast_calibrate;
 }
 
-#ifndef CONFIG_SMP
-/**
- * native_calibrate_cpu - calibrate the cpu
- */
-static unsigned long native_calibrate_cpu(void)
-{
-	unsigned long tsc_freq = native_calibrate_cpu_early();
-
-	if (!tsc_freq)
-		tsc_freq = native_calibrate_cpu_late();
-
-	return tsc_freq;
-}
-#endif
-
 void recalibrate_cpu_khz(void)
 {
 #ifndef CONFIG_SMP
@@ -968,7 +953,10 @@ void recalibrate_cpu_khz(void)
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
 
-	cpu_khz = native_calibrate_cpu();
+	cpu_khz = native_calibrate_cpu_early();
+	if (!cpu_khz)
+		cpu_khz = native_calibrate_cpu_late();
+
 	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ))
 		tsc_khz = native_calibrate_tsc();
 	if (tsc_khz == 0)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 14/47] x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz() in anticipation of
adding support for getting TSC info from PV CPUID, i.e. in a KVM specific
way, but without non-kvmclock.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/kvmclock.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 69752b170e0a..c4a782a0c903 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -115,7 +115,7 @@ static inline void kvm_sched_clock_init(bool stable)
  * poll of guests can be running and trouble each other. So we preset
  * lpj here
  */
-static unsigned int __init kvm_get_tsc_khz(void)
+static unsigned int __init kvmclock_get_tsc_khz(void)
 {
 	return pvclock_tsc_khz(this_cpu_pvti());
 }
@@ -125,7 +125,7 @@ static void __init kvm_get_preset_lpj(void)
 	unsigned long khz;
 	u64 lpj;
 
-	khz = kvm_get_tsc_khz();
+	khz = kvmclock_get_tsc_khz();
 
 	lpj = ((u64)khz * 1000);
 	do_div(lpj, HZ);
@@ -320,8 +320,8 @@ void __init kvmclock_init(void)
 	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
 
-	x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
-	x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
+	x86_init.hyper.get_tsc_khz = kvmclock_get_tsc_khz;
+	x86_init.hyper.get_cpu_khz = kvmclock_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 15/47] KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC and Bus)
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

From: David Woodhouse <dwmw@amazon.co.uk>

Formally define and document CPUID 0x40000010 as providing TSC and local
APIC bus frequency information for KVM's PV CPUID range.  Way back in
2008, VMware proposed (https://lkml.org/lkml/2008/10/1/246) carving out a
range of CPUID leaves for use by hypervisors.  While the broader proposal
from VMware was mostly shot down in flames, use of CPUID 0x40000010 to
provide TSC and local APIC bus frequency information survived and made it's
way into multiple guest operating systems.

XNU unconditionally assumes CPUID 0x40000010 contains the frequency
information, if it's present on any hypervisor:

  https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/cpuid.c

As does FreeBSD:

  https://github.com/freebsd/freebsd-src/commit/4a432614f68

More importantly, QEMU (the de facto "reference" VMM for KVM) has
conditionally provided timing information in CPUID 0x40000010 for almost
9 years, since commit 9954a1582e ("x86-KVM: Supply TSC and APIC clock
rates to guest like VMWare").

So at this point it would be daft for KVM (or any hypervisor) to expose
0x40000010 for any *other* content.  Officially carve out and define the
CPUID leaf so that Linux-as-a-guest can follow suit and pull TSC and Local
APIC Bus frequency information from CPUID.

Defer providing userspace with the necessary information needed to
precisely and accurately enumerate the _actual_ configured TSC frequency
to the guest (that exact information, along with the scaled ratio, isn't
exposed to userspace).  As evidenced by QEMU, providing CPUID 0x40000010
without help from KVM is entirely possible, just not ideal.

Link: https://lore.kernel.org/all/ea0d7f43d910cee9600b254e303f468722fa355b.camel@infradead.org
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[sean: drop KVM filling of CPUID, add documentation, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/x86/cpuid.rst | 12 ++++++++++++
 arch/x86/include/uapi/asm/kvm_para.h | 11 +++++++++++
 2 files changed, 23 insertions(+)

diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..f02e395cfa9b 100644
--- a/Documentation/virt/kvm/x86/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -122,3 +122,15 @@ KVM_HINTS_REALTIME 0            guest checks this feature bit to
                                 preempted for an unlimited time
                                 allowing optimizations
 ================== ============ =================================
+
+function: KVM_CPUID_TIMING_INFO (0x40000010)
+
+returns::
+
+   eax = (Virtual) TSC frequency in kHz
+   ebx = (Virtual) Bus (local APIC timer) frequency in kHz
+   ecx = 0 (Reserved)
+   edx = 0 (Reserved)
+
+Note, KVM only defines the semantics of KVM_CPUID_TIMING_INFO; KVM does NOT
+advertise support via KVM_GET_SUPPORTED_CPUID.
\ No newline at end of file
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..c3a384711f3a 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -44,6 +44,17 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT	24
 
+/*
+ * The timing information leaf provides TSC and local APIC timer frequency
+ * information to the guest.  Note, userspace is responsible for filling the
+ * leaf with the correct information.
+ *
+ *  # EAX: (Virtual) TSC frequency in kHz.
+ *  # EBX: (Virtual) Bus (local APIC timer) frequency in kHz.
+ *  # ECX, EDX: Reserved (must be zero).
+ */
+#define KVM_CPUID_TIMING_INFO	0x40000010
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox