Linux-HyperV List

* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: Sean Christopherson @ 2026-05-15 23:04 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515195116.E0AFBC2BCB3@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Unintended clobbering of the global `cpu_khz` variable in `native_calibrate_tsc()` due to a missing local variable declaration.
> --
> 
> commit 36e45379d4c9f40ae24c5c323b4f0471c94c6bef
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/tsc: Add standalone helper for getting CPU frequency from CPUID
> 
> This commit extracts the logic for getting the CPU frequency via CPUID into
> a new standalone helper function. It also updates the TSC calibration code
> to use this new helper.
> 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> 
> [ ... ]
> 
> > @@ -754,13 +772,8 @@ unsigned long native_calibrate_tsc(void)
> >  	 * clock, but we can easily calculate it to a high degree of accuracy
> >  	 * by considering the crystal ratio and the CPU speed.
> >  	 */
> > -	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
> > -		unsigned int eax_base_mhz, ebx, ecx, edx;
> > -
> > -		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> > -		info.crystal_khz = eax_base_mhz * 1000 *
> > -			info.denominator / info.numerator;
> > -	}
> > +	if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
> > +		info.crystal_khz = cpu_khz * info.denominator / info.numerator;
> 
> Does this unintentionally clobber the global cpu_khz variable?

Yep.  I don't think I even realized there was such a global variable.  Probably
makes sense to use a different name for the local variables.
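
The clobbering is easy to reproduce outside the kernel. The following is a
minimal userspace sketch, not the patch itself: demo_get_cpu_freq() is a
hypothetical stand-in for cpuid_get_cpu_freq(), and all frequency values are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's global cpu_khz; the value is made up. */
unsigned int cpu_khz = 3000000;

/*
 * Hypothetical stand-in for cpuid_get_cpu_freq(): returns 0 on success and
 * stores a made-up base frequency through the pointer.
 */
static int demo_get_cpu_freq(unsigned int *khz)
{
	assert(khz != NULL);
	*khz = 2400000;
	return 0;
}

/* Buggy shape: no local declaration, so &cpu_khz names the global. */
static unsigned int crystal_khz_clobbering(unsigned int num, unsigned int den)
{
	assert(num != 0);
	if (!demo_get_cpu_freq(&cpu_khz))
		return cpu_khz * den / num;
	return 0;
}

/* Fixed shape: a distinctly named local leaves the global untouched. */
static unsigned int crystal_khz_local(unsigned int num, unsigned int den)
{
	unsigned int base_khz;

	assert(num != 0);
	if (!demo_get_cpu_freq(&base_khz))
		return base_khz * den / num;
	return 0;
}
```

Both helpers return the same crystal frequency, but only the first one
changes the global as a side effect, which is why a distinct local name
makes the mistake harder to reintroduce.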

* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: Sean Christopherson @ 2026-05-15 22:43 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515194803.1D898C2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 3d3165eef8214..21d366d01985e 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
> >  	xen_sched_clock_offset = xen_clocksource_read();
> >  	static_call_update(pv_steal_clock, xen_steal_clock);
> >  	paravirt_set_sched_clock(xen_sched_clock);
> > +	/*
> > +	 * Xen has paravirtualized suspend/resume and so doesn't use the common
> > +	 * x86 sched_clock save/restore hooks.
> > +	 */
> > +	x86_platform.save_sched_clock_state = NULL;
> > +	x86_platform.restore_sched_clock_state = NULL;
> 
> Does setting these hooks to NULL lead to an unconditional pointer dereference
> regression during suspend, hibernation, or kexec operations in Xen guests?
> 
> Looking at save_processor_state() and __restore_processor_state() in
> arch/x86/power/cpu.c, the hooks are called directly without being checked
> for NULL first:
> 
> arch/x86/power/cpu.c:save_processor_state() {
> 	__save_processor_state(&saved_context);
> 	x86_platform.save_sched_clock_state();
> }
> 
> arch/x86/power/cpu.c:__restore_processor_state() {
> 	...
> 	do_fpu_end();
> 	tsc_verify_tsc_adjust(true);
> 	x86_platform.restore_sched_clock_state();
> 	...
> }
> 
> Will operations that trigger a generic CPU state save or restore on a Xen
> guest execute a call on these NULL pointers, causing a kernel panic
> regression? Should these instead be assigned to an empty stub function to
> safely bypass them?

/facepalm

Yes.  I suspect past me carried over the Xen changes before I understood why
it's safe on Xen (Xen uses a unique PV suspend/resume flow).

Playing nice with NULL x86_platform.{save,restore}_sched_clock_state pointers
is the obvious fix.  The other option would be to wire up nop callbacks, but I
don't see any value in doing so.  I really don't want to leave the callbacks
wired up to tsc_{save,restore}_sched_clock_state() (unless it turns out VMware
actually needs them).

Regardless, this definitely highlights that VMware guests need to be tested. :-/
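
The two options weighed above (tolerating NULL hooks at the call sites versus
wiring up nop callbacks) can be sketched in userspace C. The struct and
function names below are hypothetical stand-ins, not the kernel's
x86_platform definitions or the eventual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two x86_platform sched_clock hooks. */
struct demo_platform_ops {
	void (*save_sched_clock_state)(void);
	void (*restore_sched_clock_state)(void);
};

static int demo_save_calls;

static void demo_save(void)
{
	demo_save_calls++;
}

/* Empty stub, for the alternative of never leaving a hook NULL. */
static void demo_nop(void)
{
}

/* A platform whose PV suspend flow wants no hooks: pointers left NULL. */
static const struct demo_platform_ops demo_null_ops = {
	.save_sched_clock_state = NULL,
	.restore_sched_clock_state = NULL,
};

/* A platform that keeps a save hook and stubs out restore. */
static const struct demo_platform_ops demo_hooked_ops = {
	.save_sched_clock_state = demo_save,
	.restore_sched_clock_state = demo_nop,
};

/* Call site that "plays nice" with NULL by checking before the call. */
static void demo_save_processor_state(const struct demo_platform_ops *ops)
{
	assert(ops != NULL);
	if (ops->save_sched_clock_state)
		ops->save_sched_clock_state();
}
```

With the check in the caller, a NULL hook simply means "nothing to do",
whereas the stub approach requires every platform to remember to install
one.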

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Jason Gunthorpe @ 2026-05-15 22:35 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260511162408.1180069-5-zhangyu1@linux.microsoft.com>

On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> +	unsigned long nr_pages = end_pfn - start_pfn;
> +	u16 count = 0;
> +
> +	while (nr_pages > 0) {
> +		unsigned long flush_pages;
> +		int order;
> +		unsigned long pfn_align;
> +		unsigned long size_align;
> +
> +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> +			break;
> +		}
> +
> +		if (start_pfn)
> +			pfn_align = __ffs(start_pfn);
> +		else
> +			pfn_align = BITS_PER_LONG - 1;
> +
> +		size_align = __fls(nr_pages);
> +		order = min(pfn_align, size_align);
> +		iova_list[count].page_mask_shift = order;
> +		iova_list[count].page_number = start_pfn;
> +
> +		flush_pages = 1UL << order;
> +		start_pfn += flush_pages;
> +		nr_pages -= flush_pages;
> +		count++;
> +	}

This seems like a really silly hypervisor interface. Why doesn't it
just accept a normal range? Splitting it into power of two aligned
ranges is very inefficient.

Jason
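
To make the cost of the splitting concrete, here is a minimal userspace
sketch that mirrors the loop in the quoted hunk and only counts the entries
it would emit. It assumes GCC or Clang (for the __builtin_ctzl() and
__builtin_clzl() builtins); the demo_* names are hypothetical, and the
overflow handling of the original is left out.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Index of the lowest set bit, like the kernel's __ffs(); x must be nonzero. */
static int demo_ffs(unsigned long x)
{
	assert(x != 0);
	return __builtin_ctzl(x);
}

/* Index of the highest set bit, like the kernel's __fls(); x must be nonzero. */
static int demo_fls(unsigned long x)
{
	assert(x != 0);
	return DEMO_BITS_PER_LONG - 1 - __builtin_clzl(x);
}

/*
 * Count how many naturally aligned power-of-two chunks are needed to cover
 * nr_pages pages starting at start_pfn.
 */
static unsigned int demo_count_chunks(unsigned long start_pfn,
				      unsigned long nr_pages)
{
	unsigned int count = 0;

	while (nr_pages > 0) {
		int pfn_align = start_pfn ? demo_ffs(start_pfn)
					  : DEMO_BITS_PER_LONG - 1;
		int size_align = demo_fls(nr_pages);
		int order = pfn_align < size_align ? pfn_align : size_align;

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
		count++;
	}
	return count;
}
```

An aligned 8-page range starting at PFN 0 needs a single entry, but a
contiguous 14-page range starting at PFN 1 already needs six (orders 0, 1,
2, 2, 1, 0), where a plain start/length pair would describe it in one.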

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Jason Gunthorpe @ 2026-05-15 22:31 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260511162408.1180069-4-zhangyu1@linux.microsoft.com>

On Tue, May 12, 2026 at 12:24:07AM +0800, Yu Zhang wrote:
> +/*
> + * Identity and blocking domains are static singletons: identity is a 1:1
> + * passthrough with no page table, blocking rejects all DMA. Neither holds
> + * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
> + */
> +static struct hv_iommu_domain hv_identity_domain;
> +static struct hv_iommu_domain hv_blocking_domain;
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;

Please follow the format of other drivers and statically initialize
these here instead of in C code.
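
As a rough illustration of what static initialization looks like, here is a
self-contained sketch using designated initializers. Every demo_* type and
name is hypothetical; the real struct iommu_domain has more fields, and which
of them the Hyper-V driver can actually fill in at build time is an
assumption here.

```c
#include <assert.h>
#include <stddef.h>

enum demo_domain_type {
	DEMO_DOMAIN_BLOCKED,
	DEMO_DOMAIN_IDENTITY,
};

struct demo_domain;

struct demo_domain_ops {
	int (*attach_dev)(struct demo_domain *domain, void *dev);
};

struct demo_domain {
	enum demo_domain_type type;
	const struct demo_domain_ops *ops;
};

/* Placeholder attach op; a real one would talk to the hypervisor. */
static int demo_identity_attach(struct demo_domain *domain, void *dev)
{
	(void)domain;
	(void)dev;
	return 0;
}

static const struct demo_domain_ops demo_identity_ops = {
	.attach_dev = demo_identity_attach,
};

/* Fully described at build time; no init function has to fill it in. */
static struct demo_domain demo_identity_domain = {
	.type = DEMO_DOMAIN_IDENTITY,
	.ops = &demo_identity_ops,
};
```

The fields that are known at compile time live next to the declaration, so
an init routine only has to handle whatever genuinely depends on runtime
state.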

> +static struct iommu_ops hv_iommu_ops;
> +static LIST_HEAD(hv_iommu_pci_bus_list);
> +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> +
> +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL)
> +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)

I'd prefer to see static inlines here.
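
The reasoning isn't spelled out in the review, but the usual arguments are
type checking and avoiding macro precedence surprises: the quoted macros
don't parenthesize iommu_cap, so an argument such as "a | b" would bind
incorrectly against the "&". A minimal sketch of the static inline form
follows; the DEMO_* bit positions are hypothetical, since the real
HV_IOMMU_CAP_* values are not shown in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define DEMO_IOMMU_CAP_PRESENT	(1ULL << 0)
#define DEMO_IOMMU_CAP_S1	(1ULL << 1)

/* The argument is evaluated once and converted to a 64-bit value. */
static inline bool demo_iommu_present(uint64_t iommu_cap)
{
	return iommu_cap & DEMO_IOMMU_CAP_PRESENT;
}

static inline bool demo_iommu_s1_domain_supported(uint64_t iommu_cap)
{
	return iommu_cap & DEMO_IOMMU_CAP_S1;
}
```

Returning bool also normalizes the result to true/false, so a capability
bit above bit 31 cannot be lost when a caller stores the result in an int.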

> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "detaching from domain %d\n", hv_domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
> +		local_irq_restore(flags);
> +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on detach\n");
> +		return;
> +	}
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);

FWIW the hypervisor cannot implement the Linux attach semantics if it
only has detach/attach. It must support a simple 'attach' that
atomically changes the translation. With detach you have a confusing
problem: if an error happens in the middle of the sequence, the device
is left in an unclear state. You should at least document what state
the hypervisor is supposed to leave the translation in during these
failures.

> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (pdev->ats_enabled)
> +		pci_disable_ats(pdev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);

Does the driver really need to mess with set_dma_ops ? I'd rather not
see that in new iommu drivers if at all possible :|

> +static int __init hv_initialize_static_domains(void)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +
> +	/* Default stage-1 identity domain */
> +	hv_domain = &hv_identity_domain;
> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		return ret;
> +
> +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;

identity doesn't use geometry or pgsize_bitmap

> +	/* Default stage-1 blocked domain */
> +	hv_domain = &hv_blocking_domain;
> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> +	if (ret)
> +		goto delete_blocked_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;

Nor does blocked

> +#define INTERRUPT_RANGE_START	(0xfee00000)
> +#define INTERRUPT_RANGE_END	(0xfeefffff)
> +static void hv_iommu_get_resv_regions(struct device *dev,
> +		struct list_head *head)
> +{
> +	struct iommu_resv_region *region;
> +
> +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> +				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
> +				      0, IOMMU_RESV_MSI, GFP_KERNEL);

Surprised these constants are not discovered from the hv?

> +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +}
> +
> +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> +				struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +
> +	iommu_put_pages_list(&iotlb_gather->freelist);
> +}

Full invalidation only huh?

> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};

Usually I would expect these to have their own attach
functions. Blocking in particular must have an attach op that cannot
fail. It is used to recover the device back to a known translation in
case of other cascading errors.

> +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +	IOMMU_PT_DOMAIN_OPS(x86_64),
> +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> +	.iotlb_sync = hv_iommu_iotlb_sync,
> +	.free = hv_iommu_paging_domain_free,
> +};
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +	struct pt_iommu_x86_64_cfg cfg = {};
> +
> +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> +	if (!hv_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret) {
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;

geometry shouldn't be set here, it is overridden by
pt_iommu_x86_64_init() with the exact page table configuration

> +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +	ida_init(&hv_iommu->domain_ids);
> +
> +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +	    hv_iommu->max_iova_width > 48) {
> +		pr_info("5-level paging not supported, limiting iova width to 48.\n");
> +		hv_iommu->max_iova_width = 48;
> +	}
> +
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
> +		.force_aperture = true,
> +	};
> +
> +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> +	/* Only x86 page sizes (4K/2M/1G) are supported */
> +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> +				  (SZ_4K | SZ_2M | SZ_1G);
> +	if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> +		pr_warn("unsupported page sizes masked: 0x%llx -> 0x%llx\n",
> +			hv_iommu_cap->pgsize_bitmap, hv_iommu->pgsize_bitmap);

IDK if you need this logic; the way the page table code is used, it
just sort of falls out naturally that other page sizes are ignored.

> +struct hv_iommu_domain {
> +	union {
> +		struct iommu_domain    domain;
> +		struct pt_iommu        pt_iommu;
> +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> +	};

You should retain the build assertions here

Jason

* Re: [PATCH v3 11/41] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: Sean Christopherson @ 2026-05-15 22:29 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515203447.ED0E7C2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index df95516a9d899..006e3a13500b4 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -178,8 +178,22 @@ static void kvm_register_clock(char *txt)
> >  	pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
> >  }
> >  
> > +static void kvmclock_disable(void)
> > +{
> > +	if (msr_kvm_system_time)
> > +		native_write_msr(msr_kvm_system_time, 0);
> > +}
> > +
> >  static void kvm_save_sched_clock_state(void)
> >  {
> > +	/*
> > +	 * Stop host writes to kvmclock immediately prior to suspend/hibernate.
> > +	 * If the system is hibernating, then kvmclock will likely reside at a
> > +	 * different physical address when the system awakens, and host writes
> > +	 * to the old address prior to reconfiguring kvmclock would clobber
> > +	 * random memory.
> > +	 */
> > +	kvmclock_disable();
> >  }
> 
> Can this lead to memory corruption if another clocksource overwrites the
> x86_platform.save_sched_clock_state callback?

Yes, that's why the kernel only allows one "main" hypervisor.

* Re: [PATCH v1 1/4] iommu: Move Hyper-V IOMMU driver to its own subdirectory
From: Jason Gunthorpe @ 2026-05-15 22:19 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260511162408.1180069-2-zhangyu1@linux.microsoft.com>

On Tue, May 12, 2026 at 12:24:05AM +0800, Yu Zhang wrote:
> From: Easwar Hariharan <eahariha@linux.microsoft.com>
> 
> The Hyper-V IOMMU driver currently only supports IRQ remapping.
> As it will be adding DMA remapping support, prepare a directory
> to contain all the different feature files.

Any possibility we could put the irq remapping thing under the irq
directory?

The other drivers have it here because they are co-mingled with their
iommu HW. Will hyperv have the same issue?

Jason

* Re: [PATCH v3 11/41] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: sashiko-bot @ 2026-05-15 20:34 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-12-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Moving kvmclock_disable() for the BSP to x86_platform.save_sched_clock_state leaves kvmclock enabled across hibernation if the callback is overwritten by another clocksource (e.g., Hyper-V), leading to memory corruption.
--

commit 1ada3230c82e463b5d5311eb45fc919a4e06098c
Author: Sean Christopherson <seanjc@google.com>

x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()

This commit changes the suspend logic to avoid disabling kvmclock on the BSP
during syscore_suspend(). It defers the operation to the sched_clock restore
callback and introduces a new action enum to handle different offline and
suspend cases.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index df95516a9d899..006e3a13500b4 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
[ ... ]
> @@ -178,8 +178,22 @@ static void kvm_register_clock(char *txt)
>  	pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
>  }
>  
> +static void kvmclock_disable(void)
> +{
> +	if (msr_kvm_system_time)
> +		native_write_msr(msr_kvm_system_time, 0);
> +}
> +
>  static void kvm_save_sched_clock_state(void)
>  {
> +	/*
> +	 * Stop host writes to kvmclock immediately prior to suspend/hibernate.
> +	 * If the system is hibernating, then kvmclock will likely reside at a
> +	 * different physical address when the system awakens, and host writes
> +	 * to the old address prior to reconfiguring kvmclock would clobber
> +	 * random memory.
> +	 */
> +	kvmclock_disable();
>  }

Can this lead to memory corruption if another clocksource overwrites the
x86_platform.save_sched_clock_state callback?

In environments where KVM emulates Hyper-V, both KVM and Hyper-V clocksources
might be initialized. If Hyper-V overwrites
x86_platform.save_sched_clock_state with its own callback, and previous
callbacks are not chained or invoked, kvm_save_sched_clock_state() will never
be called.

Because kvmclock clears its clocksource.suspend hook when registering as
sched_clock, the clocksource framework cannot act as a fallback.

This would leave kvmclock enabled on the BSP across hibernation. As mentioned
in the comment, could this result in the hypervisor writing to the old physical
memory address after resume and clobbering random memory?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=11

* Re: [PATCH v3 32/41] x86/tsc: Rejects attempts to override TSC calibration with lesser routine
From: sashiko-bot @ 2026-05-15 20:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-33-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Spurious WARN_ON stack trace on every boot for CoCo (SEV-SNP/TDX) guests under KVM.
--

commit e57255e4d076143a67d0e6e44f914aec31aed8c7
Author: Sean Christopherson <seanjc@google.com>

x86/tsc: Rejects attempts to override TSC calibration with lesser routine

This commit adds a sanity check when registering a TSC frequency calibration
routine to ensure the incoming routine is as robust as the outgoing one,
rejecting it with a warning if it is not. This prevents replacing a trusted
calibration source with a less trusted one.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 98bef1d06fa9d..7a261214fa3ef 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1319,8 +1319,13 @@ void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
>  
>  	if (properties & TSC_FREQUENCY_KNOWN)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)))
> +		return;
> +
>  	if (properties & TSC_RELIABLE)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> +	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_RELIABLE)))
> +		return;

Will this cause a spurious warning stack trace on every boot for CoCo
(SEV-SNP/TDX) guests under KVM?

For KVM CoCo guests, CoCo registers first via tsc_early_init(), setting both
TSC_FREQUENCY_KNOWN and X86_FEATURE_TSC_RELIABLE.

KVM then registers second via kvmclock_init(). At this point in the patch
series, KVM only passes TSC_FREQUENCY_KNOWN, so the properties & TSC_RELIABLE
check is false, and it falls through to this WARN_ON. Since CoCo already set
the capability, the condition evaluates to true.

I see this is fixed later in the series by the commit "x86/kvmclock: Mark TSC
as reliable when it's constant and nonstop", which modifies KVM to pass
TSC_FREQ_KNOWN_AND_RELIABLE on modern hardware, preventing this warning.

Could the patch ordering be adjusted to prevent this intermediate regression?
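
The registration order described above can be modeled in a few lines of
userspace C. This is only a sketch of the check in the quoted hunk: the
demo_* flags and counters are hypothetical, and a counter stands in for
WARN_ON().

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_TSC_FREQUENCY_KNOWN	(1u << 0)
#define DEMO_TSC_RELIABLE		(1u << 1)

/* Stand-ins for the forced X86_FEATURE_TSC_* capabilities. */
static bool demo_cap_known_freq;
static bool demo_cap_reliable;

/* Counts how often the WARN_ON() equivalent would have fired. */
static unsigned int demo_warnings;

/* Returns true if the routine was accepted, false if it was rejected. */
static bool demo_register(unsigned int properties)
{
	if (properties & DEMO_TSC_FREQUENCY_KNOWN) {
		demo_cap_known_freq = true;
	} else if (demo_cap_known_freq) {
		demo_warnings++;
		return false;
	}

	if (properties & DEMO_TSC_RELIABLE) {
		demo_cap_reliable = true;
	} else if (demo_cap_reliable) {
		demo_warnings++;
		return false;
	}

	return true;
}
```

Registering a "known and reliable" source first and a "known only" source
second trips the second check, which matches the CoCo-then-KVM sequence the
review describes for this point in the series.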

>  
>  	x86_platform.calibrate_tsc = calibrate_tsc;
>  	if (calibrate_cpu)

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=32

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: sashiko-bot @ 2026-05-15 20:09 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Bypassing `kvm_sched_clock_init()` when `CONSTANT_TSC` is present ignores KVM's `PVCLOCK_TSC_STABLE_BIT`, leading to unfiltered backward time jumps in the core scheduler.
--

commit 57cf68835ef70733c037ca9ffa04fd39e1d70027
Author: Sean Christopherson <seanjc@google.com>

x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop

This commit updates kvmclock_init to prefer the TSC over kvmclock for
sched_clock when the TSC is constant, non-stop, and not marked unstable.
It does this by bypassing kvm_sched_clock_init and instead modifying the
TSC properties so that TSC becomes the preferred clocksource.

Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
[ ... ]
> @@ -416,22 +416,22 @@ void __init kvmclock_init(void)
[ ... ]
>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>  	    !check_tsc_unstable()) {
>  		kvm_clock.rating = 299;
>  		tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> +	} else {
> +		kvm_sched_clock_init(stable);
>  	}
>  
> -	kvm_sched_clock_init(stable);

Does bypassing kvm_sched_clock_init() when CONSTANT_TSC is present
accidentally ignore KVM's PVCLOCK_TSC_STABLE_BIT?

CONSTANT_TSC indicates the hardware TSC rate doesn't change with power
states, but it doesn't guarantee cross-CPU synchronization, such as across
multi-socket systems or differing vCPU offsets from hotplug and migration.
KVM explicitly signals this cross-CPU stability via PVCLOCK_TSC_STABLE_BIT,
which is mapped to the stable variable.

Prior to this patch, kvm_sched_clock_init() was always called, invoking
__paravirt_set_sched_clock() with the stable flag. This allowed the system
to call clear_sched_clock_stable() if the hypervisor could not guarantee
synchronization, enabling crucial per-CPU monotonicity filters in
sched_clock_cpu().

By skipping kvm_sched_clock_init() entirely and registering
TSC_FREQ_KNOWN_AND_RELIABLE, the kernel's native cross-CPU TSC synchronization
checks and the clocksource watchdog appear to be bypassed.

If KVM clears PVCLOCK_TSC_STABLE_BIT, could this result in the guest
ignoring the condition and using raw, unfiltered rdtsc() values? When tasks
migrate between vCPUs with differing TSC offsets, this regression might cause
backward time jumps in sched_clock(), which breaks rq_clock monotonicity
and corrupts CFS vruntime and min_vruntime calculations.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=37

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: sashiko-bot @ 2026-05-15 20:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-3-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] The new helper function `tsc_register_calibration_routines` is missing the `__init` attribute.
--

commit ac711dd28fcd18f28a56325512fc567fdd7d6098
Author: Sean Christopherson <seanjc@google.com>

x86/tsc: Add helper to register CPU and TSC freq calibration routines

Add a helper to register non-native CPU and TSC frequency calibration routines.
This consolidates handling of common TSC properties forced by hypervisors
and guards against overriding a TSC calibration routine with a less trusted one.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index f92236f40cbcc..7e639c0a94a21 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1281,6 +1281,23 @@ static void __init check_system_tsc_reliable(void)
>  		tsc_disable_clocksource_watchdog();
>  }
>  
> +/*
> + * TODO: Disentangle AMD_MEM_ENCRYPT and make SEV guest support depend on
> + *	 HYPERVISOR_GUEST.
> + */
> +#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> +void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> +				       unsigned long (*calibrate_cpu)(void))

This isn't a bug, but should tsc_register_calibration_routines() be marked
with the __init attribute?

Since all callers such as kvmclock_init(), snp_secure_tsc_init(), and
ms_hyperv_init_platform() are __init routines, and this function writes to
the x86_platform structure which is typically __ro_after_init, adding the
__init annotation would allow it to be discarded after the boot process
completes.

> +{
> +	if (WARN_ON_ONCE(!calibrate_tsc))
> +		return;
> +
> +	x86_platform.calibrate_tsc = calibrate_tsc;
> +	if (calibrate_cpu)
> +		x86_platform.calibrate_cpu = calibrate_cpu;
> +}
> +#endif

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=2

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: sashiko-bot @ 2026-05-15 19:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-42-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Fallback path in `kvm_get_cpu_khz()` bypasses `kvm_para_tsc_khz()`, causing a precision regression when CPUID.0x16 is missing.
- [Medium] Core x86 timekeeping logic unconditionally overwrites `cpu_khz` if it diverges from `tsc_khz` by >10%, nullifying the commit's intent.
--

commit 244283a64d7a3809c004f242b3973386bb521665
Author: Sean Christopherson <seanjc@google.com>

    x86/kvmclock: Get CPU base frequency from CPUID when it's available

This commit updates kvmclock initialization to prefer fetching the base CPU
frequency from CPUID.0x16 instead of assuming the virtual CPU runs at the
same frequency as the TSC. It introduces kvm_get_cpu_khz() which falls back
to the PV clock TSC calculation if CPUID is unavailable.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 62c8ea2e67691..7607920ae386e 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>  	}
>  }
>  
> +static unsigned long kvm_get_cpu_khz(void)
> +{
> +	unsigned int cpu_khz;
> +
> +	/*
> +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> +	 */
> +	if (!cpuid_get_cpu_freq(&cpu_khz))
> +		return cpu_khz;
> +
> +	return pvclock_tsc_khz(this_cpu_pvti());

Before this patch, both calibrate_cpu and calibrate_tsc used
kvm_get_tsc_khz(), which checks kvm_para_tsc_khz() to get the exact TSC
frequency from KVM's PV CPUID timing leaf, and only falls back to
pvclock_tsc_khz() if it is unavailable.

Does the fallback path in kvm_get_cpu_khz() bypass kvm_para_tsc_khz()
entirely? For VMs that expose KVM's PV CPUID timing leaf but not
CPUID.0x16, it looks like cpu_khz will now be reverse-calculated from
pvclock multipliers instead of using the exact hypervisor-provided value.

Could this introduce a precision regression due to truncation loss, causing
cpu_khz and tsc_khz to needlessly diverge on the fallback path?

> +}
> +
>  /*
>   * If we don't do that, there is the possibility that the guest
>   * will calibrate under heavy load - thus, getting a lower lpj -

[ ... ]

> @@ -434,7 +448,7 @@ void __init kvmclock_init(void)
>  		kvm_sched_clock_init(stable);
>  	}
>  
> -	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
> +	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
>  					  tsc_properties);

The commit message notes that treating the TSC and CPU frequencies as one
and the same is nonsensical when the hypervisor explicitly enumerates the
CPU frequency.

However, looking at the core x86 initialization in
determine_cpu_tsc_frequencies(), it unconditionally overwrites cpu_khz if it
diverges from tsc_khz by more than 10%:

arch/x86/kernel/tsc.c:determine_cpu_tsc_frequencies() {
	...
	if (tsc_khz == 0)
		tsc_khz = cpu_khz;
	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
		cpu_khz = tsc_khz;
	...
}

If the hypervisor explicitly enumerates a CPU base frequency via CPUID.0x16
that genuinely differs from the TSC frequency by more than 10%, will this
core logic silently discard the explicitly enumerated CPU frequency and
overwrite it with the TSC frequency? 

Does this check thwart the patch's goal of decoupling the two values
exactly in the scenarios where they substantially differ?
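
The effect of that check can be tried with concrete numbers. The helper
below is a hypothetical userspace restatement of the two quoted branches
(it returns the resulting cpu_khz instead of writing globals), and the
frequencies in the note that follows are made up.

```c
#include <assert.h>

/*
 * Returns the cpu_khz value that survives the sanity check: the TSC
 * frequency if the two differ by more than 10% of tsc_khz, otherwise the
 * original cpu_khz. A zero tsc_khz leaves cpu_khz alone, as in the quoted
 * code, where tsc_khz is the one that gets overwritten in that case.
 */
static unsigned int demo_reconcile_cpu_khz(unsigned int cpu_khz,
					   unsigned int tsc_khz)
{
	long long diff = (long long)cpu_khz - (long long)tsc_khz;

	if (diff < 0)
		diff = -diff;

	if (tsc_khz != 0 && diff * 10 > (long long)tsc_khz)
		return tsc_khz;

	return cpu_khz;
}
```

For example, an enumerated 2.0 GHz base frequency next to a 3.0 GHz TSC is
replaced by the TSC value, while 2.9 GHz next to 3.0 GHz is kept, so only
small differences between the two survive.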

>  
>  	x86_platform.get_wallclock = kvm_get_wallclock;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=41

* Re: [PATCH v3 18/41] x86/paravirt: Pass sched_clock save/restore helpers during registration
From: sashiko-bot @ 2026-05-15 19:56 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-19-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unconditional NULL pointer dereference in ACPI suspend/resume and kexec paths due to nullified sched_clock save/restore hooks for VMware and Xen guests.
--

commit 812ddcb2484da2fae1600ef7146674adbef2e760
Author: Sean Christopherson <seanjc@google.com>

x86/paravirt: Pass sched_clock save/restore helpers during registration

Pass in a PV clock's save/restore helpers when configuring sched_clock
instead of relying on each PV clock to manually set the save/restore hooks.
In addition to bringing sanity to the code, this will allow gracefully
"rejecting" a PV sched_clock, e.g. when running as a CoCo guest that has
access to a "secure" TSC.

No functional change intended.

> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index b5cb66ca022b8..968de002975f9 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -347,11 +347,8 @@ static void __init vmware_paravirt_ops_setup(void)
>  
>  	vmware_cyc2ns_setup();
>  
> -	if (vmw_sched_clock) {
> -		paravirt_set_sched_clock(vmware_sched_clock);
> -		x86_platform.save_sched_clock_state = NULL;
> -		x86_platform.restore_sched_clock_state = NULL;
> -	}
> +	if (vmw_sched_clock)
> +		paravirt_set_sched_clock(vmware_sched_clock, NULL, NULL);

[ ... ]

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index b14c4ada89a3f..0114c63dfdd9a 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -280,12 +280,15 @@ bool using_native_sched_clock(void)
>  	return static_call_query(pv_sched_clock) == native_sched_clock;
>  }
>  
> -void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
> +				void (*save)(void), void (*restore)(void))
>  {
>  	if (!stable)
>  		clear_sched_clock_stable();
>  
>  	static_call_update(pv_sched_clock, func);
> +	x86_platform.save_sched_clock_state = save;
> +	x86_platform.restore_sched_clock_state = restore;

Does assigning these function pointers without a null check cause a null
pointer dereference during ACPI suspend/resume or kexec paths for VMware
and Xen guests?

When VMware and Xen pass NULL for these hooks (as formalized by this patch
and introduced in an earlier patch in this series), they are stored directly
in x86_platform.

These hooks are then called unconditionally in
arch/x86/power/cpu.c:save_processor_state():

void save_processor_state(void)
{
	__save_processor_state(&saved_context);
	x86_platform.save_sched_clock_state();
}

and similarly in __restore_processor_state():

	x86_platform.restore_sched_clock_state();

Will these calls jump to a null pointer since the hooks are now NULL?

Before this patch series, these platforms relied on the default
tsc_save_sched_clock_state hooks, which safely returned early. Setting the
hooks to NULL appears to convert this safe no-op into a regression that
triggers a panic.

Should __paravirt_set_sched_clock() substitute a dummy no-op function when
NULL is passed, or should the caller sites in arch/x86/power/cpu.c explicitly
check if the hooks are set before invocation?
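
The first option could look roughly like the user-space sketch below. It
is a model under stated assumptions, not the patch's code: struct
pv_hooks_model, set_sched_clock_hooks() and sched_clock_state_noop() are
hypothetical names standing in for x86_platform and the registration
helper.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*clock_state_hook_t)(void);

/* Hypothetical stand-in for the two x86_platform hooks. */
struct pv_hooks_model {
	clock_state_hook_t save;
	clock_state_hook_t restore;
};

/* Empty stub used whenever a PV clock has nothing to save or restore. */
static void sched_clock_state_noop(void)
{
}

/* Substitute the stub for NULL so callers can invoke the hooks blindly. */
static void set_sched_clock_hooks(struct pv_hooks_model *p,
				  clock_state_hook_t save,
				  clock_state_hook_t restore)
{
	assert(p);

	p->save = save ? save : sched_clock_state_noop;
	p->restore = restore ? restore : sched_clock_state_noop;
}
```

With this shape, a caller passing NULL, NULL ends up with callable hooks,
and the unconditional calls in the suspend/resume path stay as they are.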

>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=18

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: sashiko-bot @ 2026-05-15 19:55 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-37-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] lapic_timer_period is incorrectly assigned the raw APIC frequency in kHz, which fails to account for the kernel's HZ configuration and breaks APIC timer calibration for any HZ != 1000.
--

commit e5bbef05c7ba30fb00c716accae48693e295289a
Author: Sean Christopherson <seanjc@google.com>

x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info

This commit reads the local APIC bus frequency from the PV CPUID leaf
and uses it to directly set lapic_timer_period, bypassing the standard
calibration process.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
[ ... ]
> @@ -200,10 +200,19 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>   */
>  static unsigned long kvm_get_tsc_khz(void)
>  {
> +#ifdef CONFIG_X86_LOCAL_APIC
> +	u32 apic_khz = kvm_para_apic_bus_khz();
> +
>  	/*
> -	 * If KVM advertises the frequency directly in CPUID, use that
> -	 * instead of reverse-calculating it from the KVM clock data.
> +	 * Use the TSC frequency from KVM's (and other hypervisors') PV CPUID
> +	 * leaf when available, instead of reverse-calculating it from the KVM
> +	 * clock data.  As a bonus, the CPUID leaf also includes the local APIC
> +	 * bus/timer frequency.
>  	 */
> +	if (apic_khz)
> +		lapic_timer_period = apic_khz;

Does setting lapic_timer_period directly to apic_khz assume that HZ is
always configured to 1000?

The lapic_timer_period variable represents the number of APIC timer ticks
per jiffy. This is usually calculated as (frequency_in_hz / HZ), which would
be (apic_khz * 1000) / HZ.

Assigning the value of apic_khz directly means the period is exactly
frequency_in_hz / 1000. If the kernel is built with HZ set to 100 or 250,
could this result in a timer period that is too small, potentially causing
timer interrupts to fire much more frequently than intended?
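
As a worked example of the conversion described above, the following
sketch computes ticks per jiffy from a frequency in kHz.
apic_ticks_per_jiffy() is a hypothetical helper and hz is passed as a
parameter purely for illustration; in the kernel, HZ is a build-time
constant.

```c
#include <assert.h>

/*
 * Ticks per jiffy = frequency in Hz / HZ = (apic_khz * 1000) / hz.
 * The multiplication is done in 64 bits to avoid overflow on 32-bit.
 */
static unsigned long apic_ticks_per_jiffy(unsigned long apic_khz,
					  unsigned int hz)
{
	assert(hz != 0);

	return (unsigned long)((unsigned long long)apic_khz * 1000ULL / hz);
}
```

For a 1 GHz APIC bus (apic_khz = 1000000), this gives 1000000 ticks at
HZ=1000, 4000000 at HZ=250 and 10000000 at HZ=100. Only the HZ=1000 result
equals the raw apic_khz value.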

> +#endif
> +
>  	return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=36

* Re: [PATCH v3 17/41] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
From: sashiko-bot @ 2026-05-15 19:55 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-18-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Setting `x86_platform.save_sched_clock_state` and `restore_sched_clock_state` to NULL for PV clocksources causes a fatal kernel panic on S3/S4 suspend and resume.
--

commit 4d4877d2e739eb2258c919e41775916ae05af732
Author: Sean Christopherson <seanjc@google.com>

x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock

This commit adds a warning to guard against regressions when the default
TSC sched_clock save and restore hooks are called while using a PV
sched_clock.

> Now that all PV clocksources override the sched_clock save/restore hooks
> when overriding sched_clock, WARN if the "default" TSC hooks are invoked
> when using a PV sched_clock, e.g. to guard against regressions.

Earlier patches in this series appear to override these hooks by setting them
to NULL, such as setting x86_platform.save_sched_clock_state = NULL. Could
this lead to a panic during suspend and resume?

Looking at arch/x86/power/cpu.c, save_processor_state() and
__restore_processor_state() unconditionally invoke these function pointers
without checking for NULL:

void save_processor_state(void)
{
        __save_processor_state(&saved_context);
        x86_platform.save_sched_clock_state();
}

When VMware or Xen HVM guests attempt to suspend or hibernate, won't the
kernel call through the NULL pointer and crash?

Would it be better for PV clocksources to override the hooks to a safe no-op
function like x86_init_noop rather than NULL, or for the callers in
power/cpu.c to explicitly check for NULL before invocation?
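
The caller-side alternative could be sketched as follows. This is a
user-space model, not the code in power/cpu.c: struct
platform_hooks_model, save_state_checked() and counting_save() are
hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two x86_platform hooks. */
struct platform_hooks_model {
	void (*save_sched_clock_state)(void);
	void (*restore_sched_clock_state)(void);
};

/* Example hook that only counts how often it was invoked. */
static int save_calls;

static void counting_save(void)
{
	save_calls++;
}

/*
 * Invoke the save hook only if one is installed.
 * Returns 1 if the hook ran, 0 if it was skipped.
 */
static int save_state_checked(const struct platform_hooks_model *p)
{
	assert(p);

	if (!p->save_sched_clock_state)
		return 0;

	p->save_sched_clock_state();
	return 1;
}
```

The trade-off is that every call site has to remember the check, whereas
a no-op stub keeps the call sites unconditional.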

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=17

* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: sashiko-bot @ 2026-05-15 19:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-41-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unintended clobbering of the global `cpu_khz` variable in `native_calibrate_tsc()` due to a missing local variable declaration.
--

commit 36e45379d4c9f40ae24c5c323b4f0471c94c6bef
Author: Sean Christopherson <seanjc@google.com>

x86/tsc: Add standalone helper for getting CPU frequency from CPUID

This commit extracts the logic for getting the CPU frequency via CPUID into
a new standalone helper function. It also updates the TSC calibration code
to use this new helper.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c

[ ... ]

> @@ -754,13 +772,8 @@ unsigned long native_calibrate_tsc(void)
>  	 * clock, but we can easily calculate it to a high degree of accuracy
>  	 * by considering the crystal ratio and the CPU speed.
>  	 */
> -	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
> -		unsigned int eax_base_mhz, ebx, ecx, edx;
> -
> -		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> -		info.crystal_khz = eax_base_mhz * 1000 *
> -			info.denominator / info.numerator;
> -	}
> +	if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
> +		info.crystal_khz = cpu_khz * info.denominator / info.numerator;

Does this unintentionally clobber the global cpu_khz variable?

Because cpu_khz is not declared locally in native_calibrate_tsc(), this
appears to pass a pointer to the global cpu_khz variable to
cpuid_get_cpu_freq().

Looking at cpuid_get_cpu_freq(), it unconditionally zeroes the target
pointer before checking if CPUID_LEAF_FREQ is supported:

arch/x86/kernel/tsc.c:cpuid_get_cpu_freq() {
        ...
        *cpu_khz = 0;

        if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
                return -ENOENT;
        ...
}

If the system exposes CPUID.0x15 with a 0 crystal clock but does not
support CPUID.0x16, cpuid_get_cpu_freq() returns -ENOENT and leaves the
global cpu_khz as 0.

Could this overwrite the previously calibrated CPU frequency and force the
kernel to fall back to legacy PIT/HPET calibration? If so, this might
cause boot regressions on systems where these legacy timers are disabled
or unreliable. Even if CPUID.0x16 is supported, it seems this would silently
overwrite the dynamically calibrated global value with the CPUID base
frequency.
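
One way to avoid touching the global is a distinctly named local. The
sketch below is a user-space model under stated assumptions:
get_cpu_freq_model() imitates the zero-then-fill behaviour of the quoted
helper but takes the CPUID base MHz as a parameter, and
crystal_khz_model() and base_khz are hypothetical names.

```c
#include <assert.h>

/* Stand-in for the kernel's global cpu_khz; must not be clobbered. */
static unsigned int cpu_khz = 2400000;

/* Zeroes *khz first, then fills it in only if a base frequency exists. */
static int get_cpu_freq_model(unsigned int *khz, unsigned int cpuid_base_mhz)
{
	assert(khz);

	*khz = 0;
	if (!cpuid_base_mhz)
		return -1;

	*khz = cpuid_base_mhz * 1000;
	return 0;
}

/* Derive the crystal frequency using a local, not the global cpu_khz. */
static unsigned int crystal_khz_model(unsigned int cpuid_base_mhz,
				      unsigned int denominator,
				      unsigned int numerator)
{
	unsigned int base_khz;

	assert(numerator != 0);

	if (get_cpu_freq_model(&base_khz, cpuid_base_mhz))
		return 0;

	return base_khz * denominator / numerator;
}
```

For a 2400 MHz base frequency with denominator 2 and numerator 200, the
derived crystal frequency is 24000 kHz, and the global cpu_khz keeps its
previous value whether or not the lookup succeeds.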

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=40

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: sashiko-bot @ 2026-05-15 19:48 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-30-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] kvm_sched_clock_read is registered as the system sched_clock before its required offset (kvm_sched_clock_offset) is initialized.
--

commit b6e429e4990c00d845488d46e8dc4132f3a2fd8b
Author: Sean Christopherson <seanjc@google.com>

x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()

This commit adds a return code to __paravirt_set_sched_clock() to allow the
kernel to reject attempts to use a PV sched_clock. It updates the caller
in kvmclock to check the return value and abort initialization if needed.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d3bb281c0805d..9b3d1ed1a96dc 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
[ ... ]
> @@ -338,10 +338,12 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>  
>  static __init void kvm_sched_clock_init(bool stable)
>  {
> +	if (__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
> +				       kvm_save_sched_clock_state,
> +				       kvm_restore_sched_clock_state))
> +		return;
> +
>  	kvm_sched_clock_offset = kvm_clock_read();

Could this cause a massive backward jump in system time if an NMI or early
exception calls sched_clock() right after registration?

__paravirt_set_sched_clock() uses static_call_update() to globally publish
kvm_sched_clock_read() as the live system sched_clock. Since the
initialization of kvm_sched_clock_offset is now delayed until after this
registration completes, any early caller will execute kvm_sched_clock_read()
with a zero offset, returning a huge absolute raw clock value.

When the code resumes, it sets the offset to the current raw clock value.
Subsequent sched_clock() calls will abruptly subtract this large offset,
causing time to jump backwards and violating monotonic clock guarantees.

Would it be safer to initialize kvm_sched_clock_offset before calling
__paravirt_set_sched_clock(), since calculating the offset early has no
side effects even if the registration is subsequently rejected?
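
The suggested ordering can be modelled in user space as below. All names
here (raw_clock_ns, model_register(), model_sched_clock_init() and so on)
are hypothetical; the reject flag stands in for whatever condition makes
the kernel refuse a PV sched_clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t raw_clock_ns;			/* models the raw PV clock */
static uint64_t sched_clock_offset;		/* models the offset variable */
static uint64_t (*active_sched_clock)(void);	/* models the published hook */

static uint64_t model_raw_read(void)
{
	return raw_clock_ns;
}

static uint64_t model_sched_clock_read(void)
{
	return model_raw_read() - sched_clock_offset;
}

/* Publishes fn unless the registration is rejected. */
static int model_register(uint64_t (*fn)(void), int reject)
{
	assert(fn);

	if (reject)
		return -1;

	active_sched_clock = fn;
	return 0;
}

/* Offset is captured first, so the hook is never live with offset 0. */
static int model_sched_clock_init(int reject)
{
	sched_clock_offset = model_raw_read();
	return model_register(model_sched_clock_read, reject);
}
```

In this ordering, a rejected registration merely leaves an unused offset
behind, and an accepted one starts counting from zero on the first read.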

> -	__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
> -				   kvm_save_sched_clock_state,
> -				   kvm_restore_sched_clock_state);
>  	kvmclock_is_sched_clock = true;
>  
>  	/*

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=29

* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: sashiko-bot @ 2026-05-15 19:48 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-16-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Setting `x86_platform.save_sched_clock_state` and `x86_platform.restore_sched_clock_state` to NULL causes an unconditional NULL pointer dereference during suspend, hibernation, or kexec operations in Xen guests.
--

commit 07d607146953d7a7f53a708d78545184418c340c
Author: Sean Christopherson <seanjc@google.com>

x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks

This commit nullifies the x86_platform sched_clock save/restore hooks when
setting up Xen's PV clock. It aims to clarify that these common hooks are
not used when running as a Xen guest.

> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 3d3165eef8214..21d366d01985e 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
>  	xen_sched_clock_offset = xen_clocksource_read();
>  	static_call_update(pv_steal_clock, xen_steal_clock);
>  	paravirt_set_sched_clock(xen_sched_clock);
> +	/*
> +	 * Xen has paravirtualized suspend/resume and so doesn't use the common
> +	 * x86 sched_clock save/restore hooks.
> +	 */
> +	x86_platform.save_sched_clock_state = NULL;
> +	x86_platform.restore_sched_clock_state = NULL;

Does setting these hooks to NULL lead to an unconditional NULL pointer
dereference during suspend, hibernation, or kexec operations in Xen guests?

Looking at save_processor_state() and __restore_processor_state() in
arch/x86/power/cpu.c, the hooks are called directly without being checked
for NULL first:

arch/x86/power/cpu.c:save_processor_state() {
	__save_processor_state(&saved_context);
	x86_platform.save_sched_clock_state();
}

arch/x86/power/cpu.c:__restore_processor_state() {
	...
	do_fpu_end();
	tsc_verify_tsc_adjust(true);
	x86_platform.restore_sched_clock_state();
	...
}

Will operations that trigger a generic CPU state save or restore on a Xen
guest call through these NULL pointers and panic the kernel? Should the
hooks instead be assigned an empty stub function so they are safely
bypassed?

>  
>  	tsc_register_calibration_routines(xen_tsc_khz, NULL);
>  	x86_platform.get_wallclock = xen_get_wallclock;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=15

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: sashiko-bot @ 2026-05-15 19:47 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-28-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The logic to defer AP kvmclock registration to the CPUHP_AP_ONLINE_DYN state when it is not used as sched_clock is architecturally unsafe and will break timekeeping.
--

commit 88625772445f01c4397e3dfd4ee43222efd758cf
Author: Sean Christopherson <seanjc@google.com>

x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock

This commit prepares for making early_percpu_clock_init() a dedicated
sched_clock hook. It adds logic to enable kvmclock on APs during CPU online
if kvmclock is not used as sched_clock, and includes WARNs in the suspend
and resume paths.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index e7250d21c6721..d3bb281c0805d 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
[ ... ]
> @@ -141,25 +142,51 @@ static void kvm_restore_sched_clock_state(void)
[ ... ]
>  void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>  {
> -	/*
> -	 * Don't disable kvmclock on the BSP during suspend.  If kvmclock is
> -	 * being used for sched_clock, then it needs to be kept alive until the
> -	 * last minute, and restored as quickly as possible after resume.
> -	 */
> -	if (action != KVM_GUEST_BSP_SUSPEND)
> +	switch (action) {
[ ... ]
> +	case KVM_GUEST_AP_ONLINE:
> +		/*
> +		 * Secondary CPUs use dedicated sched_clock hooks to enable
> +		 * kvmclock early during bringup, there's nothing to be done
> +		 * during CPU online.
> +		 */
> +		if (kvmclock_is_sched_clock)
> +			break;
> +		kvm_register_clock("secondary cpu, online");
> +		break;

Will deferring AP kvmclock registration to the CPU online phase break
timekeeping if kvmclock is still being used as the system clocksource?

If the user boots with clocksource=kvm-clock or if kvmclock is selected as a
fallback, it will be used as the clocksource despite not being sched_clock.
The kvmclock clocksource read function accesses per-CPU memory that the
hypervisor only populates after the AP registers its clock.

If early registration in start_secondary() is skipped, as anticipated here,
an AP will boot, enable local interrupts, and process timer ticks before
reaching the late CPUHP_AP_ONLINE_DYN hotplug state.

Could a call to ktime_get() during this window read uninitialized per-CPU
memory and return 0? It seems this might cause the timekeeper to compute
massive forward time jumps, leading to immediate timer expirations and
watchdogs firing.

> +	case KVM_GUEST_AP_OFFLINE:
> +	case KVM_GUEST_SHUTDOWN:
>  		kvmclock_disable();
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		break;
> +	}
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=27

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: sashiko-bot @ 2026-05-15 19:45 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-32-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Hyper-V VMs with an invariant TSC but no frequency MSRs lose the X86_FEATURE_TSC_RELIABLE capability.
--

commit 0c0fbc4f0451c30c0e87e0e3591ea4f487eff75c
Author: Sean Christopherson <seanjc@google.com>

x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration

This commit adds a tsc_properties parameter to the paravirtual TSC
calibration registration routine to consolidate feature flag tracking.
It updates various hypervisor initialization paths to pass combined flags
instead of manually setting the CPU capabilities.

> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
>  
>  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
>  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> +
> +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> +
> +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> +						  tsc_properties);
>  	}

[ ... ]

> @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
>  		 * is called.
>  		 */
>  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>  	}

If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
does it bypass the tsc_register_calibration_routines() block entirely?

Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
here, it looks like these VMs will lose the reliable flag.

Will this inadvertently enable the TSC watchdog, potentially causing a
performance regression if the system falsely marks the TSC as unstable due
to virtualization scheduling delays?
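
The case in question can be laid out as a small truth table. The sketch
below only models the control flow as read from the quoted hunks; the
PROP_* flags and the props_before()/props_after() helpers are hypothetical
and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags modelling the two TSC properties. */
enum {
	PROP_NONE	= 0,
	PROP_KNOWN_FREQ	= 1,
	PROP_RELIABLE	= 2,
};

/* Before: RELIABLE is forced whenever the TSC is invariant. */
static unsigned int props_before(bool freq_msrs, bool invariant)
{
	unsigned int p = PROP_NONE;

	if (freq_msrs)
		p |= PROP_KNOWN_FREQ;
	if (invariant)
		p |= PROP_RELIABLE;
	return p;
}

/* After: RELIABLE is only set inside the frequency-MSR block. */
static unsigned int props_after(bool freq_msrs, bool invariant)
{
	unsigned int p = PROP_NONE;

	if (freq_msrs) {
		p |= PROP_KNOWN_FREQ;
		if (invariant)
			p |= PROP_RELIABLE;
	}
	return p;
}

/* The two models agree whenever the frequency MSRs are available. */
static void check_models_agree_with_freq_msrs(void)
{
	assert(props_before(true, false) == props_after(true, false));
	assert(props_before(true, true) == props_after(true, true));
}
```

The only input where the two models differ is freq_msrs = false with
invariant = true, which is the configuration asked about above.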

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=31

* Re: [PATCH net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
From: sashiko-bot @ 2026-05-15 19:44 UTC (permalink / raw)
  To: Erni Sri Satya Vennela; +Cc: linux-hyperv
In-Reply-To: <20260514194156.466823-1-ernis@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Unvalidated array index `rx_req_idx` derived from untrusted DMA memory allows arbitrary out-of-bounds memory access.
--

commit b79489b09c3449da038fde04cce5dc4d6a1fe538
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer

This commit fixes a TOCTOU vulnerability in mana_hwc_rx_event_handler() where
the hwc_msg_id is read twice from DMA-coherent memory. It reads the msg_id once
using READ_ONCE() and passes it to mana_hwc_handle_resp() to prevent a malicious
host in a Confidential VM from bypassing bounds checks.

> diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> index dbbde0fa57e71..fd8b324d7fb68 100644
> --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c

[ ... ]

> @@ -269,13 +268,17 @@ static void mana_hwc_rx_event_handler(void *ctx, u32 gdma_rxq_id,
>  	rx_req = &hwc_rxq->msg_buf->reqs[rx_req_idx];

Since this patch addresses host-controlled DMA memory vulnerabilities in
Confidential VMs, could there be another out-of-bounds access just before
the msg_id check?

Looking at how rx_req_idx is calculated earlier in this function:

	sge = (struct gdma_sge *)(wqe + 8 + dma_oob->inline_oob_size_div4 * 4);
	rq_base_addr = hwc_rxq->msg_buf->mem_info.dma_handle;
	rx_req_idx = (sge->address - rq_base_addr) / hwc->max_req_msg_size;

Because sge->address is read directly from the untrusted DMA queue, a malicious
host could provide an arbitrary value, resulting in an out-of-bounds
rx_req_idx.

Using this unvalidated index to access hwc_rxq->msg_buf->reqs[rx_req_idx]
yields an attacker-controlled rx_req pointer. When this corrupted rx_req is
subsequently passed to mana_hwc_post_rx_wqe(), the driver writes to req->sge
and req->wqe_req, which could allow arbitrary writes to guest kernel memory.

Should rx_req_idx be bounds-checked against the size of the reqs array
before it is used to index the array?
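
Such a check might look like the user-space sketch below.
rx_req_idx_from_addr() is a hypothetical helper, not part of the MANA
driver, and num_reqs stands for however the driver tracks the length of
the reqs array. Rejecting addresses that do not fall on a message boundary
is an extra assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a host-supplied buffer address into an index into the request
 * array, refusing anything that falls outside [0, num_reqs).
 * Returns true and stores the index on success, false otherwise.
 */
static bool rx_req_idx_from_addr(uint64_t sge_addr, uint64_t base_addr,
				 uint32_t msg_size, uint32_t num_reqs,
				 uint32_t *idx)
{
	uint64_t off;

	assert(idx);

	/* Guard the division and reject addresses below the base. */
	if (msg_size == 0 || sge_addr < base_addr)
		return false;

	off = sge_addr - base_addr;

	/* Assumption: valid buffers always start on a message boundary. */
	if (off % msg_size != 0)
		return false;

	if (off / msg_size >= num_reqs)
		return false;

	*idx = (uint32_t)(off / msg_size);
	return true;
}
```

With a base of 0x1000, a message size of 4096 and four requests, address
0x3000 maps to index 2, while 0x5000 (one past the end) and any address
below the base are rejected before the array is touched.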

>  	resp = (struct gdma_resp_hdr *)rx_req->buf_va;
>  
> -	if (resp->response.hwc_msg_id >= hwc->num_inflight_msg) {
> -		dev_err(hwc->dev, "HWC RX: wrong msg_id=%u\n",
> -			resp->response.hwc_msg_id);
> +	/* Read msg_id once from DMA buffer to prevent TOCTOU:
> +	 * DMA memory is shared/unencrypted in CVMs - host can
> +	 * modify it between reads.
> +	 */
> +	msg_id = READ_ONCE(resp->response.hwc_msg_id);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260514194156.466823-1-ernis@linux.microsoft.com?part=1

* Re: [PATCH v3 16/41] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
From: sashiko-bot @ 2026-05-15 19:42 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-hyperv
In-Reply-To: <20260515191942.1892718-17-seanjc@google.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Setting `x86_platform.save_sched_clock_state` and `x86_platform.restore_sched_clock_state` to `NULL` causes a kernel panic during suspend and resume operations.
--

commit 2ff4708308b8e09ffda052258a76f33f136c2ea8
Author: Sean Christopherson <seanjc@google.com>

x86/vmware: Nullify save/restore hooks when using VMware's sched_clock

This commit sets the sched_clock save and restore hooks to null when
using the VMware sched_clock implementation. It aims to bypass these
hooks during suspend and resume operations for VMware guests.

> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index b88d9ca01202d..b5cb66ca022b8 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c

[ ... ]

> @@ -347,8 +347,11 @@ static void __init vmware_paravirt_ops_setup(void)
>  
>  	vmware_cyc2ns_setup();
>  
> -	if (vmw_sched_clock)
> +	if (vmw_sched_clock) {
>  		paravirt_set_sched_clock(vmware_sched_clock);
> +		x86_platform.save_sched_clock_state = NULL;
> +		x86_platform.restore_sched_clock_state = NULL;

Will this cause a kernel panic during suspend and resume operations?

In arch/x86/power/cpu.c, both save_processor_state() and
__restore_processor_state() unconditionally call these function pointers
without checking if they are populated:

arch/x86/power/cpu.c:save_processor_state() {
    ...
    x86_platform.save_sched_clock_state();
    ...
}

Since VMware guests support standard x86 ACPI S3 and hibernate pathways,
executing a null function pointer here will crash the kernel during suspend
or hibernate.

Could these be assigned to an empty dummy function like x86_init_noop
instead of null to safely make them a no-op?

> +	}
>  
>  	if (vmware_is_stealclock_available()) {
>  		has_steal_clock = true;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=16

* [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

If CPUID.0x16 is present and valid, use the CPU frequency provided by
CPUID instead of assuming that the virtual CPU runs at the same
frequency as TSC and/or kvmclock.  Back before constant TSCs were a
thing, treating the TSC and CPU frequencies as one and the same was
somewhat reasonable, but now it's nonsensical, especially if the
hypervisor explicitly enumerates the CPU frequency.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 62c8ea2e6769..7607920ae386 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
 	}
 }
 
+static unsigned long kvm_get_cpu_khz(void)
+{
+	unsigned int cpu_khz;
+
+	/*
+	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
+	 * isn't necessarily the same as the kvmclock "TSC" frequency.
+	 */
+	if (!cpuid_get_cpu_freq(&cpu_khz))
+		return cpu_khz;
+
+	return pvclock_tsc_khz(this_cpu_pvti());
+}
+
 /*
  * If we don't do that, there is the possibility that the guest
  * will calibrate under heavy load - thus, getting a lower lpj -
@@ -434,7 +448,7 @@ void __init kvmclock_init(void)
 		kvm_sched_clock_init(stable);
 	}
 
-	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
+	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
 					  tsc_properties);
 
 	x86_platform.get_wallclock = kvm_get_wallclock;
-- 
2.54.0.563.g4f69b47b94-goog


* [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
doesn't restrict the usage to Intel CPUs.  This will allow sharing the
core logic with kvmclock, as (a) CPUID.0x16 may be enumerated alongside
kvmclock, and (b) KVM generally doesn't restrict CPUID based on vendor.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/tsc.c      | 37 +++++++++++++++++++++++--------------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index f458be688512..c145f5707b52 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -91,6 +91,7 @@ struct cpuid_tsc_info {
 };
 extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
 extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
 
 extern void tsc_early_init(void);
 extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 1b569954ae5e..745fa2052c74 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -719,6 +719,24 @@ int cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
 	return 0;
 }
 
+int cpuid_get_cpu_freq(unsigned int *cpu_khz)
+{
+	unsigned int eax_base_mhz, ebx, ecx, edx;
+
+	*cpu_khz = 0;
+
+	if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+		return -ENOENT;
+
+	cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
+
+	if (!eax_base_mhz)
+		return -ENOENT;
+
+	*cpu_khz = eax_base_mhz * 1000;
+	return 0;
+}
+
 /**
  * native_calibrate_tsc - determine TSC frequency
  * Determine TSC frequency via CPUID, else return 0.
@@ -754,13 +772,8 @@ unsigned long native_calibrate_tsc(void)
 	 * clock, but we can easily calculate it to a high degree of accuracy
 	 * by considering the crystal ratio and the CPU speed.
 	 */
-	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
-		unsigned int eax_base_mhz, ebx, ecx, edx;
-
-		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
-		info.crystal_khz = eax_base_mhz * 1000 *
-			info.denominator / info.numerator;
-	}
+	if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
+		info.crystal_khz = cpu_khz * info.denominator / info.numerator;
 
 	if (!info.crystal_khz)
 		return 0;
@@ -787,19 +800,15 @@ unsigned long native_calibrate_tsc(void)
 
 static unsigned long cpu_khz_from_cpuid(void)
 {
-	unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
+	unsigned int cpu_khz;
 
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return 0;
 
-	if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+	if (cpuid_get_cpu_freq(&cpu_khz))
 		return 0;
 
-	eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
-
-	cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
-
-	return eax_base_mhz * 1000;
+	return cpu_khz;
 }
 
 /*
-- 
2.54.0.563.g4f69b47b94-goog
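
The review reply at the top of this thread notes that the native_calibrate_tsc() hunk passes &cpu_khz without adding a local declaration, so the name resolves to the kernel's global cpu_khz. The sketch below reproduces that effect in userspace and shows one possible fix using a differently named local. Everything in it is a stand-in: fake_cpuid_base_mhz, get_cpu_freq(), crystal_khz_clobbering(), crystal_khz_local(), and the frequency values are hypothetical, not kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's global cpu_khz (value is made up). */
unsigned int cpu_khz = 2400000;

/* Hypothetical replacement for CPUID.0x16 EAX, the base frequency in MHz. */
static unsigned int fake_cpuid_base_mhz = 3000;

/* Same contract as the helper in the patch: 0 on success, -ENOENT otherwise. */
static int get_cpu_freq(unsigned int *khz)
{
	*khz = 0;

	if (!fake_cpuid_base_mhz)
		return -ENOENT;

	*khz = fake_cpuid_base_mhz * 1000;
	return 0;
}

/* Shape of the posted hunk: no local, so &cpu_khz names the global. */
static unsigned int crystal_khz_clobbering(unsigned int num, unsigned int den)
{
	assert(num != 0);

	if (!get_cpu_freq(&cpu_khz))
		return cpu_khz * den / num;
	return 0;
}

/* One possible fix: a local with a name that cannot collide. */
static unsigned int crystal_khz_local(unsigned int num, unsigned int den)
{
	unsigned int base_khz;

	assert(num != 0);

	if (!get_cpu_freq(&base_khz))
		return base_khz * den / num;
	return 0;
}
```

With a numerator of 250 and a denominator of 2, both variants compute the same crystal frequency, but only crystal_khz_clobbering() overwrites the global as a side effect.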



* [PATCH v3 39/41] x86/paravirt: Move using_native_sched_clock() stub into timer.h
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

Now that timer.h ended up with CONFIG_PARAVIRT #ifdeffery anyways, move the
PARAVIRT=n using_native_sched_clock() stub into timer.h as a "free"
optimization.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/timer.h | 5 +++--
 arch/x86/kernel/tsc.c        | 2 --
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index ab1271bd9c3b..d8cb9c84f2c7 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -11,9 +11,9 @@ extern void recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
-extern bool using_native_sched_clock(void);
-
 #ifdef CONFIG_PARAVIRT
+extern bool using_native_sched_clock(void);
+
 int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
 				      void (*save)(void), void (*restore)(void),
 				      void (*start_secondary));
@@ -27,6 +27,7 @@ static __always_inline void paravirt_set_sched_clock(u64 (*func)(void),
 void paravirt_sched_clock_start_secondary(void);
 #else
 static inline void paravirt_sched_clock_start_secondary(void) { }
+static inline bool using_native_sched_clock(void) { return true; }
 #endif
 
 /*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f78e86494dec..1b569954ae5e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -316,8 +316,6 @@ int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
 }
 #else
 u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
-
-bool using_native_sched_clock(void) { return true; }
 #endif
 
 notrace u64 sched_clock(void)
-- 
2.54.0.563.g4f69b47b94-goog
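
A minimal sketch of the header-stub pattern used above, assuming a hypothetical DEMO_PARAVIRT macro in place of CONFIG_PARAVIRT and hypothetical function names. When the option is off, the static inline stub is visible to every caller, so the compiler can fold the check away instead of emitting a call to an out-of-line function.

```c
#include <stdbool.h>

/* DEMO_PARAVIRT is a hypothetical stand-in for CONFIG_PARAVIRT. */
#ifdef DEMO_PARAVIRT
/* Real implementation lives in a .c file and consults runtime state. */
bool using_native_sched_clock_demo(void);
#else
/* Option off: the answer is a compile-time constant, so keep it inline. */
static inline bool using_native_sched_clock_demo(void) { return true; }
#endif

/* Hypothetical caller: returns 1 for the native path, 2 for a PV path. */
static int pick_clock_path(void)
{
	if (using_native_sched_clock_demo())
		return 1;
	return 2;
}
```

Built without DEMO_PARAVIRT, pick_clock_path() always takes the native branch.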



* [PATCH v3 38/41] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

Rework the seemingly generic x86_cpuinit_ops.early_percpu_clock_init hook
into a dedicated PV sched_clock hook, as the only reason the hook exists
is to allow kvmclock to enable its PV clock on secondary CPUs before the
kernel tries to reference sched_clock, e.g. when grabbing a timestamp for
printk.

Rearranging the hook doesn't exactly reduce complexity; arguably it does
the opposite.  But as-is, it's practically impossible to understand *why*
kvmclock needs to do early configuration.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/timer.h    |  8 ++++++--
 arch/x86/include/asm/x86_init.h |  2 --
 arch/x86/kernel/kvmclock.c      | 13 ++++++-------
 arch/x86/kernel/smpboot.c       |  3 ++-
 arch/x86/kernel/tsc.c           | 16 +++++++++++++++-
 arch/x86/kernel/x86_init.c      |  1 -
 6 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index ca5c95d48c03..ab1271bd9c3b 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -15,14 +15,18 @@ extern bool using_native_sched_clock(void);
 
 #ifdef CONFIG_PARAVIRT
 int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
-				      void (*save)(void), void (*restore)(void));
+				      void (*save)(void), void (*restore)(void),
+				      void (*start_secondary));
 
 static __always_inline void paravirt_set_sched_clock(u64 (*func)(void),
 						     void (*save)(void),
 						     void (*restore)(void))
 {
-	(void)__paravirt_set_sched_clock(func, true, save, restore);
+	(void)__paravirt_set_sched_clock(func, true, save, restore, NULL);
 }
+void paravirt_sched_clock_start_secondary(void);
+#else
+static inline void paravirt_sched_clock_start_secondary(void) { }
 #endif
 
 /*
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 6c8a6ead84f6..d1b3f18ea41f 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -187,13 +187,11 @@ struct x86_init_ops {
 /**
  * struct x86_cpuinit_ops - platform specific cpu hotplug setups
  * @setup_percpu_clockev:	set up the per cpu clock event device
- * @early_percpu_clock_init:	early init of the per cpu clock event device
  * @fixup_cpu_id:		fixup function for cpuinfo_x86::topo.pkg_id
  * @parallel_bringup:		Parallel bringup control
  */
 struct x86_cpuinit_ops {
 	void (*setup_percpu_clockev)(void);
-	void (*early_percpu_clock_init)(void);
 	void (*fixup_cpu_id)(struct cpuinfo_x86 *c, int node);
 	bool parallel_bringup;
 };
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 0578bc448b1b..62c8ea2e6769 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -127,12 +127,13 @@ static void kvm_save_sched_clock_state(void)
 	kvmclock_disable();
 }
 
-#ifdef CONFIG_SMP
-static void kvm_setup_secondary_clock(void)
+static void kvm_setup_secondary_sched_clock(void)
 {
+	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_SMP)))
+		return;
+
 	kvm_register_clock("secondary cpu, sched_clock setup");
 }
-#endif
 
 static void kvm_restore_sched_clock_state(void)
 {
@@ -352,7 +353,8 @@ static __init void kvm_sched_clock_init(bool stable)
 {
 	if (__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
 				       kvm_save_sched_clock_state,
-				       kvm_restore_sched_clock_state))
+				       kvm_restore_sched_clock_state,
+				       kvm_setup_secondary_sched_clock))
 		return;
 
 	kvm_sched_clock_offset = kvm_clock_read();
@@ -437,9 +439,6 @@ void __init kvmclock_init(void)
 
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_SMP
-	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
 	kvm_get_preset_lpj();
 
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 294a8ea60298..318ae70e5e7b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -78,6 +78,7 @@
 #include <asm/io_apic.h>
 #include <asm/fpu/api.h>
 #include <asm/setup.h>
+#include <asm/timer.h>
 #include <asm/uv/uv.h>
 #include <asm/microcode.h>
 #include <asm/i8259.h>
@@ -275,7 +276,7 @@ static void notrace __noendbr start_secondary(void *unused)
 	cpu_init();
 	fpu__init_cpu();
 	rcutree_report_cpu_starting(raw_smp_processor_id());
-	x86_cpuinit.early_percpu_clock_init();
+	paravirt_sched_clock_start_secondary();
 
 	ap_starting();
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7a261214fa3e..f78e86494dec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -280,8 +280,19 @@ bool using_native_sched_clock(void)
 	return static_call_query(pv_sched_clock) == native_sched_clock;
 }
 
+#ifdef CONFIG_SMP
+static void (*pv_sched_clock_start_secondary)(void) __ro_after_init;
+
+void paravirt_sched_clock_start_secondary(void)
+{
+	if (pv_sched_clock_start_secondary)
+		pv_sched_clock_start_secondary();
+}
+#endif
+
 int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
-				      void (*save)(void), void (*restore)(void))
+				      void (*save)(void), void (*restore)(void),
+				      void (*start_secondary))
 {
 	/*
 	 * Don't replace TSC with a PV clock when running as a CoCo guest and
@@ -298,6 +309,9 @@ int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
 	static_call_update(pv_sched_clock, func);
 	x86_platform.save_sched_clock_state = save;
 	x86_platform.restore_sched_clock_state = restore;
+#ifdef CONFIG_SMP
+	pv_sched_clock_start_secondary = start_secondary;
+#endif
 	return 0;
 }
 #else
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index ebefb77c37bb..cbb5ee613ed5 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -128,7 +128,6 @@ struct x86_init_ops x86_init __initdata = {
 };
 
 struct x86_cpuinit_ops x86_cpuinit = {
-	.early_percpu_clock_init	= x86_init_noop,
 	.setup_percpu_clockev		= setup_secondary_APIC_clock,
 	.parallel_bringup		= true,
 };
-- 
2.54.0.563.g4f69b47b94-goog
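
The optional start-secondary hook above can be sketched in userspace as follows. All names here are hypothetical. One assumption is worth calling out: the posted prototypes spell the new parameter as void (*start_secondary), which in C declares a plain void pointer rather than a function pointer. The sketch uses the fully typed void (*start_secondary)(void), on the guess that this is what was intended.

```c
#include <stddef.h>

/* Hook is NULL unless a PV clock registers one (hypothetical name). */
static void (*start_secondary_hook)(void);

/* Counts invocations so the effect of the hook is observable. */
static int secondary_setup_calls;

/* Hypothetical PV clock callback, standing in for kvmclock's setup. */
static void demo_setup_secondary(void)
{
	secondary_setup_calls++;
}

/* Registration: passing NULL means "no early per-CPU setup needed". */
static void set_start_secondary_hook(void (*start_secondary)(void))
{
	start_secondary_hook = start_secondary;
}

/* Called on each secondary CPU; a no-op when nothing is registered. */
static void sched_clock_start_secondary(void)
{
	if (start_secondary_hook)
		start_secondary_hook();
}
```

Calling sched_clock_start_secondary() before registration does nothing. After set_start_secondary_hook(demo_setup_secondary), each call runs the callback once.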


