Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: David Woodhouse @ 2026-05-20 18:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-42-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> If CPUID.0x16 is present and valid, use the CPU frequency provided by
> CPUID instead of assuming that the virtual CPU runs at the same
> frequency as TSC and/or kvmclock.  Back before constant TSCs were a
> thing, treating the TSC and CPU frequencies as one and the same was
> somewhat reasonable, but now it's nonsensical, especially if the
> hypervisor explicitly enumerates the CPU frequency.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 62c8ea2e6769..7607920ae386 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>  	}
>  }
>  
> +static unsigned long kvm_get_cpu_khz(void)
> +{
> +	unsigned int cpu_khz;
> +
> +	/*
> +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> +	 */
> +	if (!cpuid_get_cpu_freq(&cpu_khz))
> +		return cpu_khz;
> +
> +	return pvclock_tsc_khz(this_cpu_pvti());

I'm fine with this in principle but shouldn't the fallback be calling
kvm_get_tsc_khz() instead of directly calling pvclock_tsc_khz()?



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 01/41] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: David Woodhouse @ 2026-05-20 18:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-2-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 755 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Extract retrieval of TSC frequency information from CPUID into standalone
> helpers so that TDX guest support can reuse the logic.  Provide a version
> that includes the multiplier math as TDX does NOT want to use
> native_calibrate_tsc()'s fallback logic that derives the TSC frequency
> based on CPUID.0x16, when the core crystal frequency isn't known.
> 
> Opportunsitically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
> in favor of the kernel's preferred style.

"Opportunistically" ? Now that looks wrong too... 

> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* RE: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Michael Kelley @ 2026-05-20 19:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: sashiko-reviews@lists.linux.dev, linux-hyperv@vger.kernel.org
In-Reply-To: <ag3kFYLrypsBnlkY@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, May 20, 2026 9:41 AM
> 
> On Tue, May 19, 2026, Michael Kelley wrote:
> > From: Sean Christopherson <seanjc@google.com> Sent: Monday, May 18, 2026 3:18 PM
> > > > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > > > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > > > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> > > > >
> > > > >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> > > > >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > > > > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > > > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > > > > +
> > > > > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > > > > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > > > > +
> > > > > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > > > > +						  tsc_properties);
> > > > >  	}
> > > >
> > > > [ ... ]
> > > >
> > > > > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> > > > >  		 * is called.
> > > > >  		 */
> > > > >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > > > >  	}
> > > >
> > > > If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> > > > does it bypass the tsc_register_calibration_routines() block entirely?
> > >
> > > Yes.
> > >
> > > > Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> > > > here, it looks like these VMs will lose the reliable flag.
> > > >
> > > > Will this inadvertently enable the TSC watchdog, potentially causing a
> > > > performance regression if the system falsely marks the TSC as unstable due
> > > > to virtualization scheduling delays?
> > >
> > > Hmm, I was going to say that the change was intentional and desriable, but looking
> > > at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
> > > just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
> > > X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
> > > X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
> > > as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
> > > info, the host is still clearly telling the guest that the TSC is reliable.
> >
> > Yes, I agree. But I'm doubtful that such a combination ever occurs in practice.
> > I've never seen an occurrence of a Hyper-V (even really old versions) guest
> > where the frequency MSRs are not available or are not accessible. The
> > Hyper-V spec allows for either condition, so we have the code to test the
> > flags. I've thought of the flags as always set, though I suppose one never
> > knows what the future holds.
> >
> > >
> > > Michael, does keeping the
> > >
> > > 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > >
> > > but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
> > > sense?
> >
> > I don't see that it would break anything. But it seems a bit disjointed in
> > that HV_ACCESS_TSC_INVARIANT is tested in two places in
> > ms_hyperv_init_platform(). Does TSC_RELIABLE *need* to be passed to
> > tsc_register_calibration_routines() if later code in ms_hyperv_init_platform()
> > does setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)?
> 
> Sort of?  It's not strictly necessary, but passing in TSC_RELIABLE allows
> tsc_register_calibration_routines() to ensure it doesn't clobber a more robust
> calibration routine with a "lesser" routine.  I.e. not passing TSC_RELIABLE in
> this case would trigger a false positive (and break Hyper-V).
> 
> In other words, invoking setup_force_cpu_cap() is a (happy, desirable) side effect,
> not the primary goal.
> 
> > In other words, I'm suggesting let tsc_register_calibration_routines() handle
> > the TSC_FREQ_KNOWN case since that's what the calibration routines are all
> > about. Leave the setting of X86_FEATURE_TSC_RELIABLE to the later code that
> > tests HV_ACCESS_TSC_INVARIANT, instead of duplicating the
> > setup_force_cpu_cap() operation.
> >
> > While combining FREQUENCY_KNOWN and RELIABLE into
> > tsc_register_calibration_routines() is convenient, the two
> > concepts turn out to be independent when looking strictly at
> > the Hyper-V spec and code written to follow that spec.
> > Combining them into the same function ends up being clumsy
> 
> Yeah, it's a bit awkward for Hyper-V, but Hyper-V is definitely the odd one out
> here, in that it has an "out-of-band" feature that marks the TSC as reliable.
> All other PV features that trigger overrides of the calibration routines bundle
> the RELIABLE aspect with the feature itself.

Indeed, Hyper-V is definitely the odd one here. Keeping the
"setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)" in the later code
block where HV_ACCESS_TSC_INVARIANT is tested works well enough.
In the real world, the frequency MSRs are available and accessible,
so the additional "setup_force_cpu_cap()" is duplicative, but doesn't
hurt anything.

Michael

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 19:04 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-3-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 2333 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
> frequency calibration routines.  This will allow consolidating handling
> of common TSC properties that are forced by hypervisor (PV routines),
> and will also allow adding sanity checks to guard against overriding a
> TSC calibration routine with a routine that is less robust/trusted.
> 
> Make the CPU calibration routine optional, as Xen (very sanely) doesn't
> assume the CPU runs as the same frequency as the TSC.
> 
> Wrap the helper in an #ifdef to document that the kernel overrides
> the native routines when running as a VM, and to guard against unwanted
> usage.  Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
> depend on HYPERVISOR_GUEST because it gates both guest and host code.
> 
> No functional change intended.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Mildly concerned that we might want to support multiple options — does
it have CPUID 0x15? Does it have 0x40000x10? Does it have a pvclock?
There are various permutations of those which are perhaps best handled
by *trying* each one, in some order, and populating a struct with the
answers?

But on the basis that perfect is the enemy of good,

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index b5991d53fc0e..e9e7394140dd 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -321,8 +321,8 @@ void __init kvmclock_init(void)
>  	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>  	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>  
> -	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> -	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> +	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> +
>  	x86_platform.get_wallclock = kvm_get_wallclock;
>  	x86_platform.set_wallclock = kvm_set_wallclock;
>  #ifdef CONFIG_X86_LOCAL_APIC

Can we move those (and maybe everything in the context there too) up
*before* the check for no-kvmclock at the top of the function? Probably
in a separate patch.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-20 19:06 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <0a3aa07a1d3c4bec2b89f8026093969155b73caa.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > If CPUID.0x16 is present and valid, use the CPU frequency provided by
> > CPUID instead of assuming that the virtual CPU runs at the same
> > frequency as TSC and/or kvmclock.  Back before constant TSCs were a
> > thing, treating the TSC and CPU frequencies as one and the same was
> > somewhat reasonable, but now it's nonsensical, especially if the
> > hypervisor explicitly enumerates the CPU frequency.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index 62c8ea2e6769..7607920ae386 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >  	}
> >  }
> >  
> > +static unsigned long kvm_get_cpu_khz(void)
> > +{
> > +	unsigned int cpu_khz;
> > +
> > +	/*
> > +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> > +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> > +	 */
> > +	if (!cpuid_get_cpu_freq(&cpu_khz))
> > +		return cpu_khz;
> > +
> > +	return pvclock_tsc_khz(this_cpu_pvti());
> 
> I'm fine with this in principle but shouldn't the fallback be calling
> kvm_get_tsc_khz() instead of directly calling pvclock_tsc_khz()?

Oh, yeah, for this patch, definitely yes, so that there's no side effects.  The
question really should be answered in the context of "x86/kvmclock: Obtain TSC
frequency from CPUID if present", which subtly impacts the CPU frequency, but I
think the answer is "yes" there as well.

^ permalink raw reply

* Re: [PATCH v3 03/41] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: David Woodhouse @ 2026-05-20 19:06 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-4-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 653 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the code to mark the TSC as reliable from sme_early_init() to
> snp_secure_tsc_init().  The only reader of TSC_RELIABLE is the aptly
> named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
> after snp_secure_tsc_init().
> 
> This will allow consolidating the handling of TSC_KNOWN_FREQ and
> TSC_RELIABLE when overriding the TSC calibration routine.
> 
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>


Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 04/41] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
From: David Woodhouse @ 2026-05-20 19:16 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-5-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the check on having a Secure TSC to the common tsc_early_init() so
> that it's obvious that having a Secure TSC is conditional, and to prepare
> for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
> logic looks especially weird).
> 
> No functional change intended.
> 
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-20 19:26 UTC (permalink / raw)
  To: Yu Zhang, Jason Gunthorpe
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <lxmfd2ml5dafkxquuf5i45uqgh6wxl44hlqphu77kvximqrnmi@b3htoyjc6d5z>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> 
> On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > +					  unsigned long start,
> > > +					  unsigned long end)
> > > +{
> > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > +	u16 count = 0;
> > > +
> > > +	while (nr_pages > 0) {
> > > +		unsigned long flush_pages;
> > > +		int order;
> > > +		unsigned long pfn_align;
> > > +		unsigned long size_align;
> > > +
> > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > +			break;
> > > +		}
> > > +
> > > +		if (start_pfn)
> > > +			pfn_align = __ffs(start_pfn);
> > > +		else
> > > +			pfn_align = BITS_PER_LONG - 1;
> > > +
> > > +		size_align = __fls(nr_pages);
> > > +		order = min(pfn_align, size_align);
> > > +		iova_list[count].page_mask_shift = order;
> > > +		iova_list[count].page_number = start_pfn;
> > > +
> > > +		flush_pages = 1UL << order;
> > > +		start_pfn += flush_pages;
> > > +		nr_pages -= flush_pages;
> > > +		count++;
> > > +	}
> >
> > This seems like a really silly hypervisor interface. Why doesn't it
> > just accept a normal range? Splitting it into power of two aligned
> > ranges is very inefficient.
> 
> Fair point. I'm not sure how much flexibility we have to change
> this hypercall interface at the moment - it predates the pvIOMMU
> work and may have other consumers beyond Linux guest. On the other
> hand, having the guest specify 2^N-aligned blocks does save the
> hypervisor from having to decompose ranges itself before issuing
> hardware invalidation commands - the guest-provided entries can be
> fed to the HW more or less directly.
> 
> That said, the way I'm currently using this interface may be
> more precise than necessary. Maybe we have 2 options:
> 
> 1) Current approach: decompose the range into multiple exact
>    2^N-aligned blocks with no over-flush, but at the cost of
>    more complex calculations and more entries.
> 
> 2) Follow what Intel/AMD drivers do: find a single minimal
>    2^N-aligned block that covers the entire range, but may
>    over-flush.
> 
> Any preference?
> 
> @Michael, since you've also been reviewing this patch, I'd
> appreciate your thoughts on the above as well. :)
> 

I'm just guessing, but perhaps flushing an aligned power-of-2
range can be processed by the hypervisor at a relatively fixed
cost, regardless of the size. Having the guest do the decomposing
of an arbitrary range allows the hypervisor to make use of the
existing "rep" hypercall mechanism if the hypercall is taking
"too long". The hypervisor can pause its processing, return to
the guest temporarily, and then continue the hypercall. If the
arbitrary range were passed into the hypercall for the hypervisor
to do the decomposing, that pause-and-restart mechanism
wouldn't be available.

Of course, Linux doesn't really take advantage of the pause to
reduce guest interrupt latency because the Hyper-V code in
Linux typically disable interrupts around a hypercall due to the
way the hypercall input page is allocated. But other guest
operating systems might benefit from such a pause. And we could
probably fix the Hyper-V code in Linux to allow interrupts during a
hypercall pause/restart if long-running hypercalls turn out to be
a problem.

Regarding proposal (1) vs. (2), perhaps you could confirm with
the Hyper-V team that flushing an aligned power-of-2 range
has relatively fixed cost, regardless of the size. And what do the
flush requests coming from the generic IOMMU subsystem look
like? Do they match dma_unmap() ranges, which are probably
dominated by relatively small ranges of a few pages at most,
with a few outliers for disk I/O requests of 1 MiB or some such?
If the dominant flush request is for a few pages at most, then
doing (2) seems reasonable.

Michael

^ permalink raw reply

* Re: [PATCH v3 05/41] x86/tdx: Override PV calibration routines with CPUID-based calibration
From: David Woodhouse @ 2026-05-20 19:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-6-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly override the TSC frequency
> calibration routine with CPUID-based calibration instead of potentially
> relying on a hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15
> is always emulated by the TDX-Module, i.e. the information from CPUID is
> more trustworthy than the information provided by the hypervisor.
> 
> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency.  While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when that latter is enumerated via
> CPUID:
> 
>   The APIC timer frequency will be the processor’s bus clock or core
>   crystal clock frequency (when TSC/core crystal clock ratio is enumerated
>   in CPUID leaf 0x15).
> 
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.
> 
> Deliberately leave the CPU frequency calibration routine as is, since the
> TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
> 
> Opportunistically add a comment explaining that CoCo TSC initialization
> needs to come after hypervisor specific initialization.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I don't much like stuffing the lapic_timer_period... but I'll give you
'least awful option'. For now.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: David Woodhouse @ 2026-05-20 20:01 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-7-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Mark the TSC frequency as known when using ACRN's PV CPUID information.
> Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> the TSC freq is explicitly provided by the hypervisor.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/acrn.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index c1506cb87d8c..2da3de4d470e 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
>  	/* Install system interrupt handler for ACRN hypervisor callback */
>  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
>  
> +	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>  	tsc_register_calibration_routines(acrn_get_tsc_khz,
>  					  acrn_get_tsc_khz);

I'd feel slightly happier doing that from within acrn_get_tsc_khz()
itself.... which I note is 'static inline'. I'm vaguely surprised that
even builds (although it does). Probably nicer to move it explicitly
out of line in acrn.c though.

Most of that should be a comment on patch  2 of the series, I guess?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Jacob Pan @ 2026-05-20 20:40 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Yu Zhang, Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com, jacob.pan
In-Reply-To: <SN6PR02MB4157C1EC7F5F69C5ABDA9C7FD4012@SN6PR02MB4157.namprd02.prod.outlook.com>

Hi Michael,

On Wed, 20 May 2026 19:26:24 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Michael Kelley <mhklinux@outlook.com>
> To: Yu Zhang <zhangyu1@linux.microsoft.com>, Jason Gunthorpe
> <jgg@ziepe.ca> CC: "linux-kernel@vger.kernel.org"
> <linux-kernel@vger.kernel.org>,  "linux-hyperv@vger.kernel.org"
> <linux-hyperv@vger.kernel.org>,  "iommu@lists.linux.dev"
> <iommu@lists.linux.dev>, "linux-pci@vger.kernel.org"
> <linux-pci@vger.kernel.org>, "linux-arch@vger.kernel.org"
> <linux-arch@vger.kernel.org>, "wei.liu@kernel.org"
> <wei.liu@kernel.org>,  "kys@microsoft.com" <kys@microsoft.com>,
> "haiyangz@microsoft.com"  <haiyangz@microsoft.com>,
> "decui@microsoft.com" <decui@microsoft.com>,  "longli@microsoft.com"
> <longli@microsoft.com>, "joro@8bytes.org"  <joro@8bytes.org>,
> "will@kernel.org" <will@kernel.org>,  "robin.murphy@arm.com"
> <robin.murphy@arm.com>, "bhelgaas@google.com"  <bhelgaas@google.com>,
> "kwilczynski@kernel.org" <kwilczynski@kernel.org>,
> "lpieralisi@kernel.org" <lpieralisi@kernel.org>, "mani@kernel.org"
> <mani@kernel.org>, "robh@kernel.org" <robh@kernel.org>,
> "arnd@arndb.de"  <arnd@arndb.de>, "jacob.pan@linux.microsoft.com"
> <jacob.pan@linux.microsoft.com>, "tgopinath@linux.microsoft.com"
> <tgopinath@linux.microsoft.com>,
> "easwar.hariharan@linux.microsoft.com"
> <easwar.hariharan@linux.microsoft.com> Subject: RE: [PATCH v1 4/4]
> iommu/hyperv: Add page-selective IOTLB flush  support Date: Wed, 20
> May 2026 19:26:24 +0000
> 
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May
> 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:  
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:  
> > > > +static inline u16 hv_iommu_fill_iova_list(union
> > > > hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}  
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't
> > > it just accept a normal range? Splitting it into power of two
> > > aligned ranges is very inefficient.  
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> >   
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disable interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
I am not sure if this pause feature is suitable for IOTLB flush at all
since it is inherently synchronous — the caller must block until all
invalidations complete. Pausing mid-flush to return to the guest
doesn't help if the guest can't make forward progress anyway.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 20:44 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <44e0d60548d317fd59895f18bd17220dfb2f834b.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index b5991d53fc0e..e9e7394140dd 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -321,8 +321,8 @@ void __init kvmclock_init(void)
> >  	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> >  	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> >  
> > -	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> > -	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> > +	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> > +
> >  	x86_platform.get_wallclock = kvm_get_wallclock;
> >  	x86_platform.set_wallclock = kvm_set_wallclock;
> >  #ifdef CONFIG_X86_LOCAL_APIC
> 
> Can we move those (and maybe everything in the context there too) up
> *before* the check for no-kvmclock at the top of the function?

Oof, I was going to say "no", but disabling kvmclock is exactly the workaround
I've told people to use to get the kernel to use the TSC instead of kvmclock.

> Probably in a separate patch.

Ya.  I think this?

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 08ee4bc304c8..92a1ebf31e4d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,6 +27,7 @@ static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init;
 static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
+static unsigned int kvm_tsc_khz_cpuid __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
 {
@@ -207,7 +208,7 @@ static unsigned long kvm_get_tsc_khz(void)
                lapic_timer_period = apic_khz * 1000 / HZ;
 #endif
 
-       return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
+       return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());
 }
 
 static unsigned long kvm_get_cpu_khz(void)
@@ -387,9 +388,39 @@ void __init kvmclock_init(void)
        enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
        bool stable = false;
 
-       if (!kvm_para_available() || !kvmclock)
+       if (!kvm_para_available())
                return;
 
+       /*
+        * If the TSC counts at a constant frequency across P/T states, counts
+        * in deep C-states, and the TSC hasn't been marked unstable, treat the
+        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
+        * exists purely to honor the TSC being marked unstable via command
+        * line, any runtime detection of an unstable will happen after this.
+        */
+       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+           !check_tsc_unstable())
+               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+
+       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
+
+       /*
+        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
+        * PV CPUID leaf even if kvmclock itself is disabled via command line.
+        * The PV CPUID information isn't dependent on kvmclock in any way, and
+        * in fact using the precise information is *more* important when the
+        * user has explicitly disabled kvmclock to force the kernel to use the
+        * TSC as its clocksource.
+        */
+       if (!kvmclock) {
+               if (kvm_tsc_khz_cpuid)
+                       tsc_register_calibration_routines(kvm_get_tsc_khz,
+                                                         kvm_get_cpu_khz,
+                                                         tsc_properties);
+               return;
+       }
+
        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
                msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
                msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
@@ -424,21 +455,14 @@ void __init kvmclock_init(void)
        }
 
        /*
-        * If the TSC counts at a constant frequency across P/T states, counts
-        * in deep C-states, and the TSC hasn't been marked unstable, prefer
-        * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
-        * that TSC is chosen as the clocksource.  Note, the TSC unstable check
-        * exists purely to honor the TSC being marked unstable via command
-        * line, any runtime detection of an unstable will happen after this.
+        * If the TSC is reliable (see above), prefer the TSC over kvmclock for
+        * sched_clock and drop kvmclock's rating so that TSC is chosen as the
+        * clocksource.
         */
-       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-           !check_tsc_unstable()) {
+       if (tsc_properties & TSC_RELIABLE)
                kvm_clock.rating = 299;
-               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
-       } else {
+       else
                kvm_sched_clock_init(stable);
-       }
 
        tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
                                          tsc_properties);

^ permalink raw reply related

* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: Sean Christopherson @ 2026-05-20 21:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <c347d65f555ad1e10a0e87ee57c5879c7046d0e7.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Mark the TSC frequency as known when using ACRN's PV CPUID information.
> > Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> > the TSC freq is explicitly provided by the hypervisor.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/acrn.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> > index c1506cb87d8c..2da3de4d470e 100644
> > --- a/arch/x86/kernel/cpu/acrn.c
> > +++ b/arch/x86/kernel/cpu/acrn.c
> > @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
> >  	/* Install system interrupt handler for ACRN hypervisor callback */
> >  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
> >  
> > +	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> >  	tsc_register_calibration_routines(acrn_get_tsc_khz,
> >  					  acrn_get_tsc_khz);
> 
> I'd feel slightly happier doing that from within acrn_get_tsc_khz()
> itself....

Ya, though as you note below, this is really a comment on the overall series.

Waiting to set the cap until the calibration routine is actually run does prevent
the case where the something registers a calibration routine, but its routine is
never run.

However, because the cap is sticky, it doesn't handle the scenario where its
routine _does_ run, but the kernel ultimately throws away its calibration in favor
of something else.

Further complicating things is that ~half of the paravirt flows already force set
caps before their routines are invoked:

  snp_secure_tsc_init(), jailhouse_init_platform(), ms_hyperv_init_platform(),
  vmware_set_capabilities()

Rather than trying to convince everyone that waiting is better, despite that
approach still being flawed, I chose to handle this by ensuring that once the TSC
is marked known and/or reliable, the kernel won't replace the calibration routine
with a "lesser" source:

  x86/tsc: Rejects attempts to override TSC calibration with lesser routine

That doesn't completely prevent the kernel from being stupid, but it should prevent
both the case where X's routine is registered but never run, as well as the case
where it's run but ultimately ignored.

> which I note is 'static inline'. I'm vaguely surprised that even builds
> (although it does).

Heh, KVM x86 (in the host) does stupid things like this too.  E.g. kvm_pdptr_read()
is a static inline, but then wired up as a function pointer in three different
places.

> Probably nicer to move it explicitly out of line in acrn.c though.
> 
> Most of that should be a comment on patch  2 of the series, I guess?

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-20 21:06 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515195934.C790BC2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index 62c8ea2e67691..7607920ae386e 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >  	}
> >  }
> >  
> > +static unsigned long kvm_get_cpu_khz(void)
> > +{
> > +	unsigned int cpu_khz;
> > +
> > +	/*
> > +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> > +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> > +	 */
> > +	if (!cpuid_get_cpu_freq(&cpu_khz))
> > +		return cpu_khz;
> > +
> > +	return pvclock_tsc_khz(this_cpu_pvti());
> 
> Before this patch, both calibrate_cpu and calibrate_tsc used
> kvm_get_tsc_khz(), which checks kvm_para_tsc_khz() to get the exact TSC
> frequency from KVM's PV CPUID timing leaf, and only falls back to
> pvclock_tsc_khz() if it is unavailable.
> 
> Does the fallback path in kvm_get_cpu_khz() bypass kvm_para_tsc_khz()
> entirely? For VMs that expose KVM's PV CPUID timing leaf but not
> CPUID.0x16, it looks like cpu_khz will now be reverse-calculated from
> pvclock multipliers instead of using the exact hypervisor-provided value.
> 
> Could this introduce a precision regression due to truncation loss, causing
> cpu_khz and tsc_khz to needlessly diverge on the fallback path?

David pointed this out too, I'll fix in the next version.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4dMc2B3JQi4vxU@google.com>

[-- Attachment #1: Type: text/plain, Size: 2384 bytes --]

On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> 
> +       /*
> +        * If the TSC counts at a constant frequency across P/T states, counts
> +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> +        * exists purely to honor the TSC being marked unstable via command
> +        * line, any runtime detection of an unstable will happen after this.
> +        */
> +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +           !check_tsc_unstable())
    { 
> +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;

    kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
                     *purely* to work around a TSC which *doesn't*
                     have those properties checked above. */
    }

I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
that *too* for paranoia? But hopefully the checks you have above are
equivalent?

> +
> +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> +
> +       /*
> +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> +        * in fact using the precise information is *more* important when the
> +        * user has explicitly disabled kvmclock to force the kernel to use the
> +        * TSC as its clocksource.
> +        */
> +       if (!kvmclock) {
> +               if (kvm_tsc_khz_cpuid)
> +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> +                                                         kvm_get_cpu_khz,
> +                                                         tsc_properties);
> +               return;
> +       }
> +


Regardless of the above, why not just register these here
unconditionally, and remove the later call that does the same?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: David Woodhouse @ 2026-05-20 21:27 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock.  Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant.  The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
> 
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
> 
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: David Woodhouse @ 2026-05-20 21:29 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
> 
> No functional change intended.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: David Woodhouse @ 2026-05-20 21:30 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks.  When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
> 
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 21:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <621e10bdc9e297c6c600b561d8fa25c3b62968bc.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> > 
> > +       /*
> > +        * If the TSC counts at a constant frequency across P/T states, counts
> > +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > +        * exists purely to honor the TSC being marked unstable via command
> > +        * line, any runtime detection of an unstable will happen after this.
> > +        */
> > +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > +           !check_tsc_unstable())
>     { 
> > +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> 
>     kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
>                      *purely* to work around a TSC which *doesn't*
>                      have those properties checked above. */

kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/

>     }
> 
> I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
> that *too* for paranoia?

No? PVCLOCK_TSC_STABLE is property of kvmclock more than it's a property of the
TSC itself.  And for the no-kvmclock case, we most definitely don't want to setup
kvmclock just to query that flag.

> But hopefully the checks you have above are equivalent?

They aren't as paranoid, but if the host enumerates CONSTANT+NONSTOP TSC despite
KVM-the-host not being able to advertise PVCLOCK_TSC_STABLE_BIT, then the VMM done
messed up.

> > +
> > +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > +
> > +       /*
> > +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> > +        * in fact using the precise information is *more* important when the
> > +        * user has explicitly disabled kvmclock to force the kernel to use the
> > +        * TSC as its clocksource.
> > +        */
> > +       if (!kvmclock) {
> > +               if (kvm_tsc_khz_cpuid)
> > +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> > +                                                         kvm_get_cpu_khz,
> > +                                                         tsc_properties);
> > +               return;
> > +       }
> > +
> 
> 
> Regardless of the above, why not just register these here
> unconditionally, and remove the later call that does the same?

Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
to get the frequency from the PV clock struct:

	return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4or9-9c6VZxqya@google.com>

[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]

On Wed, 2026-05-20 at 14:33 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> > > 
> > > +       /*
> > > +        * If the TSC counts at a constant frequency across P/T states, counts
> > > +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > > +        * exists purely to honor the TSC being marked unstable via command
> > > +        * line, any runtime detection of an unstable will happen after this.
> > > +        */
> > > +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > +           !check_tsc_unstable())
> >     { 
> > > +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > 
> >     kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
> >                      *purely* to work around a TSC which *doesn't*
> >                      have those properties checked above. */
> 
> kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/

Um, SYSTEM_TIME *is* the kvmclock of which we speak, isn't it?

WALL_CLOCK is something else, and I guess we should still consume that,
yes.

> 
> > > +
> > > +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > > +
> > > +       /*
> > > +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > > +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > > +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> > > +        * in fact using the precise information is *more* important when the
> > > +        * user has explicitly disabled kvmclock to force the kernel to use the
> > > +        * TSC as its clocksource.
> > > +        */
> > > +       if (!kvmclock) {
> > > +               if (kvm_tsc_khz_cpuid)
> > > +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> > > +                                                         kvm_get_cpu_khz,
> > > +                                                         tsc_properties);
> > > +               return;
> > > +       }
> > > +
> > 
> > 
> > Regardless of the above, why not just register these here
> > unconditionally, and remove the later call that does the same?
> 
> Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
> is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
> to get the frequency from the PV clock struct:
> 
> 	return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());

Ah, right. Thanks. Ick though :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 10/41] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
From: David Woodhouse @ 2026-05-20 21:45 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-11-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 514 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Gate kvmclock's secondary CPU code on CONFIG_SMP, not CONFIG_X86_LOCAL_APIC.
> Originally, kvmclock piggybacked PV APIC ops to setup secondary CPUs.
> When that wart was fixed by commit df156f90a0f9 ("x86: Introduce
> x86_cpuinit.early_percpu_clock_init hook"), the dependency on a local APIC
> got carried forward unnecessarily.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 11/41] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: David Woodhouse @ 2026-05-20 21:51 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-12-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Don't disable kvmclock on the BSP during syscore_suspend(), as the BSP's
> clock is NOT restored during syscore_resume(), but is instead restored
> earlier via the sched_clock restore callback.  If suspend is aborted, e.g.
> due to a late wakeup, the BSP will run without its clock enabled, which
> "works" only because KVM-the-hypervisor is kind enough to not clobber the
> shared memory when the clock is disabled.  But over time, the BSP's view
> of time will drift from APs.

The work that Dongli and I have been doing elsewhere makes me want to
phrase that last sentence far more judgmentally, as the kvmclock
absolutely should *not* be randomly drifting over time without good
reason, and what was written to it earlier really *ought* to still be
valid.

But OK :)

> Plumb in an "action" to KVM-as-a-guest and kvmclock code in preparation
> for additional cleanups to kvmclock's suspend/resume logic.
> 
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 12/41] x86/paravirt: Remove unnecessary PARAVIRT=n stub for paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:53 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-13-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 526 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Remove the unnecessary paravirt_set_sched_clock() stub for PARAVIRT=n, as
> all callers are gated by PARAVIRT=y.  Eliminating the stub will avoid a
> pile of pointless churn as the "real" implementation evolves.
> 
> No functional change intended.
> 
> Fixes: 39965afb1151 ("x86/paravirt: Move paravirt_sched_clock() related code into tsc.c")
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 13/41] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:57 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-14-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 853 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the handling of unstable PV clocks, of which kvmclock is the only
> example, into paravirt_set_sched_clock().  This will allow modifying
> paravirt_set_sched_clock() to keep using the TSC for sched_clock in
> certain scenarios without unintentionally marking the TSC-based clock as
> unstable.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> -void paravirt_set_sched_clock(u64 (*func)(void));
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
> +
> +static inline void paravirt_set_sched_clock(u64 (*func)(void))
> +{
> +	__paravirt_set_sched_clock(func, true);
> +}

One of the few things I actually like about C++ is default arguments...


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 14/41] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
From: David Woodhouse @ 2026-05-20 21:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-15-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 345 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move kvmclock's sched_clock save/restore helper "up" so that they can
> (eventually) be referenced by kvm_sched_clock_init().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox