Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-21 14:34 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C1EC7F5F69C5ABDA9C7FD4012@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't it
> > > just accept a normal range? Splitting it into power of two aligned
> > > ranges is very inefficient.
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> > 
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disable interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
> 
> Regarding proposal (1) vs. (2), perhaps you could confirm with
> the Hyper-V team that flushing an aligned power-of-2 range
> has relatively fixed cost, regardless of the size. And what do the
> flush requests coming from the generic IOMMU subsystem look
> like? Do they match dma_unmap() ranges, which are probably
> dominated by relatively small ranges of a few pages at most,
> with a few outliers for disk I/O requests of 1 MiB or some such?
> If the dominant flush request is for a few pages at most, then
> doing (2) seems reasonable.

Thanks for the thoughtful suggestions, Michael!

I believe the time might be dominated by the number of descriptors,
instead of the size of each range, especially when device TLB
invalidations are involved.

Here's my understanding of what hypervisor does in its handler:

Hyper-V constructs one IOTLB invalidation descriptor (and possibly
a Device TLB invalidation descriptor as well) per iova_list entry
and submits them to the HW invalidation queue, then synchronously
waits for completion. So multiple 2^N-aligned entries should be less
efficient than a single larger 2^N aligned one.

Since both options submit 2^N-aligned entries to the hypervisor,
either one single coarser-grained entry or a precise decomposition,
I'm now also leaning towards option 2, which is also what Intel/AMD
drivers do for page-selective IOTLB flush. Simpler guest code, faster
flush, and the hypervisor can feed the single entry almost directly
to HW.

Yu

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-21 14:13 UTC (permalink / raw)
  To: Sean Christopherson, Peter Zijlstra
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Juergen Gross,
	Daniel Lezcano, Thomas Gleixner, John Stultz, Rick Edgecombe,
	Vitaly Kuznetsov, Broadcom internal kernel review list,
	Boris Ostrovsky, Stephen Boyd, x86, linux-coco, kvm, linux-hyperv,
	virtualization, linux-kernel, xen-devel, Michael Kelley,
	Tom Lendacky, Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8K2FRGcoEa-D2Y@google.com>

[-- Attachment #1: Type: text/plain, Size: 2440 bytes --]

On Thu, 2026-05-21 at 06:38 -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, Peter Zijlstra wrote:
> > On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > > On Thu, May 21, 2026, David Woodhouse wrote:
> > > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > > bringup when it's being used as sched_clock.
> > > > > 
> > > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > > initialized and thus registers its hooks.
> > > > > 
> > > > > Opportunsitically make the registration messages more precise to help
> > > > > debug issues where kvmclock is enabled too late.
> > > > 
> > > > That's a hard word to type, isn't it?
> > > 
> > > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > > while, it is time...
> > 
> > 'z=' not good enough?
> 
> You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

I reached the peak of my vi knowledge in about 1995 when I learned that
I could log in on another terminal, kill it from there, and then set
EDITOR=emacs.

Ironically I still find myself doing that kind of thing when I'm
composing a git-send-email cover letter and decide I don't want to send
the series as-is at all. Maybe there's a way to put a poison pill in
the message (or save it unchanged?) to make git *NOT* send anything...
but I always err on the side of caution and just nuke it from orbit, or
at least from another terminal.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260521131019.GI3126523@noisy.programming.kicks-ass.net>

On Thu, May 21, 2026, Peter Zijlstra wrote:
> On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > On Thu, May 21, 2026, David Woodhouse wrote:
> > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > bringup when it's being used as sched_clock.
> > > > 
> > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > initialized and thus registers its hooks.
> > > > 
> > > > Opportunsitically make the registration messages more precise to help
> > > > debug issues where kvmclock is enabled too late.
> > > 
> > > That's a hard word to type, isn't it?
> > 
> > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > while, it is time...
> 
> 'z=' not good enough?

You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Peter Zijlstra @ 2026-05-21 13:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8Bpc_uVNrNWqfX@google.com>

On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, David Woodhouse wrote:
> > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > bringup when it's being used as sched_clock.
> > > 
> > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > initialized and thus registers its hooks.
> > > 
> > > Opportunsitically make the registration messages more precise to help
> > > debug issues where kvmclock is enabled too late.
> > 
> > That's a hard word to type, isn't it?
> 
> Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> while, it is time...

'z=' not good enough?


^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 12:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <423b37f056f0d4d596d5f4cc73802fb1079ecf63.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > their kvmclock during CPU online.  While a redundant write to the MSR is
> > technically ok, skip the registration when kvmclock is sched_clock so that
> > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > bringup when it's being used as sched_clock.
> > 
> > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > work today, because KVM's hooks happen to run after/before timekeeping's
> > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > which syscore_ops are invoked depends entirely on when a subsystem is
> > initialized and thus registers its hooks.
> > 
> > Opportunsitically make the registration messages more precise to help
> > debug issues where kvmclock is enabled too late.
> 
> That's a hard word to type, isn't it?

Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
while, it is time...

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-21 12:27 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	mhklinux, tgopinath, easwar.hariharan
In-Reply-To: <20260520112708.00003640@linux.microsoft.com>

On Wed, May 20, 2026 at 11:27:08AM -0700, Jacob Pan wrote:
> Hi Yu,
> 
> On Wed, 20 May 2026 23:25:43 +0800
> Yu Zhang <zhangyu1@linux.microsoft.com> wrote:
> 
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_identity_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};
> > > > +
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_blocking_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};  
> > > 
> > > Usually I would expect these to have their own attach
> > > functions. blocking in particular must have an attach op that cannot
> > > fail. It is used to recover the device back to a known translation
> > > in case of cascading other errors.
> > >   
> > 
> > For blocking domain, the hypercall handler of such attach essentially
> > disable the translation and IOPF for the device.
> I think this should disable all faults, including unrecoverable fault
> reporting. right?

Yes, it should (e.g., my understanding is it shall set the FPD in scalable
mode context entry). Will double confirm with hypervisor team and make
sure it behaves so.

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Dongli Zhang @ 2026-05-21  9:14 UTC (permalink / raw)
  To: Sean Christopherson, kvm
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>



On 2026-05-15 12:19 PM, Sean Christopherson wrote:
> Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.  I.e. use the same
> criteria as tweaking the clocksource rating so that TSC is preferred over
> kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> is more tolerant of slop than clocksource; using TSC for clocksource but
> not sched_clock makes little to no sense, especially now that KVM CoCo
> guests with a trusted TSC use TSC, not kvmclock.
> 
>         /*
>          * Fall back to jiffies if there's no TSC available:
>          * ( But note that we still use it if the TSC is marked
>          *   unstable. We do this because unlike Time Of Day,
>          *   the scheduler clock tolerates small errors and it's
>          *   very important for it to be as fast as the platform
>          *   can achieve it. )
>          */
> 
> The only advantage of using kvmclock is that doing so allows for early
> and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> broken for over two years with nary a complaint, i.e. it can't be
> _that_ valuable.  And as above, certain types of KVM guests are losing
> the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> needs to be decoupled from sched_clock() no matter what.

Has it been broken for two years because of pvclock_clocksource_read_nowd()?

Thank you very much!

Dongli Zhang

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Jakub Kicinski @ 2026-05-21  0:17 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

On Mon, 18 May 2026 12:43:51 -0700 Dipayaan Roy wrote:
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);
>  	if (err) {
>  		netdev_err(ndev, "mana_detach failed: %d\n", err);
>  		goto dealloc_pre_rxbufs;
>  	}
>  
> +attach:

goto's are acceptable for error unwinding, not to jump around 
a function seemingly to avoid indenting something. Please use
normal constructs or perhaps move the netif_device_present()
into mana_detach() as an early exit condition? 

>  	err = mana_attach(ndev);

^ permalink raw reply

* Re: [PATCH v3 39/41] x86/paravirt: Move using_native_sched_clock() stub into timer.h
From: David Woodhouse @ 2026-05-21  0:00 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-40-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 382 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that timer.h ended up with CONFIG_PARAVIRT #ifdeffery anyways, move the
> PARAVIRT=n using_native_sched_clock() stub into timer.h as a "free"
> optimization.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 38/41] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
From: David Woodhouse @ 2026-05-20 23:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-39-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 699 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Rework the seemingly generic x86_cpuinit_ops.early_percpu_clock_init hook
> into a dedicated PV sched_clock hook, as the only reason the hook exists
> is to allow kvmclock to enable its PV clock on secondary CPUs before the
> kernel tries to reference sched_clock, e.g. when grabbing a timestamp for
> printk.
> 
> Rearranging the hook doesn't exactly reduce complexity; arguably it does
> the opposite.  But as-is, it's practically impossible to understand *why*
> kvmclock needs to do early configuration.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: David Woodhouse @ 2026-05-20 23:56 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.  I.e. use the same
> criteria as tweaking the clocksource rating so that TSC is preferred over
> kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> is more tolerant of slop than clocksource; using TSC for clocksource but
> not sched_clock makes little to no sense, especially now that KVM CoCo
> guests with a trusted TSC use TSC, not kvmclock.
> 
>         /*
>          * Fall back to jiffies if there's no TSC available:
>          * ( But note that we still use it if the TSC is marked
>          *   unstable. We do this because unlike Time Of Day,
>          *   the scheduler clock tolerates small errors and it's
>          *   very important for it to be as fast as the platform
>          *   can achieve it. )
>          */
> 
> The only advantage of using kvmclock is that doing so allows for early
> and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> broken for over two years with nary a complaint, i.e. it can't be
> _that_ valuable.  And as above, certain types of KVM guests are losing
> the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> needs to be decoupled from sched_clock() no matter what.
> 
> Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Yay! (Albeit only for sched_clock, and we should do Xen too)

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: Woodhouse, David @ 2026-05-20 23:55 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-37-seanjc@google.com>


[-- Attachment #1.1: Type: text/plain, Size: 647 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When running as a KVM guest with kvmclock support enabled, stuff the APIC
> timer period/frequency with the local APIC bus frequency reported in
> CPUID.0x40000010.EBX instead of trying to calibrate/guess the frequency.
> 
> See Documentation/virt/kvm/x86/cpuid.rst for details.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I still don't much like the way this is done inside kvm_get_tsc_khz().

We also probably ought to be looking for the timing leaf on other
hypervisors including VMware and probably Bhyve too. Should it be done
somewhere else?



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5964 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 215 bytes --]




Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.



[-- Attachment #2.2: Type: text/html, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 33/41] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-05-20 23:51 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-34-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Mark the TSC as reliable if the hypervisor (KVM) has enumerated the TSC
> as constant and nonstop, and the admin hasn't explicitly marked the TSC
> as unstable.  Like most (all?) virtualization setups, any secondary
> clocksource that's used as a watchdog is guaranteed to be less reliable
> than a constant, nonstop TSC, as all clocksources the kernel uses as a
> watchdog are all but guaranteed to be emulated when running as a KVM
> guest.  I.e. any observed discrepancies between the TSC and watchdog will
> be due to jitter in the watchdog.
> 
> This is especially true for KVM, as the watchdog clocksource is usually
> emulated in host userspace, i.e. reading the clock incurs a roundtrip
> cost of thousands of cycles.
> 
> Marking the TSC reliable addresses a flaw where the TSC will occasionally
> be marked unstable if the host is under moderate/heavy load.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 32/41] x86/tsc: Rejects attempts to override TSC calibration with lesser routine
From: David Woodhouse @ 2026-05-20 23:50 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-33-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When registering a TSC frequency calibration routine, sanity check that
> the incoming routine is as robust as the outgoing routine, and reject the
> incoming routine if the sanity check fails.
> 
> Because native calibration routines only mark the TSC frequency as known
> and reliable when they actually run, the effective progression of
> capabilities is: None (native) => Known and maybe Reliable (PV) =>
> Known and Reliable (CoCo).  Violating that progression for a PV override
> is relatively benign, but messing up the progression when CoCo is
> involved is more problematic, as it likely means a trusted source of
> information (hardware/firmware) is being discarded in favor of a less
> trusted source (hypervisor).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Woodhouse, David @ 2026-05-20 23:49 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-32-seanjc@google.com>


[-- Attachment #1.1: Type: text/plain, Size: 1363 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a "tsc_properties" set of flags and use it to annotate whether the
> TSC operates at a known and/or reliable frequency when registering a
> paravirtual TSC calibration routine.  Currently, each PV flow manually
> sets the associated feature flags, but often in haphazard fashion that
> makes it difficult for unfamiliar readers to see the properties of the
> TSC when running under a particular hypervisor.
> 
> The other, bigger issue with manually setting the feature flags is that
> it decouples the flags from the calibration routine.  E.g. in theory, PV
> code could mark the TSC as having a known frequency, but then have its
> PV calibration discarded in favor of a method that doesn't use that known
> frequency.  Passing the TSC properties along with the calibration routine
> will allow adding sanity checks to guard against replacing a "better"
> calibration routine with a "worse" routine.
> 
> As a bonus, the flags also give developers working on new PV code a heads
> up that they should at least mark the TSC as having a known frequency.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5964 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 215 bytes --]




Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.



[-- Attachment #2.2: Type: text/html, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 30/41] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC
From: David Woodhouse @ 2026-05-20 23:45 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-31-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1554 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Silently ignore attempts to switch to a paravirt sched_clock when running
> as a CoCo guest with trusted TSC.  In hand-wavy theory, a misbehaving
> hypervisor could attack the guest by manipulating the PV clock to affect
> guest scheduling in some weird and/or predictable way.  More importantly,
> reading TSC on such platforms is faster than any PV clock, and sched_clock
> is all about speed.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

And kvmclock. And Xen.

Are there *any* reasons we'd use a PV sched_clock when the TSC is
usable?

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> ---
>  arch/x86/kernel/tsc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 3c15fc10e501..ac4abfec1f05 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -283,6 +283,15 @@ bool using_native_sched_clock(void)
>  int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
>  				      void (*save)(void), void (*restore)(void))
>  {
> +	/*
> +	 * Don't replace TSC with a PV clock when running as a CoCo guest and
> +	 * the TSC is secure/trusted; PV clocks are emulated by the hypervisor,
> +	 * which isn't in the guest's TCB.
> +	 */
> +	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
> +	    boot_cpu_has(X86_FEATURE_TDX_GUEST))
> +		return -EPERM;
> +
>  	if (!stable)
>  		clear_sched_clock_stable();
>  


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 23:44 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-30-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 733 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a return code to __paravirt_set_sched_clock() so that the kernel can
> reject attempts to use a PV sched_clock without breaking the caller.  E.g.
> when running as a CoCo VM with a secure TSC, using a PV clock is generally
> undesirable.
> 
> Note, kvmclock is the only PV clock that does anything "extra" beyond
> simply registering itself as sched_clock, i.e. is the only caller that
> needs to check the new return value.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Oooh... can we use this to reject the kvmclock when we have a stable
and reliable TSC even for non-CoCo guests?

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 28/41] x86/paravirt: Mark __paravirt_set_sched_clock() as __init
From: David Woodhouse @ 2026-05-20 23:42 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-29-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 485 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Annotate __paravirt_set_sched_clock() as __init, and make its wrapper
> __always_inline to ensure sanitizers don't result in a non-inline version
> hanging around.  All callers run during __init, and changing sched_clock
> after boot would be all kinds of crazy.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-20 23:27 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-28-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> their kvmclock during CPU online.  While a redundant write to the MSR is
> technically ok, skip the registration when kvmclock is sched_clock so that
> it's somewhat obvious that kvmclock *needs* to be enabled during early
> bringup when it's being used as sched_clock.
> 
> Plumb in the BSP's resume path purely for documentation purposes.  Both
> KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> not super obvious that using KVM's hooks would be flawed.  E.g. it would
> work today, because KVM's hooks happen to run after/before timekeeping's
> hooks during suspend/resume, but that's sheer dumb luck as the order in
> which syscore_ops are invoked depends entirely on when a subsystem is
> initialized and thus registers its hooks.
> 
> Opportunsitically make the registration messages more precise to help
> debug issues where kvmclock is enabled too late.

That's a hard word to type, isn't it?

> Opportunstically WARN in kvmclock_{suspend,resume}() to harden against
> future bugs.

So is that :)

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 26/41] x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
From: David Woodhouse @ 2026-05-20 23:19 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-27-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 766 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> WARN if kvmclock is still suspended when its wallclock is read, i.e. when
> the kernel reads its persistent clock.  The wallclock subtly depends on
> the BSP's kvmclock being enabled, and returns garbage if kvmclock is
> disabled.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


Although I still hate the whole KVM wallclock thing, as the kvmclock
itself is monotonic_raw, so adding that to the wallclock epoch is kind
of wrong.

Maybe the host should updated the wallclock occasionally to keep it up
to date...


Or maybe the guest should prefer the KVM_HC_CLOCK_PAIRING hypercall if
it exists, over kvm-wallclock.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 25/41] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-20 23:06 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-26-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 409 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Save/restore kvmclock across suspend/resume via clocksource hooks when
> kvmclock isn't being used for sched_clock.  This will allow using kvmclock
> as a clocksource (or for wallclock!) without also using it for sched_clock.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 25/41] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-20 23:01 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-26-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 409 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Save/restore kvmclock across suspend/resume via clocksource hooks when
> kvmclock isn't being used for sched_clock.  This will allow using kvmclock
> as a clocksource (or for wallclock!) without also using it for sched_clock.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-05-20 22:54 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <93f8fad19e2c617ca17f544d11483c517024bfc4.camel@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Wed, 2026-05-20 at 23:11 +0100, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Nullify the x86_platform sched_clock save/restore hooks when setting up
> > Xen's PV clock to make it somewhat obvious the hooks aren't used when
> > running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Hm. Xen HVM guests *do* use a standard ACPI S4 hibernation flow.
> 
> Does that need testing?

Ah, this looks as sane as we can expect.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> 
> > ---
> >  arch/x86/xen/time.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 3d3165eef821..21d366d01985 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
> >  	xen_sched_clock_offset = xen_clocksource_read();
> >  	static_call_update(pv_steal_clock, xen_steal_clock);
> >  	paravirt_set_sched_clock(xen_sched_clock);
> > +	/*
> > +	 * Xen has paravirtualized suspend/resume and so doesn't use the common
> > +	 * x86 sched_clock save/restore hooks.
> > +	 */
> > +	x86_platform.save_sched_clock_state = NULL;
> > +	x86_platform.restore_sched_clock_state = NULL;
> >  
> >  	tsc_register_calibration_routines(xen_tsc_khz, NULL);
> >  	x86_platform.get_wallclock = xen_get_wallclock;
> 


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 24/41] timekeeping: Resume clocksources before reading persistent clock
From: David Woodhouse @ 2026-05-20 22:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-25-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When resuming timekeeping after suspend, restore clocksources prior to
> reading the persistent clock.  Paravirt clocks, e.g. kvmclock, tie the
> validity of a PV persistent clock to a clocksource, i.e. reading the PV
> persistent clock will return garbage if the underlying PV clocksource
> hasn't been enabled.  The flaw has gone unnoticed because kvmclock is a
> mess and uses its own suspend/resume hooks instead of the clocksource
> suspend/resume hooks, which happens to work by sheer dumb luck (the
> kvmclock resume hook runs before timekeeping_resume()).
> 
> Note, there is no evidence that any clocksource supported by the kernel
> depends on a persistent clock.
> 
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 23/41] x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during kvmclock_init()
From: David Woodhouse @ 2026-05-20 22:46 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-24-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 660 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Clean up the setting of PVCLOCK_TSC_STABLE_BIT during kvmclock init to
> make it somewhat obvious that pvclock_read_flags() must be called *after*
> pvclock_set_flags().
> 
> Note, in theory, a different PV clock could have set PVCLOCK_TSC_STABLE_BIT
> in the supported flags, i.e. reading flags only if
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is set could very, very theoretically
> result in a change in behavior.  In practice, the kernel only supports a
> single PV clock.

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox