Linux-HyperV List
* [PATCH 1/1] mshv: Add conditional VMBus dependency
From: Michael Kelley @ 2026-05-21 16:49 UTC
  To: kys, haiyangz, wei.liu, decui, longli, jloeser, linux-hyperv
  Cc: linux-kernel, arnd, hamzamahfooz

From: Michael Kelley <mhklinux@outlook.com>

When the VMBus driver is not part of the kernel (CONFIG_HYPERV_VMBUS=n),
the MSHV root driver fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
  VMBus driver. In such case, the MSHV root driver can be built-in
  to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
  MSHV root driver and VMBus driver can both be built-in, or they can
  both be separate modules. Or the MSHV root driver can be a module
  while the VMBus driver can be built-in, but the reverse is
  disallowed. Regardless of the build choices, the VMBus driver must
  be loaded before the MSHV driver in order for the SynIC to be
  managed properly (see comments in the MSHV SynIC code).

The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
  HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
  built-in when HYPERV_VMBUS is a module, but without requiring that
  HYPERV_VMBUS be built.
* Add #ifdefs around MSHV SynIC calls to hv_vmbus_exists(). When
  the VMBus driver is present, these calls establish a module
  dependency to ensure that the VMBus driver loads first when both
  are built as modules. But if the VMBus driver is not present,
  the behavior is as if hv_vmbus_exists() returned "false", and
  there is no module dependency.
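
As a rough illustration of the Kconfig rule in the first part, the
following userspace C sketch evaluates the dependency with Kconfig's
tristate rules (n < m < y, "||" takes the maximum, "!x" is 2 - x). It
assumes the conditional form behaves like the older idiom
"depends on HYPERV_VMBUS || !HYPERV_VMBUS"; the enum and function
names are invented for this sketch and are not kernel code.

```c
#include <assert.h>

/* Kconfig tristate values, ordered n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Kconfig "!" on a tristate: 2 - value, so !m stays m. */
static enum tristate tri_not(enum tristate a)
{
	assert(a <= TRI_Y);
	return (enum tristate)(2 - a);
}

/* Kconfig "||" on tristates: the larger of the two values. */
static enum tristate tri_or(enum tristate a, enum tristate b)
{
	return a > b ? a : b;
}

/*
 * Upper limit the dependency places on MSHV_ROOT, under the stated
 * assumption that it evaluates as HYPERV_VMBUS || !HYPERV_VMBUS.
 */
static enum tristate mshv_root_limit(enum tristate vmbus)
{
	return tri_or(vmbus, tri_not(vmbus));
}
```

With HYPERV_VMBUS=m the limit is m, so MSHV_ROOT=y is rejected. With
HYPERV_VMBUS=n or HYPERV_VMBUS=y the limit is y, so MSHV_ROOT can be
built-in or a module.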

Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_initcall() when built-in, and device_initcall() is
initcall level 6.
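
The ordering argument can be pictured with a small userspace sketch:
built-in initcalls run in ascending level order, so a level 4 entry
runs before a level 6 entry regardless of the order in which they were
registered. The struct, constants, and function below are hypothetical
and only model the ordering; they are not the kernel's initcall
machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Level numbers as stated in the text: subsys = 4, device = 6. */
enum { SKETCH_LEVEL_SUBSYS = 4, SKETCH_LEVEL_DEVICE = 6, SKETCH_LEVEL_MAX = 7 };

struct sketch_initcall {
	const char *name;	/* label only, for readability */
	int level;		/* initcall level, 0..SKETCH_LEVEL_MAX */
};

/*
 * Write into order[] the indices of calls[] in the order they would
 * run: ascending level, registration order within a level.
 * Returns the number of indices written.
 */
static size_t sketch_run_order(const struct sketch_initcall *calls,
			       size_t n, size_t *order)
{
	size_t out = 0;

	assert(calls && order);
	for (int level = 0; level <= SKETCH_LEVEL_MAX; level++)
		for (size_t i = 0; i < n; i++)
			if (calls[i].level == level)
				order[out++] = i;
	return out;
}
```

Registering an "mshv_root" entry at SKETCH_LEVEL_DEVICE ahead of an
"hv_vmbus" entry at SKETCH_LEVEL_SUBSYS still yields the VMBus entry
first in order[].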

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/all/20260520074044.923728-1-arnd@kernel.org/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 drivers/hv/Kconfig      |  1 +
 drivers/hv/mshv_synic.c | 12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 2d0b3fcb0ff8..aa11bcefddf2 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -74,6 +74,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS if HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 88170ce6b83f..3f72a3dd232d 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -463,11 +463,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 			&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 			&spages->synic_event_ring_page;
+	bool vmbus_active = false;
+
 	/*
 	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
 	 * See hv_hyp_synic_enable_regs() for that initialization.
 	 */
-	bool vmbus_active = hv_vmbus_exists();
+#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
+	vmbus_active = hv_vmbus_exists();
+#endif
 
 	/*
 	 * Map the SYNIC message page. When VMBus is not active the
@@ -587,8 +591,12 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 		&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 		&spages->synic_event_ring_page;
+	bool vmbus_active = false;
+
 	/* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
-	bool vmbus_active = hv_vmbus_exists();
+#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
+	vmbus_active = hv_vmbus_exists();
+#endif
 
 	/* Disable the interrupt */
 	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
-- 
2.25.1


* RE: [PATCH] mshv: add vmbus dependency
From: Michael Kelley @ 2026-05-21 15:56 UTC
  To: Jork Loeser, Arnd Bergmann
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Stanislav Kinsburskii,
	Arnd Bergmann, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <52a29c5-715e-8ea-af1-dafebfca7a84@linux.microsoft.com>

From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:16 AM
> 
> On Wed, 20 May 2026, Arnd Bergmann wrote:
> 
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > When the vmbus driver is not part of the kernel, the mshv_root
> > driver now fails to link:
> >
> > ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
> >
> > Avoid this by adding an explicit Kconfig dependency. Note that
> > stubbing out the hv_vmbus_exists() based on configuration would
> > also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
> >
> > Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> > drivers/hv/Kconfig | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > index 52af086fdeb2..21193b571a80 100644
> > --- a/drivers/hv/Kconfig
> > +++ b/drivers/hv/Kconfig
> > @@ -75,6 +75,7 @@ config MSHV_ROOT
> > 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > 	# no particular order, making it impossible to reassemble larger pages
> > 	depends on PAGE_SIZE_4KB
> > +	depends on HYPERV_VMBUS
> > 	select EVENTFD
> > 	select VIRT_XFER_TO_GUEST_WORK
> > 	select HMM_MIRROR
> > --
> > 2.39.5
> >
> 
> Yes, this is the right short-term fix. We will need to solve the root case
> (no VMBUS required) with a separate SYNIC driver abstraction.
> 
> Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>
> 

I have what I think is a better way to fix this. It preserves the
ability to build MSHV without VMBus, while also guaranteeing
that VMBus loads first when present. And it is relatively simple --
hv_vmbus_exists() does not need to be moved out of the
VMBus module. Later today I'll post a separate patch for
consideration.

The separate SynIC driver abstraction can still come later
and improve things further.

Michael

* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-21 15:45 UTC
  To: Jacob Pan, Michael Kelley
  Cc: Yu Zhang, Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <20260520134027.00005e91@linux.microsoft.com>

From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Wednesday, May 20, 2026 1:40 PM
> 
> Hi Michael,
> 
> On Wed, 20 May 2026 19:26:24 +0000
> Michael Kelley <mhklinux@outlook.com> wrote:
> 
> > From: Michael Kelley <mhklinux@outlook.com> To: Yu Zhang <zhangyu1@linux.microsoft.com>, Jason Gunthorpe
> >
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > >
> > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > > +					  unsigned long start,
> > > > > +					  unsigned long end)
> > > > > +{
> > > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > > > +		else
> > > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > > +
> > > > > +		size_align = __fls(nr_pages);
> > > > > +		order = min(pfn_align, size_align);
> > > > > +		iova_list[count].page_mask_shift = order;
> > > > > +		iova_list[count].page_number = start_pfn;
> > > > > +
> > > > > +		flush_pages = 1UL << order;
> > > > > +		start_pfn += flush_pages;
> > > > > +		nr_pages -= flush_pages;
> > > > > +		count++;
> > > > > +	}
> > > >
> > > > This seems like a really silly hypervisor interface. Why doesn't
> > > > it just accept a normal range? Splitting it into power of two
> > > > aligned ranges is very inefficient.
> > >
> > > Fair point. I'm not sure how much flexibility we have to change
> > > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guests. On the other
> > > hand, having the guest specify 2^N-aligned blocks does save the
> > > hypervisor from having to decompose ranges itself before issuing
> > > hardware invalidation commands - the guest-provided entries can be
> > > fed to the HW more or less directly.
> > >
> > > That said, the way I'm currently using this interface may be
> > > more precise than necessary. Maybe we have 2 options:
> > >
> > > 1) Current approach: decompose the range into multiple exact
> > >    2^N-aligned blocks with no over-flush, but at the cost of
> > >    more complex calculations and more entries.
> > >
> > > 2) Follow what Intel/AMD drivers do: find a single minimal
> > >    2^N-aligned block that covers the entire range, but may
> > >    over-flush.
> > >
> > > Any preference?
> > >
> > > @Michael, since you've also been reviewing this patch, I'd
> > > appreciate your thoughts on the above as well. :)
> > >
> >
> > I'm just guessing, but perhaps flushing an aligned power-of-2
> > range can be processed by the hypervisor at a relatively fixed
> > cost, regardless of the size. Having the guest do the decomposing
> > of an arbitrary range allows the hypervisor to make use of the
> > existing "rep" hypercall mechanism if the hypercall is taking
> > "too long". The hypervisor can pause its processing, return to
> > the guest temporarily, and then continue the hypercall. If the
> > arbitrary range were passed into the hypercall for the hypervisor
> > to do the decomposing, that pause-and-restart mechanism
> > wouldn't be available.
> >
> > Of course, Linux doesn't really take advantage of the pause to
> > reduce guest interrupt latency because the Hyper-V code in
> > Linux typically disables interrupts around a hypercall due to the
> > way the hypercall input page is allocated. But other guest
> > operating systems might benefit from such a pause. And we could
> > probably fix the Hyper-V code in Linux to allow interrupts during a
> > hypercall pause/restart if long-running hypercalls turn out to be
> > a problem.

> I am not sure if this pause feature is suitable for IOTLB flush at all
> since it is inherently synchronous — the caller must block until all
> invalidations complete. Pausing mid-flush to return to the guest
> doesn't help if the guest can't make forward progress anyway.

I agree that hypercall pause/resume doesn't help with
forward progress. But it could help with interrupt latency in the
guest if the hypercall executes with interrupts enabled in the
guest. During the pause when control returns to the guest,
the guest could take an interrupt, versus the interrupt having
to wait until the entire hypercall completes. And if preemption
is enabled in the guest thread executing the hypercall, the thread
could be descheduled, potentially improving scheduling latency.

At least that's my understanding of why Hyper-V has this pause/
resume mechanism for "rep" hypercalls. :-)

Michael
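
A minimal userspace sketch of the pause/resume idea: a simulated "rep"
hypercall processes only a bounded number of elements per entry and
reports how many are complete, and the guest-side loop re-enters from
that point. The per-entry budget, the function names, and the return
convention are all assumptions made for illustration; this is not the
Hyper-V ABI or the Linux helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical budget: reps the simulated hypervisor handles per entry. */
#define SIM_REPS_PER_ENTRY 4

/*
 * Simulated rep hypercall: resume at rep_start, process at most
 * SIM_REPS_PER_ENTRY elements, return the total completed so far.
 */
static uint16_t sim_rep_hypercall(uint16_t rep_count, uint16_t rep_start)
{
	uint16_t done = rep_start;
	uint16_t budget = SIM_REPS_PER_ENTRY;

	assert(rep_start <= rep_count);
	while (done < rep_count && budget > 0) {
		done++;		/* one element processed */
		budget--;
	}
	return done;
}

/*
 * Guest-side loop: re-enter from the last completed rep until all reps
 * are done. Returns how many times the hypercall was entered.
 */
static unsigned int sim_do_rep_hypercall(uint16_t rep_count)
{
	uint16_t rep_comp = 0;
	unsigned int entries = 0;

	do {
		rep_comp = sim_rep_hypercall(rep_count, rep_comp);
		entries++;
		/* Control is back in the guest here between entries. */
	} while (rep_comp < rep_count);

	return entries;
}
```

With a budget of 4, a 10-rep request takes three entries, which gives
two points where control is back in the guest. Those are the points
where a guest running with interrupts enabled could take an interrupt.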


* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-21 15:39 UTC
  To: Yu Zhang, Michael Kelley
  Cc: Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <pxod76qh3jtpvnxdlflvntc5svqgibaeu6tywn2ejrlnea65w3@djehcr3vidnk>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Thursday, May 21, 2026 7:34 AM
> 
> On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > >
> > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > > +					  unsigned long start,
> > > > > +					  unsigned long end)
> > > > > +{
> > > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > > > +		else
> > > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > > +
> > > > > +		size_align = __fls(nr_pages);
> > > > > +		order = min(pfn_align, size_align);
> > > > > +		iova_list[count].page_mask_shift = order;
> > > > > +		iova_list[count].page_number = start_pfn;
> > > > > +
> > > > > +		flush_pages = 1UL << order;
> > > > > +		start_pfn += flush_pages;
> > > > > +		nr_pages -= flush_pages;
> > > > > +		count++;
> > > > > +	}
> > > >
> > > > This seems like a really silly hypervisor interface. Why doesn't it
> > > > just accept a normal range? Splitting it into power of two aligned
> > > > ranges is very inefficient.
> > >
> > > Fair point. I'm not sure how much flexibility we have to change
> > > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guests. On the other
> > > hand, having the guest specify 2^N-aligned blocks does save the
> > > hypervisor from having to decompose ranges itself before issuing
> > > hardware invalidation commands - the guest-provided entries can be
> > > fed to the HW more or less directly.
> > >
> > > That said, the way I'm currently using this interface may be
> > > more precise than necessary. Maybe we have 2 options:
> > >
> > > 1) Current approach: decompose the range into multiple exact
> > >    2^N-aligned blocks with no over-flush, but at the cost of
> > >    more complex calculations and more entries.
> > >
> > > 2) Follow what Intel/AMD drivers do: find a single minimal
> > >    2^N-aligned block that covers the entire range, but may
> > >    over-flush.
> > >
> > > Any preference?
> > >
> > > @Michael, since you've also been reviewing this patch, I'd
> > > appreciate your thoughts on the above as well. :)
> > >
> >
> > I'm just guessing, but perhaps flushing an aligned power-of-2
> > range can be processed by the hypervisor at a relatively fixed
> > cost, regardless of the size. Having the guest do the decomposing
> > of an arbitrary range allows the hypervisor to make use of the
> > existing "rep" hypercall mechanism if the hypercall is taking
> > "too long". The hypervisor can pause its processing, return to
> > the guest temporarily, and then continue the hypercall. If the
> > arbitrary range were passed into the hypercall for the hypervisor
> > to do the decomposing, that pause-and-restart mechanism
> > wouldn't be available.
> >
> > Of course, Linux doesn't really take advantage of the pause to
> > reduce guest interrupt latency because the Hyper-V code in
> > Linux typically disables interrupts around a hypercall due to the
> > way the hypercall input page is allocated. But other guest
> > operating systems might benefit from such a pause. And we could
> > probably fix the Hyper-V code in Linux to allow interrupts during a
> > hypercall pause/restart if long-running hypercalls turn out to be
> > a problem.
> >
> > Regarding proposal (1) vs. (2), perhaps you could confirm with
> > the Hyper-V team that flushing an aligned power-of-2 range
> > has relatively fixed cost, regardless of the size. And what do the
> > flush requests coming from the generic IOMMU subsystem look
> > like? Do they match dma_unmap() ranges, which are probably
> > dominated by relatively small ranges of a few pages at most,
> > with a few outliers for disk I/O requests of 1 MiB or some such?
> > If the dominant flush request is for a few pages at most, then
> > doing (2) seems reasonable.
> 
> Thanks for the thoughtful suggestions, Michael!
> 
> I believe the time might be dominated by the number of descriptors,
> instead of the size of each range, especially when device TLB
> invalidations are involved.
> 
> Here's my understanding of what the hypervisor does in its handler:
> 
> Hyper-V constructs one IOTLB invalidation descriptor (and possibly
> a Device TLB invalidation descriptor as well) per iova_list entry
> and submits them to the HW invalidation queue, then synchronously
> waits for completion. So multiple 2^N-aligned entries should be less
> efficient than a single larger 2^N aligned one.

Agreed. The hypercall time should be roughly linear in the number
of descriptors.

If the approach is to do a precise flush, my argument is that
it is better for the guest to construct the 2^N aligned descriptors
instead of having the host do it. In the former case, the hypercall
can do the pause/resume thing, which provides the opportunity
to reduce interrupt latency in the guest. In the latter case, it cannot.

But my argument is moot if you do Option 2. And I'm fine with
Option 2 if the assumptions about it are true.

Michael

> 
> Since both options submit 2^N-aligned entries to the hypervisor,
> either one single coarser-grained entry or a precise decomposition,
> I'm now also leaning towards option 2, which is also what Intel/AMD
> drivers do for page-selective IOTLB flush. Simpler guest code, faster
> flush, and the hypervisor can feed the single entry almost directly
> to HW.
> 
> Yu

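To make the descriptor count of the precise approach (option 1)
concrete, here is a userspace port of the quoted decomposition loop,
working directly on page frame numbers. The struct, the helper names,
and the -1 overflow return are stand-ins chosen for this sketch; the
quoted patch uses union hv_iommu_flush_va and
HV_IOMMU_FLUSH_VA_OVERFLOW instead.

```c
#include <assert.h>
#include <limits.h>

struct sketch_flush_entry {
	unsigned long page_number;	/* first PFN of the block */
	unsigned int order;		/* block covers 2^order pages */
};

/* Index of the lowest set bit; x must be non-zero. */
static unsigned int sketch_lowest_bit(unsigned long x)
{
	unsigned int n = 0;

	assert(x != 0);
	while (!(x & 1UL)) {
		x >>= 1;
		n++;
	}
	return n;
}

/* Index of the highest set bit; x must be non-zero. */
static unsigned int sketch_highest_bit(unsigned long x)
{
	unsigned int n = 0;

	assert(x != 0);
	while (x >>= 1)
		n++;
	return n;
}

/*
 * Split [start_pfn, start_pfn + nr_pages) into naturally aligned
 * power-of-two blocks. Returns the entry count, or -1 if more than
 * max entries would be needed.
 */
static int sketch_decompose(unsigned long start_pfn, unsigned long nr_pages,
			    struct sketch_flush_entry *out, int max)
{
	int count = 0;

	while (nr_pages > 0) {
		unsigned int pfn_align, size_align, order;

		if (count >= max)
			return -1;

		/* Largest block the start address is aligned to. */
		pfn_align = start_pfn ? sketch_lowest_bit(start_pfn)
				      : (unsigned int)(sizeof(long) * CHAR_BIT - 1);
		/* Largest block that still fits in the remaining pages. */
		size_align = sketch_highest_bit(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		out[count].page_number = start_pfn;
		out[count].order = order;
		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
		count++;
	}
	return count;
}
```

For PFNs 3 through 10 (8 pages) this produces four entries: PFN 3
order 0, PFN 4 order 2, PFN 8 order 1, and PFN 10 order 0. An aligned
request such as PFN 8 for 8 pages produces a single entry.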

* Re: [PATCH net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
From: patchwork-bot+netdevbpf @ 2026-05-21 15:20 UTC
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, ernis, gargaditya,
	kees, stephen, shacharr, ssengar, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260520051553.857120-1-gargaditya@linux.microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 19 May 2026 22:15:53 -0700 you wrote:
> In mana_hwc_rx_event_handler(), rx_req_idx is derived from
> sge->address in DMA-coherent memory. In Confidential VMs
> (SEV-SNP/TDX), this memory is shared unencrypted and HW can modify
> WQE contents at any time. No bounds check exists on rx_req_idx,
> which can lead to an out-of-bounds access into reqs[].
> 
> Add bounds check on rx_req_idx in mana_hwc_rx_event_handler() before
> using it to index the reqs[] array.
> 
> [...]

Here is the summary with links:
  - [net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
    https://git.kernel.org/netdev/net/c/b809d0409991

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-21 14:34 UTC
  To: Michael Kelley
  Cc: Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C1EC7F5F69C5ABDA9C7FD4012@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't it
> > > just accept a normal range? Splitting it into power of two aligned
> > > ranges is very inefficient.
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guests. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> > 
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disables interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
> 
> Regarding proposal (1) vs. (2), perhaps you could confirm with
> the Hyper-V team that flushing an aligned power-of-2 range
> has relatively fixed cost, regardless of the size. And what do the
> flush requests coming from the generic IOMMU subsystem look
> like? Do they match dma_unmap() ranges, which are probably
> dominated by relatively small ranges of a few pages at most,
> with a few outliers for disk I/O requests of 1 MiB or some such?
> If the dominant flush request is for a few pages at most, then
> doing (2) seems reasonable.

Thanks for the thoughtful suggestions, Michael!

I believe the time might be dominated by the number of descriptors,
instead of the size of each range, especially when device TLB
invalidations are involved.

Here's my understanding of what the hypervisor does in its handler:

Hyper-V constructs one IOTLB invalidation descriptor (and possibly
a Device TLB invalidation descriptor as well) per iova_list entry
and submits them to the HW invalidation queue, then synchronously
waits for completion. So multiple 2^N-aligned entries should be less
efficient than a single larger 2^N aligned one.

Since both options submit 2^N-aligned entries to the hypervisor,
either one single coarser-grained entry or a precise decomposition,
I'm now also leaning towards option 2, which is also what Intel/AMD
drivers do for page-selective IOTLB flush. Simpler guest code, faster
flush, and the hypervisor can feed the single entry almost directly
to HW.

Yu
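
One possible way to compute the single covering block for option 2 is
sketched below in userspace C: the order is the number of low bits in
which the first and last PFN can differ, and the base is the first PFN
with those bits cleared. This is only an illustration under that
assumption, not necessarily how the Intel/AMD drivers or a future
version of this patch compute it; the struct and function names are
hypothetical.

```c
#include <assert.h>
#include <limits.h>

struct sketch_flush_block {
	unsigned long page_number;	/* first PFN of the block */
	unsigned int order;		/* block covers 2^order pages */
};

/*
 * Smallest naturally aligned power-of-two block that contains every
 * PFN in [first_pfn, last_pfn] (both inclusive).
 */
static struct sketch_flush_block sketch_covering_block(unsigned long first_pfn,
							unsigned long last_pfn)
{
	struct sketch_flush_block blk;
	unsigned long diff;
	unsigned int order = 0;

	assert(first_pfn <= last_pfn);

	/* Count the low bits in which the two PFNs may differ. */
	diff = first_pfn ^ last_pfn;
	while (diff) {
		diff >>= 1;
		order++;
	}

	blk.order = order;
	/* Clear the low 'order' bits; avoid an out-of-range shift. */
	if (order < sizeof(unsigned long) * CHAR_BIT)
		blk.page_number = first_pfn & ~((1UL << order) - 1);
	else
		blk.page_number = 0;
	return blk;
}
```

For PFNs 3 through 10 the result is one block at PFN 0 of order 4,
that is, 16 pages flushed for 8 requested. That is the over-flush
traded for a single descriptor. An already aligned range such as PFNs
8 through 15 maps to PFN 8, order 3, with no over-flush.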

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-21 14:13 UTC
  To: Sean Christopherson, Peter Zijlstra
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Juergen Gross,
	Daniel Lezcano, Thomas Gleixner, John Stultz, Rick Edgecombe,
	Vitaly Kuznetsov, Broadcom internal kernel review list,
	Boris Ostrovsky, Stephen Boyd, x86, linux-coco, kvm, linux-hyperv,
	virtualization, linux-kernel, xen-devel, Michael Kelley,
	Tom Lendacky, Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8K2FRGcoEa-D2Y@google.com>

On Thu, 2026-05-21 at 06:38 -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, Peter Zijlstra wrote:
> > On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > > On Thu, May 21, 2026, David Woodhouse wrote:
> > > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > > bringup when it's being used as sched_clock.
> > > > > 
> > > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > > initialized and thus registers its hooks.
> > > > > 
> > > > > Opportunsitically make the registration messages more precise to help
> > > > > debug issues where kvmclock is enabled too late.
> > > > 
> > > > That's a hard word to type, isn't it?
> > > 
> > > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > > while, it is time...
> > 
> > 'z=' not good enough?
> 
> You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

I reached the peak of my vi knowledge in about 1995 when I learned that
I could log in on another terminal, kill it from there, and then set
EDITOR=emacs.

Ironically I still find myself doing that kind of thing when I'm
composing a git-send-email cover letter and decide I don't want to send
the series as-is at all. Maybe there's a way to put a poison pill in
the message (or save it unchanged?) to make git *NOT* send anything...
but I always err on the side of caution and just nuke it from orbit, or
at least from another terminal.



* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 13:38 UTC
  To: Peter Zijlstra
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260521131019.GI3126523@noisy.programming.kicks-ass.net>

On Thu, May 21, 2026, Peter Zijlstra wrote:
> On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > On Thu, May 21, 2026, David Woodhouse wrote:
> > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > bringup when it's being used as sched_clock.
> > > > 
> > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > initialized and thus registers its hooks.
> > > > 
> > > > Opportunsitically make the registration messages more precise to help
> > > > debug issues where kvmclock is enabled too late.
> > > 
> > > That's a hard word to type, isn't it?
> > 
> > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > while, it is time...
> 
> 'z=' not good enough?

You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Peter Zijlstra @ 2026-05-21 13:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8Bpc_uVNrNWqfX@google.com>

On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, David Woodhouse wrote:
> > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > bringup when it's being used as sched_clock.
> > > 
> > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > initialized and thus registers its hooks.
> > > 
> > > Opportunsitically make the registration messages more precise to help
> > > debug issues where kvmclock is enabled too late.
> > 
> > That's a hard word to type, isn't it?
> 
> Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> while, it is time...

'z=' not good enough?


^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 12:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <423b37f056f0d4d596d5f4cc73802fb1079ecf63.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > their kvmclock during CPU online.  While a redundant write to the MSR is
> > technically ok, skip the registration when kvmclock is sched_clock so that
> > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > bringup when it's being used as sched_clock.
> > 
> > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > work today, because KVM's hooks happen to run after/before timekeeping's
> > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > which syscore_ops are invoked depends entirely on when a subsystem is
> > initialized and thus registers its hooks.
> > 
> > Opportunsitically make the registration messages more precise to help
> > debug issues where kvmclock is enabled too late.
> 
> That's a hard word to type, isn't it?

Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
while, it is time...

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-21 12:27 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	mhklinux, tgopinath, easwar.hariharan
In-Reply-To: <20260520112708.00003640@linux.microsoft.com>

On Wed, May 20, 2026 at 11:27:08AM -0700, Jacob Pan wrote:
> Hi Yu,
> 
> On Wed, 20 May 2026 23:25:43 +0800
> Yu Zhang <zhangyu1@linux.microsoft.com> wrote:
> 
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_identity_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};
> > > > +
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_blocking_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};  
> > > 
> > > Usually I would expect these to have their own attach
> > > functions. blocking in particular must have an attach op that cannot
> > > fail. It is used to recover the device back to a known translation
> > > in case of cascading other errors.
> > >   
> > 
> > For the blocking domain, the hypercall handler for such an attach
> > essentially disables translation and IOPF for the device.
> I think this should disable all faults, including unrecoverable fault
> reporting, right?

Yes, it should (e.g., my understanding is that it sets FPD in the
scalable-mode context entry). I will double-check with the hypervisor team
and make sure it behaves that way.

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Dongli Zhang @ 2026-05-21  9:14 UTC (permalink / raw)
  To: Sean Christopherson, kvm
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>



On 2026-05-15 12:19 PM, Sean Christopherson wrote:
> Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.  I.e. use the same
> criteria as tweaking the clocksource rating so that TSC is preferred over
> kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> is more tolerant of slop than clocksource; using TSC for clocksource but
> not sched_clock makes little to no sense, especially now that KVM CoCo
> guests with a trusted TSC use TSC, not kvmclock.
> 
>         /*
>          * Fall back to jiffies if there's no TSC available:
>          * ( But note that we still use it if the TSC is marked
>          *   unstable. We do this because unlike Time Of Day,
>          *   the scheduler clock tolerates small errors and it's
>          *   very important for it to be as fast as the platform
>          *   can achieve it. )
>          */
> 
> The only advantage of using kvmclock is that doing so allows for early
> and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> broken for over two years with nary a complaint, i.e. it can't be
> _that_ valuable.  And as above, certain types of KVM guests are losing
> the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> needs to be decoupled from sched_clock() no matter what.

Has it been broken for two years because of pvclock_clocksource_read_nowd()?

Thank you very much!

Dongli Zhang

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Jakub Kicinski @ 2026-05-21  0:17 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

On Mon, 18 May 2026 12:43:51 -0700 Dipayaan Roy wrote:
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);
>  	if (err) {
>  		netdev_err(ndev, "mana_detach failed: %d\n", err);
>  		goto dealloc_pre_rxbufs;
>  	}
>  
> +attach:

gotos are acceptable for error unwinding, not for jumping around
a function, seemingly to avoid indenting something. Please use
normal constructs, or perhaps move the netif_device_present()
check into mana_detach() as an early exit condition?

>  	err = mana_attach(ndev);
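
For illustration only, the second option (an early exit inside the detach
helper) could look roughly like the user-space sketch below. The names
fake_netdev, fake_detach(), fake_attach() and queue_reset() are hypothetical
stand-ins, not the mana driver's code, and the real handler's unwinding of
pre-allocated RX buffers is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct net_device; only tracks presence. */
struct fake_netdev {
	bool present;
	int detach_calls;	/* counts how often real teardown work ran */
};

/* Early exit: nothing to tear down if the device is already detached. */
static int fake_detach(struct fake_netdev *ndev)
{
	if (!ndev->present)
		return 0;

	ndev->detach_calls++;
	ndev->present = false;
	return 0;
}

static int fake_attach(struct fake_netdev *ndev)
{
	ndev->present = true;
	return 0;
}

/* Reset handler with straight-line flow and no forward goto. */
static int queue_reset(struct fake_netdev *ndev)
{
	int err;

	err = fake_detach(ndev);
	if (err)
		return err;	/* the real code would unwind its buffers here */

	return fake_attach(ndev);
}
```

With this shape, a retry after a failed attach goes through the same path as
a first attempt, and the "already detached" case is handled in one place.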

^ permalink raw reply

* Re: [PATCH v3 39/41] x86/paravirt: Move using_native_sched_clock() stub into timer.h
From: David Woodhouse @ 2026-05-21  0:00 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-40-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that timer.h ended up with CONFIG_PARAVIRT #ifdeffery anyways, move the
> PARAVIRT=n using_native_sched_clock() stub into timer.h as a "free"
> optimization.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

^ permalink raw reply

* Re: [PATCH v3 38/41] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
From: David Woodhouse @ 2026-05-20 23:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-39-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Rework the seemingly generic x86_cpuinit_ops.early_percpu_clock_init hook
> into a dedicated PV sched_clock hook, as the only reason the hook exists
> is to allow kvmclock to enable its PV clock on secondary CPUs before the
> kernel tries to reference sched_clock, e.g. when grabbing a timestamp for
> printk.
> 
> Rearranging the hook doesn't exactly reduce complexity; arguably it does
> the opposite.  But as-is, it's practically impossible to understand *why*
> kvmclock needs to do early configuration.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: David Woodhouse @ 2026-05-20 23:56 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.  I.e. use the same
> criteria as tweaking the clocksource rating so that TSC is preferred over
> kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> is more tolerant of slop than clocksource; using TSC for clocksource but
> not sched_clock makes little to no sense, especially now that KVM CoCo
> guests with a trusted TSC use TSC, not kvmclock.
> 
>         /*
>          * Fall back to jiffies if there's no TSC available:
>          * ( But note that we still use it if the TSC is marked
>          *   unstable. We do this because unlike Time Of Day,
>          *   the scheduler clock tolerates small errors and it's
>          *   very important for it to be as fast as the platform
>          *   can achieve it. )
>          */
> 
> The only advantage of using kvmclock is that doing so allows for early
> and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> broken for over two years with nary a complaint, i.e. it can't be
> _that_ valuable.  And as above, certain types of KVM guests are losing
> the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> needs to be decoupled from sched_clock() no matter what.
> 
> Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Yay! (Albeit only for sched_clock, and we should do Xen too)

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

^ permalink raw reply

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: Woodhouse, David @ 2026-05-20 23:55 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-37-seanjc@google.com>


On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When running as a KVM guest with kvmclock support enabled, stuff the APIC
> timer period/frequency with the local APIC bus frequency reported in
> CPUID.0x40000010.EBX instead of trying to calibrate/guess the frequency.
> 
> See Documentation/virt/kvm/x86/cpuid.rst for details.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I still don't much like the way this is done inside kvm_get_tsc_khz().

We also probably ought to be looking for the timing leaf on other
hypervisors including VMware and probably Bhyve too. Should it be done
somewhere else?
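
As a rough illustration of the arithmetic involved (not the patch's code):
assuming EBX of the timing leaf reports the bus frequency in kHz, as I read
cpuid.rst, turning it into bus cycles per tick is a single scaling step. The
helper name apic_cycles_per_tick() and its hz parameter (standing in for the
kernel's HZ) are hypothetical.

```c
#include <stdint.h>

/*
 * Convert a bus frequency in kHz (assumed unit of CPUID.0x40000010.EBX)
 * into bus clock cycles per kernel tick. Returns 0 when either input is 0
 * so that a caller can fall back to calibration.
 */
static uint64_t apic_cycles_per_tick(uint32_t bus_khz, uint32_t hz)
{
	if (!bus_khz || !hz)
		return 0;

	return (uint64_t)bus_khz * 1000 / hz;
}
```

Since the helper only needs the raw register value, the same conversion would
apply wherever the leaf ends up being parsed.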



^ permalink raw reply

* Re: [PATCH v3 33/41] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-05-20 23:51 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-34-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Mark the TSC as reliable if the hypervisor (KVM) has enumerated the TSC
> as constant and nonstop, and the admin hasn't explicitly marked the TSC
> as unstable.  Like most (all?) virtualization setups, any secondary
> clocksource that's used as a watchdog is guaranteed to be less reliable
> than a constant, nonstop TSC, as all clocksources the kernel uses as a
> watchdog are all but guaranteed to be emulated when running as a KVM
> guest.  I.e. any observed discrepancies between the TSC and watchdog will
> be due to jitter in the watchdog.
> 
> This is especially true for KVM, as the watchdog clocksource is usually
> emulated in host userspace, i.e. reading the clock incurs a roundtrip
> cost of thousands of cycles.
> 
> Marking the TSC reliable addresses a flaw where the TSC will occasionally
> be marked unstable if the host is under moderate/heavy load.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

^ permalink raw reply

* Re: [PATCH v3 32/41] x86/tsc: Rejects attempts to override TSC calibration with lesser routine
From: David Woodhouse @ 2026-05-20 23:50 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-33-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When registering a TSC frequency calibration routine, sanity check that
> the incoming routine is as robust as the outgoing routine, and reject the
> incoming routine if the sanity check fails.
> 
> Because native calibration routines only mark the TSC frequency as known
> and reliable when they actually run, the effective progression of
> capabilities is: None (native) => Known and maybe Reliable (PV) =>
> Known and Reliable (CoCo).  Violating that progression for a PV override
> is relatively benign, but messing up the progression when CoCo is
> involved is more problematic, as it likely means a trusted source of
> information (hardware/firmware) is being discarded in favor of a less
> trusted source (hypervisor).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
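
A minimal sketch of the kind of check the changelog describes, assuming the
properties are tracked as bit flags. The flag names and the helper
tsc_override_allowed() are hypothetical and may not match the identifiers
used in the series.

```c
#include <stdbool.h>

/* Hypothetical property flags for a TSC calibration routine. */
enum tsc_props {
	TSC_PROP_NONE       = 0,
	TSC_PROP_KNOWN_FREQ = 1 << 0,
	TSC_PROP_RELIABLE   = 1 << 1,
};

/*
 * Accept the incoming routine only if it provides every property that the
 * outgoing routine already provides, i.e. never move "down" the
 * None => Known => Known and Reliable progression.
 */
static bool tsc_override_allowed(unsigned int outgoing, unsigned int incoming)
{
	return (incoming & outgoing) == outgoing;
}
```

Under this rule a PV routine may replace the native one, and a CoCo routine
may replace a PV one, but a PV routine that is only "known" cannot replace a
routine that is both "known" and "reliable".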

^ permalink raw reply

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Woodhouse, David @ 2026-05-20 23:49 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-32-seanjc@google.com>


On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a "tsc_properties" set of flags and use it to annotate whether the
> TSC operates at a known and/or reliable frequency when registering a
> paravirtual TSC calibration routine.  Currently, each PV flow manually
> sets the associated feature flags, but often in haphazard fashion that
> makes it difficult for unfamiliar readers to see the properties of the
> TSC when running under a particular hypervisor.
> 
> The other, bigger issue with manually setting the feature flags is that
> it decouples the flags from the calibration routine.  E.g. in theory, PV
> code could mark the TSC as having a known frequency, but then have its
> PV calibration discarded in favor of a method that doesn't use that known
> frequency.  Passing the TSC properties along with the calibration routine
> will allow adding sanity checks to guard against replacing a "better"
> calibration routine with a "worse" routine.
> 
> As a bonus, the flags also give developers working on new PV code a heads
> up that they should at least mark the TSC as having a known frequency.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

^ permalink raw reply

* Re: [PATCH v3 30/41] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC
From: David Woodhouse @ 2026-05-20 23:45 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-31-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Silently ignore attempts to switch to a paravirt sched_clock when running
> as a CoCo guest with trusted TSC.  In hand-wavy theory, a misbehaving
> hypervisor could attack the guest by manipulating the PV clock to affect
> guest scheduling in some weird and/or predictable way.  More importantly,
> reading TSC on such platforms is faster than any PV clock, and sched_clock
> is all about speed.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

And kvmclock. And Xen.

Are there *any* reasons we'd use a PV sched_clock when the TSC is
usable?

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> ---
>  arch/x86/kernel/tsc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 3c15fc10e501..ac4abfec1f05 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -283,6 +283,15 @@ bool using_native_sched_clock(void)
>  int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
>  				      void (*save)(void), void (*restore)(void))
>  {
> +	/*
> +	 * Don't replace TSC with a PV clock when running as a CoCo guest and
> +	 * the TSC is secure/trusted; PV clocks are emulated by the hypervisor,
> +	 * which isn't in the guest's TCB.
> +	 */
> +	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
> +	    boot_cpu_has(X86_FEATURE_TDX_GUEST))
> +		return -EPERM;
> +
>  	if (!stable)
>  		clear_sched_clock_stable();
>  


^ permalink raw reply

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 23:44 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-30-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Add a return code to __paravirt_set_sched_clock() so that the kernel can
> reject attempts to use a PV sched_clock without breaking the caller.  E.g.
> when running as a CoCo VM with a secure TSC, using a PV clock is generally
> undesirable.
> 
> Note, kvmclock is the only PV clock that does anything "extra" beyond
> simply registering itself as sched_clock, i.e. is the only caller that
> needs to check the new return value.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Oooh... can we use this to reject the kvmclock when we have a stable
and reliable TSC even for non-CoCo guests?

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
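
To make the "without breaking the caller" part concrete, here is a small
user-space sketch of the pattern, under the assumption that a rejected
registration simply leaves the current sched_clock in place. All names
(trusted_tsc, set_sched_clock(), pv_clock_init(), and so on) are hypothetical
stand-ins rather than the kernel's symbols.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool trusted_tsc;			/* stand-in for the CoCo checks */
static uint64_t (*active_sched_clock)(void);	/* NULL means "native" */
static bool pv_extra_setup_done;

/* Refuse a PV clock when the TSC is trusted; otherwise install it. */
static int set_sched_clock(uint64_t (*func)(void))
{
	if (trusted_tsc)
		return -EPERM;

	active_sched_clock = func;
	return 0;
}

static uint64_t pv_clock_read(void)
{
	return 42;	/* placeholder value, not a real clock */
}

/* A caller that does "extra" work only if the registration was accepted. */
static void pv_clock_init(void)
{
	if (set_sched_clock(pv_clock_read))
		return;

	pv_extra_setup_done = true;
}
```

Callers that do nothing beyond registering can ignore the return value; only
a caller with follow-up work needs the check.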

^ permalink raw reply

* Re: [PATCH v3 28/41] x86/paravirt: Mark __paravirt_set_sched_clock() as __init
From: David Woodhouse @ 2026-05-20 23:42 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-29-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Annotate __paravirt_set_sched_clock() as __init, and make its wrapper
> __always_inline to ensure sanitizers don't result in a non-inline version
> hanging around.  All callers run during __init, and changing sched_clock
> after boot would be all kinds of crazy.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-20 23:27 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-28-seanjc@google.com>

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> their kvmclock during CPU online.  While a redundant write to the MSR is
> technically ok, skip the registration when kvmclock is sched_clock so that
> it's somewhat obvious that kvmclock *needs* to be enabled during early
> bringup when it's being used as sched_clock.
> 
> Plumb in the BSP's resume path purely for documentation purposes.  Both
> KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> not super obvious that using KVM's hooks would be flawed.  E.g. it would
> work today, because KVM's hooks happen to run after/before timekeeping's
> hooks during suspend/resume, but that's sheer dumb luck as the order in
> which syscore_ops are invoked depends entirely on when a subsystem is
> initialized and thus registers its hooks.
> 
> Opportunsitically make the registration messages more precise to help
> debug issues where kvmclock is enabled too late.

That's a hard word to type, isn't it?

> Opportunstically WARN in kvmclock_{suspend,resume}() to harden against
> future bugs.

So is that :)

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 26/41] x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
From: David Woodhouse @ 2026-05-20 23:19 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-27-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 766 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> WARN if kvmclock is still suspended when its wallclock is read, i.e. when
> the kernel reads its persistent clock.  The wallclock subtly depends on
> the BSP's kvmclock being enabled, and returns garbage if kvmclock is
> disabled.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


Although I still hate the whole KVM wallclock thing, as the kvmclock
itself is monotonic_raw, so adding that to the wallclock epoch is kind
of wrong.

Maybe the host should update the wallclock occasionally to keep it up
to date...


Or maybe the guest should prefer the KVM_HC_CLOCK_PAIRING hypercall if
it exists, over kvm-wallclock.
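
To put a rough number on the drift concern above, here is a minimal
userspace sketch, not kernel code. It assumes the wall time is computed
as a fixed epoch plus an uncorrected (monotonic_raw-style) elapsed
counter, and that the counter runs off by a constant rate. All function
names and the 50 ppm figure used below are hypothetical and chosen only
for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical model: wall time = epoch sampled once + raw elapsed time. */
static int64_t wall_from_epoch_ns(int64_t epoch_ns, int64_t raw_elapsed)
{
	return epoch_ns + raw_elapsed;
}

/*
 * What an uncorrected counter reports after true_elapsed_ns of real time
 * when its rate is off by drift_ppm parts per million (assumed constant).
 */
static int64_t raw_elapsed_ns(int64_t true_elapsed_ns, int64_t drift_ppm)
{
	assert(true_elapsed_ns >= 0);
	return true_elapsed_ns + (true_elapsed_ns / 1000000) * drift_ppm;
}

/* Error of "epoch + raw" relative to the true wall time, in nanoseconds. */
static int64_t wall_error_ns(int64_t true_elapsed_ns, int64_t drift_ppm)
{
	const int64_t epoch_ns = 0;	/* the epoch cancels out of the error */
	int64_t computed = wall_from_epoch_ns(epoch_ns,
				raw_elapsed_ns(true_elapsed_ns, drift_ppm));

	return computed - (epoch_ns + true_elapsed_ns);
}
```

Under these assumptions, wall_error_ns(86400 * NSEC_PER_SEC, 50) comes
out to 4.32 seconds of error after one day, and nothing in the
"epoch + raw" scheme ever pulls it back unless the epoch is refreshed.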

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

