Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH] mshv: add vmbus dependency
From: Arnd Bergmann @ 2026-05-20  7:40 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Jork Loeser,
	Stanislav Kinsburskii
  Cc: Arnd Bergmann, linux-hyperv, linux-kernel

From: Arnd Bergmann <arnd@arndb.de>

When the vmbus driver is not part of the kernel, the mvhv_root
driver now fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Avoid this by adding an explicit Kconfig dependency. Note that
stubbing out the hv_vmbus_exists() based on configuration would
also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.

Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/hv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 52af086fdeb2..21193b571a80 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
-- 
2.39.5


^ permalink raw reply related

* Re: [PATCH] drivers: hv: use kmalloc_array in mshv_root_scheduler_init
From: sashiko-bot @ 2026-05-20  7:55 UTC (permalink / raw)
  To: Can Peng; +Cc: linux-hyperv
In-Reply-To: <20260520071632.557990-1-pengcan@kylinos.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] A CPU hotplug offline operation can lead to a NULL pointer dereference in `mshv_vp_dispatch` because user tasks can still execute on the offlining CPU after the per-CPU buffers have been freed.
--

commit 182b63fe189dbbf1da723a03d972eaa7cbb71f56
Author: Can Peng <pengcan@kylinos.cn>

drivers: hv: use kmalloc_array in mshv_root_scheduler_init

Replace kmalloc() with kmalloc_array() to prevent potential overflow, as
recommended in Documentation/process/deprecated.rst.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..146726cc4e9ba 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2241,7 +2241,7 @@ static int mshv_root_scheduler_init(unsigned int cpu)
>  	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
>  
>  	/* Allocate two consecutive pages. One for input, one for output. */
> -	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
> +	p = kmalloc_array(2, HV_HYP_PAGE_SIZE, GFP_KERNEL);
        ^^^^^^^^^^^^^

[Severity: High]
This isn't a bug introduced by this patch, but is there a pre-existing
issue where a CPU hotplug offline operation can lead to a NULL pointer
dereference in mshv_vp_dispatch()?

The CPU hotplug teardown callback mshv_root_scheduler_cleanup() is registered
at CPUHP_AP_ONLINE_DYN. During a CPU offline operation, it frees the per-CPU
buffers and sets the pointers to NULL:

drivers/hv/mshv_root_main.c:mshv_root_scheduler_cleanup() {
    ...
    *inputarg = NULL;
    *outputarg = NULL;
    ...
}

User tasks can still execute on the offlining CPU at this high state in
the offline sequence. If a vCPU thread scheduled on the offlining CPU invokes
mshv_vp_dispatch() after the teardown callback has run, it unconditionally
dereferences the input buffer:

drivers/hv/mshv_root_main.c:mshv_vp_dispatch() {
    ...
    preempt_disable();
    input = *this_cpu_ptr(root_scheduler_input);
    ...
    memset(input, 0, sizeof(*input));
    ...
}

Does this sequence result in a NULL pointer dereference if the input buffer
pointer has already been cleared by the cleanup callback?

>  	if (!p)
>  		return -ENOMEM;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260520071632.557990-1-pengcan@kylinos.cn?part=1

^ permalink raw reply

* Re: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-20 13:23 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat, sashiko-reviews
In-Reply-To: <20260519213450.50E611F000E9@smtp.kernel.org>

sashiko-bot@kernel.org wrote:
> - [Critical] Using `bytes_recvd` for `memcpy()` without checking
>   `vmbus_recvpacket()` return value leads to a massive heap buffer
>   overflow.

This one is bounded on this channel. hyperv_connect_vsp() calls
vmbus_open() without setting max_pkt_size, so the inbound ring uses
VMBUS_DEFAULT_MAX_PKT_SIZE (4096) and hv_pkt_iter_first() clamps the
packet length to pkt_buffer_size. bytes_recvd therefore cannot exceed
4096, well under the 16 KiB recv_buf and init_buf, and
vmbus_recvpacket() does not return -ENOBUFS here, so the memcpy length
stays bounded.

I will still gate the dispatch on a successful vmbus_recvpacket()
return in the next revision, as defense in depth, so the bound is
local instead of relying on the ring clamp.

> - [High] Strict sizeof() validation incorrectly rejects
>   dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.

Agreed. The response carries resolution_count entries, not the full
SYNTHVID_MAX_RESOLUTION_COUNT array, so checking against
sizeof(struct synthvid_supported_resolution_resp) is too strict. The
next revision validates the fixed prefix, reads and bounds
resolution_count, then requires only the count-sized array.

> - [High] Concurrent lockless write to `hv->init_buf` from VMBus
>   callback allows a malicious host to overwrite data while the guest
>   is validating it.
> - [High] Missing `reinit_completion()` before reusing the shared
>   `hv->wait` completion object.

Both pre-existing. On v2 Michael Kelley suggested splitting the
completion reinit into a separate patch on the resume path. The
init_buf reuse sits in the same area, so I plan to send the reinit and
the related response-type handling as a separate follow-up rather than
fold them into this size-validation change.

Thanks for the review.

Berkant

^ permalink raw reply

* Re: [PATCH v1 1/4] iommu: Move Hyper-V IOMMU driver to its own subdirectory
From: Jason Gunthorpe @ 2026-05-20 13:38 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <elo7enhxk5m7x4rc4quiqnkgbwsqa3ex7di2ksv7szu75xqe6x@zum5nwmkl6kn>

On Wed, May 20, 2026 at 02:37:03PM +0800, Yu Zhang wrote:
> On Fri, May 15, 2026 at 07:19:18PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 12, 2026 at 12:24:05AM +0800, Yu Zhang wrote:
> > > From: Easwar Hariharan <eahariha@linux.microsoft.com>
> > > 
> > > The Hyper-V IOMMU driver currently only supports IRQ remapping.
> > > As it will be adding DMA remapping support, prepare a directory
> > > to contain all the different feature files.
> > 
> > Any possibility we could put the irq remapping thing under the irq
> > directory?
> > 
> > The other drivers have it here because they are co-mingled with their
> > iommu HW, will hyperv have the same issue?
> > 
> 
> Good question. I don't think Hyper-V have the same co-mingling issue.
> 
> But from a code organization perspective, I think drivers/iommu/hyperv/
> is still the most natural place:
> 
> - The IRQ remapping framework itself (drivers/iommu/irq_remapping.c
>   and its internal header irq_remapping.h) lives under drivers/iommu/,
>   and all three backends (intel/, amd/, hyperv/) sit there today.
>   hyperv/irq_remapping.c includes that internal header directly.

IMHO it is a fair question if that even belongs under iommu.

I think it was dumped into here because of the co-mingled drivers..

Jason

^ permalink raw reply

* RE: [PATCH] mshv: add vmbus dependency
From: Michael Kelley @ 2026-05-20 13:46 UTC (permalink / raw)
  To: Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft), Jork Loeser,
	Stanislav Kinsburskii
  Cc: Arnd Bergmann, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260520074044.923728-1-arnd@kernel.org>

From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
> 
> When the vmbus driver is not part of the kernel, the mvhv_root
> driver now fails to link:
> 
> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
> 
> Avoid this by adding an explicit Kconfig dependency. Note that
> stubbing out the hv_vmbus_exists() based on configuration would
> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.

Conceptually, the MSHV root code should not have a dependency on
VMBus. The "does VMBus exist?" question should handled differently
by setting up a boolean in the core Hyper-V code that defaults to "false".
If the VMBus driver loads, it would set the boolean to "true". MSHV
root code would query the boolean.

Michael

> 
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/hv/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 52af086fdeb2..21193b571a80 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
>  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>  	# no particular order, making it impossible to reassemble larger pages
>  	depends on PAGE_SIZE_4KB
> +	depends on HYPERV_VMBUS
>  	select EVENTFD
>  	select VIRT_XFER_TO_GUEST_WORK
>  	select HMM_MIRROR
> --
> 2.39.5
> 


^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Arnd Bergmann @ 2026-05-20 13:51 UTC (permalink / raw)
  To: Michael Kelley, Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft),
	Jork Loeser, Stanislav Kinsburskii
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157CDF10D33D62DD8709D63D4012@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 20, 2026, at 15:46, Michael Kelley wrote:
> From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
>> 
>> When the vmbus driver is not part of the kernel, the mvhv_root
>> driver now fails to link:
>> 
>> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>> 
>> Avoid this by adding an explicit Kconfig dependency. Note that
>> stubbing out the hv_vmbus_exists() based on configuration would
>> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>
> Conceptually, the MSHV root code should not have a dependency on
> VMBus. The "does VMBus exist?" question should handled differently
> by setting up a boolean in the core Hyper-V code that defaults to "false".
> If the VMBus driver loads, it would set the boolean to "true". MSHV
> root code would query the boolean.

That makes sense to me, but it's outside of my area for a drive-by
build fix. Please treat my patch as a 'Reported-by' then, I expect
this can easily be addressed by Jork or someone else in the
hyperv team.

      Arnd

^ permalink raw reply

* RE: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Michael Kelley @ 2026-05-20 14:24 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, sashiko-reviews@lists.linux.dev
In-Reply-To: <177928341826.371979.14701698047864220449@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Wednesday, May 20, 2026 6:24 AM
> 
> sashiko-bot@kernel.org wrote:
> > - [Critical] Using `bytes_recvd` for `memcpy()` without checking
> >   `vmbus_recvpacket()` return value leads to a massive heap buffer
> >   overflow.
> 
> This one is bounded on this channel. hyperv_connect_vsp() calls
> vmbus_open() without setting max_pkt_size, so the inbound ring uses
> VMBUS_DEFAULT_MAX_PKT_SIZE (4096) and hv_pkt_iter_first() clamps the
> packet length to pkt_buffer_size. bytes_recvd therefore cannot exceed
> 4096, well under the 16 KiB recv_buf and init_buf, and
> vmbus_recvpacket() does not return -ENOBUFS here, so the memcpy length
> stays bounded.

Actually, the behavior of vmbus_recvpacket() is more subtle,
and the problem pointed out by Sashiko AI is real. I had forgotten
about this subtle behavior in my first reply to you on this topic. :-(

The incoming message from Hyper-V has a length encoded in it.
hv_ringbuffer_read() gets that message length, and if it is larger
than the buflen_parameter, -ENOBUFS is returned. But the
returned buffer_actual_len is also set to the incoming message
length provided by Hyper-V. This allows the caller to realize
it didn't provide enough buffer space, and also tells it how much
buffer space is needed. hv_pci_onchannelcallback() uses
this functionality to retry the receive operation with a larger
buffer. I think all the other callers just treat -ENOBUFS as a
fatal error, which is also fine.

> 
> I will still gate the dispatch on a successful vmbus_recvpacket()
> return in the next revision, as defense in depth, so the bound is
> local instead of relying on the ring clamp.

Indeed, the error check is needed, and it's not just defense in depth.

> 
> > - [High] Strict sizeof() validation incorrectly rejects
> >   dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.
> 
> Agreed. The response carries resolution_count entries, not the full
> SYNTHVID_MAX_RESOLUTION_COUNT array, so checking against
> sizeof(struct synthvid_supported_resolution_resp) is too strict. The
> next revision validates the fixed prefix, reads and bounds
> resolution_count, then requires only the count-sized array.

OK, good.

> 
> > - [High] Concurrent lockless write to `hv->init_buf` from VMBus
> >   callback allows a malicious host to overwrite data while the guest
> >   is validating it.

I don't think this actually happens. In hv_pkt_iter_first(), the message
is copied out of the ring buffer into a temporary buffer that is not
explicitly shared with the host. This temporary copy prevents a
malicious host from modifying the data while the guest is validating it.

> > - [High] Missing `reinit_completion()` before reusing the shared
> >   `hv->wait` completion object.
> 
> Both pre-existing. On v2 Michael Kelley suggested splitting the
> completion reinit into a separate patch on the resume path. The
> init_buf reuse sits in the same area, so I plan to send the reinit and
> the related response-type handling as a separate follow-up rather than
> fold them into this size-validation change.

Handling a timeout, and then the host providing a belated response
is a really messy problem. There's some mechanism using request IDs
in hv_ringbuffer_write() and vmbus_sendpacket_getid() to help match
up requests and responses, but my recollection is that even this
extra machinery is not 100% foolproof. You may or may not want
to go down the path of trying to fix it. :-)

A process comment:  The emails for v2 and v3 of your patch set
are being threaded in an unexpected way. They are showing up
as replies under the original v1. See https://lore.kernel.org/linux-hyperv/.
The preferred approach is for each version to start a new email
thread. The cover letter should start the new thread, and the
patches should show up as threaded under the cover letter.

Also, don't post a new version more frequently than every
24 hours at a minimum, and there's no problem with waiting
2 to 3 days. The idea is to give people a chance to review and
provide comments so that you can batch any changes in
response to the feedback.

Michael

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-20 15:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260515223139.GK7702@ziepe.ca>

On Fri, May 15, 2026 at 07:31:39PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:07AM +0800, Yu Zhang wrote:
> > +/*
> > + * Identity and blocking domains are static singletons: identity is a 1:1
> > + * passthrough with no page table, blocking rejects all DMA. Neither holds
> > + * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
> > + */
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> 
> Please follow the format of other drivers and statically initialize
> these here instead of in C code.
> 

Thanks! Will do.

> > +static struct iommu_ops hv_iommu_ops;
> > +static LIST_HEAD(hv_iommu_pci_bus_list);
> > +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> > +
> > +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> > +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> > +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL)
> > +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)
> 
> prefer to see static inlines
> 

Yes. Thanks.

> > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_detach_device_domain *input;
> > +	struct pci_dev *pdev;
> > +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > +	/* See the attach function, only PCI devices for now */
> > +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> > +		return;
> > +
> > +	pdev = to_pci_dev(dev);
> > +
> > +	dev_dbg(dev, "detaching from domain %d\n", hv_domain->device_domain.domain_id.id);
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	input->partition_id = HV_PARTITION_ID_SELF;
> > +	if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
> > +		local_irq_restore(flags);
> > +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on detach\n");
> > +		return;
> > +	}
> > +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> 
> FWIW the hypervisor cannot implement the linux attach semantics if it
> has detach/attach. It must support simply 'attach' which atomically
> changes the translation. With detach you have a confusing problem if
> errors happen in the middle of the sequence the device is left in an
> unclear state. You should at least document what state the hypervisor
> is supposed to leaave the translation iin during these failures..
> 

Thanks for raising this!

My understanding is that the detach hypercall just clears the HW
IOMMU entry for the device (e.g., context table entry on VT-d, DTE
on AMD IOMMU) and flushes the IOTLB. So after detach and before the
subsequent attach, DMA from the device is blocked by the HW IOMMU
- the device is not in an unprotected state.

The detach shall not fail unless the pvIOMMU driver passes incorrect
parameters or is otherwise in a buggy state, or Hyper-V itself has
a bug. 

But making attach atomic might be a cleaner approach (drop the
explicit detach and let the hypervisor handle the domain switch
internally)? I'll look into making that work, though it needs
verification (and possibly changes) on the hypervisor side (these
hypercalls are not exclusively for Linux guest). If that doesn't
work out, I'll keep the two-step approach with comments documenting
the intermediate translation state.

> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (pdev->ats_enabled)
> > +		pci_disable_ats(pdev);
> > +
> > +	dev_iommu_priv_set(dev, NULL);
> > +	set_dma_ops(dev, NULL);
> 
> Does the driver really need to mess with set_dma_ops ? I'd rather not
> see that in new iommu drivers if at all possible :|
> 

No. I should have removed it. Thanks!

> > +static int __init hv_initialize_static_domains(void)
> > +{
> > +	int ret;
> > +	struct hv_iommu_domain *hv_domain;
> > +
> > +	/* Default stage-1 identity domain */
> > +	hv_domain = &hv_identity_domain;
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> > +	if (ret)
> > +		goto delete_identity_domain;
> > +
> > +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> > +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> > +	hv_domain->domain.owner = &hv_iommu_ops;
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> > +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> 
> identity doesn't use geometry or pgsize_bitmap
> 

Sure. Will remove it.

> > +	/* Default stage-1 blocked domain */
> > +	hv_domain = &hv_blocking_domain;
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret)
> > +		goto delete_identity_domain;
> > +
> > +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> > +	if (ret)
> > +		goto delete_blocked_domain;
> > +
> > +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> > +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> > +	hv_domain->domain.owner = &hv_iommu_ops;
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> > +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> 
> Nor does blocked
> 

Same. Will remove.

> > +#define INTERRUPT_RANGE_START	(0xfee00000)
> > +#define INTERRUPT_RANGE_END	(0xfeefffff)
> > +static void hv_iommu_get_resv_regions(struct device *dev,
> > +		struct list_head *head)
> > +{
> > +	struct iommu_resv_region *region;
> > +
> > +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> > +				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
> > +				      0, IOMMU_RESV_MSI, GFP_KERNEL);
> 
> Surprised these constants are not discovered from the hv?
> 

Thanks. Since the pvIOMMU currently only targets x86, hardcoding
the x86 architectural MSI range should be reasonable?

If in the future we need additional reserved regions (e.g.,
RMRR-style identity-mapped ranges), those would be discovered
from the hypervisor.

> > +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > +{
> > +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +}
> > +
> > +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> > +				struct iommu_iotlb_gather *iotlb_gather)
> > +{
> > +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +
> > +	iommu_put_pages_list(&iotlb_gather->freelist);
> > +}
> 
> Full invalidation only huh?
> 

Page-selective IOTLB flush is added in patch 4/4.

> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +};
> > +
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +};
> 
> Usually I would expect these to have their own attach
> functions. blocking in particular must have an attach op that cannot
> fail. It is used to recover the device back to a known translation in
> case of cascading other errors.
> 

For blocking domain, the hypercall handler of such attach essentially
disable the translation and IOPF for the device. It should not fail
unless there is a bug in the driver or the hypervisor.
How about something like this?

	static int hv_blocking_attach_dev(domain, dev, old)
	{
		...
		status = hv_do_hypercall(HVCALL_ATTACH, blocking, dev);
		WARN_ON(!hv_result_success(status));
		vdev->hv_domain = blocking;
		return 0;
	}

For identity domain, would it be fine to keep sharing the same attach
callback with paging domain?

> > +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +	IOMMU_PT_DOMAIN_OPS(x86_64),
> > +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> > +	.iotlb_sync = hv_iommu_iotlb_sync,
> > +	.free = hv_iommu_paging_domain_free,
> > +};
> > +
> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> > +{
> > +	int ret;
> > +	struct hv_iommu_domain *hv_domain;
> > +	struct pt_iommu_x86_64_cfg cfg = {};
> > +
> > +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> > +	if (!hv_domain)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret) {
> > +		kfree(hv_domain);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> 
> geoemtry shouldn't be set here, it is overriden by
> pt_iommu_x86_64_init() with the exact page table configuration
> 

Right. Will remove.
> > +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> > +			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> > +{
> > +	ida_init(&hv_iommu->domain_ids);
> > +
> > +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> > +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> > +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> > +	    hv_iommu->max_iova_width > 48) {
> > +		pr_info("5-level paging not supported, limiting iova width to 48.\n");
> > +		hv_iommu->max_iova_width = 48;
> > +	}
> > +
> > +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> > +		.aperture_start = 0,
> > +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
> > +		.force_aperture = true,
> > +	};
> > +
> > +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> > +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> > +	/* Only x86 page sizes (4K/2M/1G) are supported */
> > +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> > +				  (SZ_4K | SZ_2M | SZ_1G);
> > +	if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> > +		pr_warn("unsupported page sizes masked: 0x%llx -> 0x%llx\n",
> > +			hv_iommu_cap->pgsize_bitmap, hv_iommu->pgsize_bitmap);
> 
> IKD if you need this logic, the way the page table code is used it
> just sort of falls out naturally that other page sizes are ignored.
> 

Right, the page table code can handle 4K/2M/1G naturally. We'd like
hypervisor to be able to choose to support only a subset of {4K, 2M,
1G}. This masking filters out any unexpected page sizes and falls
back to 4K if nothing valid remains.

The if statement may be a bit paranoid perhaps, but it felt safer
to validate explicitly. Happy to drop it if you think it's unnecessary.


> > +struct hv_iommu_domain {
> > +	union {
> > +		struct iommu_domain    domain;
> > +		struct pt_iommu        pt_iommu;
> > +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> > +	};
> 
> You should retain the build assertions here
> 

Sure. Thanks!

B.R.
Yu

> Jason
> 

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-20 15:50 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com, Mukesh R
In-Reply-To: <SN6PR02MB4157A1B2D9B56062A0917BC5D4042@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, May 15, 2026 at 05:36:40PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 9:54 AM
> > 
> > On Fri, May 15, 2026 at 02:51:38PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
> > > >
> > > > On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> > > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> 
> [....]
> 
> > > > >
> > > > > Previous versions of this function did hv_iommu_detach_dev(). With that call
> > > > > removed from here, hv_iommu_detach_dev() is only called when attaching a
> > > > > domain to a device that already has a domain attached. Is it the case that
> > > > > Hyper-V doesn't require the detach as a cleanup step?
> > > > >
> > > >
> > > > The IOMMU core attaches the device to release_domain (our blocking domain)
> > > > before calling release_device(), so I believe the explicit detach in the RFC
> > > > was redundant. I simply didn't realize that at the time.
> > > >
> > >
> > > Got it. But after the IOMMU core attaches the device to the blocking
> > > domain, there's the possibility that the vPCI device is rescinded by
> > > Hyper-V and it goes away entirely. Or the device might be subjected
> > > to an "unbind/bind" cycle in Linux. Does the detach need to be done
> > > on the blocking domain in such cases? In this version of the patches, the
> > > Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
> > > seems a bit untidy at best, and I wonder if there are scenarios where
> > > Hyper-V will complain about the lack of balance.
> > >
> > 
> > Thank you, Michael. May I ask what "the vPCI device is rescinded by
> > Hyper-V and it goes away entirely" mean?
> > 
> 
> See the documentation at Documentation/virt/hyperv/vpci.rst in a
> kernel source code tree, and particularly the section entitled "PCI Device
> Removal". Such removals can and do happen in running Azure guest
> VMs. Start with that info and then I'll do my best to answer follow-up
> questions you may have.
> 
> The unbind/bind case is separate, but has some of the same effects in
> that Linux should be removing all setup of the PCI device. There's actually
> two unbind steps -- one to unbind the device-specific driver (e.g., the
> Mellanox MLX5 driver or the NMVe driver) driver from the device, and
> potentially a second to unbind the VMBus vPCI driver from the device.
> These unbind/bind sequences can be done in the Linux guest without
> the Hyper-V host rescinding the device.

Thanks for pointing me to the vpci.rst doc, Michael!

IIUC, for the vPCI rescind case and the VMBus vPCI driver unbind
case, both will trigger iommu_deinit_device(), in which IOMMU core
attaches the device to our blocking domain. And hypervisor will
handle such attaching to the blocking domain by essentially disable
the DMA translation and IOPF for the device. Since the device is
already in a safe state after that, I don't think an additional
detach is necessary.

For the device-specific driver unbind (e.g., unbind nvme then
bind back, or bind to vfio-pci), that is transparent to the IOMMU
layer.

Also, while looking at Jason's comments on the detach/attach
semantics, I'm now thinking that making the attach hypercall
atomic (having the hypervisor handle the domain switch
internally) might be a cleaner approach. That also eliminates
the attach/detach imbalance issue entirely. I need to do some
verification on the hypervisor side first, but will keep you
posted. Thanks!

B.R.
Yu

> 
> Michael
> 

^ permalink raw reply

* [PATCH] hyperv: move vmbus_root_device outside of the VMBus driver
From: Hamza Mahfooz @ 2026-05-20 15:49 UTC (permalink / raw)
  To: linux-hyperv
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stanislav Kinsburskii, Anirudh Rayabharam (Microsoft),
	Jork Loeser, Michael Kelley, Hamza Mahfooz, Arnd Bergmann

Conceptually, we shouldn't have to build the VMBus driver to answer the
question "does VMBus exist?" So, move vmbus_root_device and the
functions that access it directly to hyperv.h. Also, introduce
hv_set_vmbus_root_device() to allow vmbus_drv to set the root device.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/r/20260520074044.923728-1-arnd@kernel.org/
Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
 drivers/hv/vmbus_drv.c | 27 ++++++---------------------
 include/linux/hyperv.h | 18 ++++++++++++++++--
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 71f1b4d52f7f..b26e6b42fa49 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -47,9 +47,6 @@ struct vmbus_dynid {
 	struct hv_vmbus_device_id id;
 };
 
-/* VMBus Root Device */
-static struct device  *vmbus_root_device;
-
 static int hyperv_cpuhp_online;
 
 static DEFINE_PER_CPU(long, vmbus_evt);
@@ -95,18 +92,6 @@ static struct resource *fb_mmio;
 static struct resource *hyperv_mmio;
 static DEFINE_MUTEX(hyperv_mmio_lock);
 
-struct device *hv_get_vmbus_root_device(void)
-{
-	return vmbus_root_device;
-}
-EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device);
-
-bool hv_vmbus_exists(void)
-{
-	return vmbus_root_device != NULL;
-}
-EXPORT_SYMBOL_GPL(hv_vmbus_exists);
-
 static u8 channel_monitor_group(const struct vmbus_channel *channel)
 {
 	return (u8)channel->offermsg.monitorid / 32;
@@ -903,7 +888,7 @@ static int vmbus_dma_configure(struct device *child_device)
 	 * On x86/x64 coherence is assumed and these calls have no effect.
 	 */
 	hv_setup_dma_ops(child_device,
-		device_get_dma_attr(vmbus_root_device) == DEV_DMA_COHERENT);
+		device_get_dma_attr(hv_get_vmbus_root_device()) == DEV_DMA_COHERENT);
 	return 0;
 }
 
@@ -2172,7 +2157,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 		     &child_device_obj->channel->offermsg.offer.if_instance);
 
 	child_device_obj->device.bus = &hv_bus;
-	child_device_obj->device.parent = vmbus_root_device;
+	child_device_obj->device.parent = hv_get_vmbus_root_device();
 	child_device_obj->device.release = vmbus_device_release;
 
 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
@@ -2561,7 +2546,7 @@ static int vmbus_acpi_add(struct platform_device *pdev)
 	struct acpi_device *ancestor;
 	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
 
-	vmbus_root_device = &device->dev;
+	hv_set_vmbus_root_device(&device->dev);
 
 	/*
 	 * Older versions of Hyper-V for ARM64 fail to include the _CCA
@@ -2648,7 +2633,7 @@ static int vmbus_device_add(struct platform_device *pdev)
 	struct device_node *np = pdev->dev.of_node;
 	int ret;
 
-	vmbus_root_device = &pdev->dev;
+	hv_set_vmbus_root_device(&pdev->dev);
 
 	ret = of_range_parser_init(&parser, np);
 	if (ret)
@@ -2969,7 +2954,7 @@ static int __init hv_acpi_init(void)
 	if (ret)
 		return ret;
 
-	if (!vmbus_root_device) {
+	if (!hv_vmbus_exists()) {
 		ret = -ENODEV;
 		goto cleanup;
 	}
@@ -3000,7 +2985,7 @@ static int __init hv_acpi_init(void)
 
 cleanup:
 	platform_driver_unregister(&vmbus_platform_driver);
-	vmbus_root_device = NULL;
+	hv_set_vmbus_root_device(NULL);
 	return ret;
 }
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 5459e776ec17..3e8ea7d9653c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1302,9 +1302,23 @@ static inline void *hv_get_drvdata(struct hv_device *dev)
 	return dev_get_drvdata(&dev->device);
 }
 
-struct device *hv_get_vmbus_root_device(void);
+/* Do *not* use this outside of hyperv.h */
+static struct device *__vmbus_root_device __read_mostly;
 
-bool hv_vmbus_exists(void);
+static inline void hv_set_vmbus_root_device(struct device *dev)
+{
+	__vmbus_root_device = dev;
+}
+
+static inline struct device *hv_get_vmbus_root_device(void)
+{
+	return __vmbus_root_device;
+}
+
+static inline bool hv_vmbus_exists(void)
+{
+	return hv_get_vmbus_root_device();
+}
 
 struct hv_ring_buffer_debug_info {
 	u32 current_interrupt_mask;
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-20 16:27 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <BN7PR02MB4148D820AEBBBB9A1268D640D4042@BN7PR02MB4148.namprd02.prod.outlook.com>

On Fri, May 15, 2026 at 11:33:21PM +0000, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Friday, May 15, 2026 11:00 AM
> > 
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 9:24 AM
> > >
> > > On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > > > >
> > 
> > [....]
> > 
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > >
> > > > I don't understand why __ffs() is correct here. I would expect
> > > > __fls() so it is consistent with the calculation of size_align. But I
> > > > can only surmise how the hypercall works since there's no
> > > > documentation, so maybe my understanding of the hypercall is
> > > > wrong.   If __ffs really is correct, a comment explaining why
> > > > would help. :-)
> > > >
> > >
> > > The use of __ffs() is intentional. Each flush entry invalidates a
> > > naturally aligned 2^N page block, and the hypervisor requires the
> > > page_number to be aligned to 2^page_mask_shift.
> > >
> > > Here __ffs() and __fls() serve different purposes:
> > > - __ffs(start_pfn) is about the alignment constraint, e.g.,  how
> > > large a block can this address support?
> > > - __fls(nr_pages) is about  the size constraint, e.g., how large
> > > a block can the remaining range hold?
> > >
> > > Taking min() of both ensures each entry is both properly aligned
> > > and within bounds.
> > >
> > > Thanks for raising this - it definitely deserves a comment. I had to
> > > stare at it for a while myself to remember why. :)
> > 
> > Hmmm. Something about this still nags at me. I'll run some
> > experiments to either convince myself that you are right, or to
> > come up with a counterexample.
> 
> I have resolved what was nagging at me. My understanding of how
> _ffs() and __fls() work was wrong. :-(  Your code is correct.
> 
> > 
> > A related thought occurred to me. If each flush entry that is passed
> > to Hyper-V describes a naturally aligned 2^N page block, I don't
> > think the HV_IOMMU_MAX_FLUSH_VA_COUNT can ever
> > be reached. The number of entries is limited by the number of
> > bits in a PFN and the pages count, both of which are 64. And with
> > 52 bit physical addressing and 4KiB pages, the actual size of
> > a PFN and pages count is even smaller than 64.
> > HV_IOMMU_MAX_FLUSH_VA_COUNT is the number of 8 byte
> > union hv_iommu_flush_va entries that fit in a 4KiB page, so
> > it's ~500.
> > 
> > My statement applies to a single flush range. If multiple flush
> > ranges were strung together in a single hypercall, a larger count
> > could be reached, but hv_flush_device_domain_list() only does
> > a single range. So I don't think the overflow case in
> > hv_flush_device_domain_list() can ever happen. But let me
> > do my experiments, and I will also look at this aspect to confirm
> > if it's right.
> 
> My experiments also convince me that the overflow case can't
> happen as long as the hypercall is being populated with entries
> derived from a single range.

Gosh. You're right. I think I've been overthinking this :( 

So we can simplify this to just do the list flush and fall back
to HVCALL_FLUSH_DEVICE_DOMAIN only if the list flush fails.

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Sean Christopherson @ 2026-05-20 16:40 UTC (permalink / raw)
  To: Michael Kelley
  Cc: sashiko-reviews@lists.linux.dev, linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D50A944A2794475E32FED4002@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, May 19, 2026, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Monday, May 18, 2026 3:18 PM
> > > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> > > >
> > > >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> > > >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > > > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > > > +
> > > > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > > > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > > > +
> > > > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > > > +						  tsc_properties);
> > > >  	}
> > >
> > > [ ... ]
> > >
> > > > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> > > >  		 * is called.
> > > >  		 */
> > > >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > > >  	}
> > >
> > > If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> > > does it bypass the tsc_register_calibration_routines() block entirely?
> > 
> > Yes.
> > 
> > > Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> > > here, it looks like these VMs will lose the reliable flag.
> > >
> > > Will this inadvertently enable the TSC watchdog, potentially causing a
> > > performance regression if the system falsely marks the TSC as unstable due
> > > to virtualization scheduling delays?
> > 
> > Hmm, I was going to say that the change was intentional and desriable, but looking
> > at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
> > just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
> > X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
> > X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
> > as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
> > info, the host is still clearly telling the guest that the TSC is reliable.
> 
> Yes, I agree. But I'm doubtful that such a combination ever occurs in practice.
> I've never seen an occurrence of a Hyper-V (even really old versions) guest
> where the frequency MSRs are not available or are not accessible. The
> Hyper-V spec allows for either condition, so we have the code to test the
> flags. I've thought of the flags as always set, though I suppose one never
> knows what the future holds.
> 
> > 
> > Michael, does keeping the
> > 
> > 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > 
> > but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
> > sense?
> 
> I don't see that it would break anything. But it seems a bit disjointed in
> that HV_ACCESS_TSC_INVARIANT is tested in two places in
> ms_hyperv_init_platform(). Does TSC_RELIABLE *need* to be passed to
> tsc_register_calibration_routines() if later code in ms_hyperv_init_platform()
> does setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)?

Sort of?  It's not strictly necessary, but passing in TSC_RELIABLE allows
tsc_register_calibration_routines() to ensure it doesn't clobber a more robust
calibration routine with a "lesser" routine.  I.e. not passing TSC_RELIABLE in
this case would trigger a false positive (and break Hyper-V).

In other words, invoking setup_force_cpu_cap() is a (happy, desirable) side effect,
not the primary goal.
 
> In other words, I'm suggesting let tsc_register_calibration_routines() handle
> the TSC_FREQ_KNOWN case since that's what the calibration routines are all
> about. Leave the setting of X86_FEATURE_TSC_RELIABLE to the later code that
> tests HV_ACCESS_TSC_INVARIANT, instead of duplicating the
> setup_force_cpu_cap() operation.
>
> While combining FREQUENCY_KNOWN and RELIABLE into
> tsc_register_calibration_routines() is convenient, the two
> concepts turn out to be independent when looking strictly at
> the Hyper-V spec and code written to follow that spec.
> Combining them into the same function ends up being clumsy

Yeah, it's a bit awkward for Hyper-V, but Hyper-V is definitely the odd one out
here, in that it has an "out-of-band" feature that marks the TSC as reliable.
All other PV features that trigger overrides of the calibration routines bundle
the RELIABLE aspect with the feature itself.

^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Jork Loeser @ 2026-05-20 17:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Michael Kelley, Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft),
	Stanislav Kinsburskii, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <56938994-3795-415a-8be3-16e28236082b@app.fastmail.com>

On Wed, 20 May 2026, Arnd Bergmann wrote:

> On Wed, May 20, 2026, at 15:46, Michael Kelley wrote:
>> From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
>>>
>>> When the vmbus driver is not part of the kernel, the mvhv_root
>>> driver now fails to link:
>>>
>>> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>>>
>>> Avoid this by adding an explicit Kconfig dependency. Note that
>>> stubbing out the hv_vmbus_exists() based on configuration would
>>> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>>
>> Conceptually, the MSHV root code should not have a dependency on
>> VMBus. The "does VMBus exist?" question should handled differently
>> by setting up a boolean in the core Hyper-V code that defaults to "false".
>> If the VMBus driver loads, it would set the boolean to "true". MSHV
>> root code would query the boolean.
>
> That makes sense to me, but it's outside of my area for a drive-by
> build fix. Please treat my patch as a 'Reported-by' then, I expect
> this can easily be addressed by Jork or someone else in the
> hyperv team.

Apologies for letting this linger - we were looking for the best fix 
internally. While Michael is correct that conceptually the dependency is 
not ideal, I think it's the right fix *for now*.

Yes, this would not allow MSHV root without VMBUS, which is relevant for 
MSHV root only, which misses a few other components. For L1VH it is the 
right fix that also enforces module load order.

Ultimately (and this is planned), we need a SYNIC driver, and the 
hv_vmbus_exists() will vanish for this usecase. It is a hack even now.

Best,
Jork


^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-20 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Michael Kelley
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, jacob.pan, tgopinath,
	easwar.hariharan
In-Reply-To: <20260515223545.GL7702@ziepe.ca>

On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > +					  unsigned long start,
> > +					  unsigned long end)
> > +{
> > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > +	unsigned long nr_pages = end_pfn - start_pfn;
> > +	u16 count = 0;
> > +
> > +	while (nr_pages > 0) {
> > +		unsigned long flush_pages;
> > +		int order;
> > +		unsigned long pfn_align;
> > +		unsigned long size_align;
> > +
> > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > +			break;
> > +		}
> > +
> > +		if (start_pfn)
> > +			pfn_align = __ffs(start_pfn);
> > +		else
> > +			pfn_align = BITS_PER_LONG - 1;
> > +
> > +		size_align = __fls(nr_pages);
> > +		order = min(pfn_align, size_align);
> > +		iova_list[count].page_mask_shift = order;
> > +		iova_list[count].page_number = start_pfn;
> > +
> > +		flush_pages = 1UL << order;
> > +		start_pfn += flush_pages;
> > +		nr_pages -= flush_pages;
> > +		count++;
> > +	}
> 
> This seems like a really silly hypervisor interface. Why doesn't it
> just accept a normal range? Splitting it into power of two aligned
> ranges is very inefficient.

Fair point. I'm not sure how much flexibility we have to change
this hypercall interface at the moment - it predates the pvIOMMU
work and may have other consumers beyond Linux guest. On the other
hand, having the guest specify 2^N-aligned blocks does save the
hypervisor from having to decompose ranges itself before issuing
hardware invalidation commands - the guest-provided entries can be
fed to the HW more or less directly.

That said, the way I'm currently using this interface may be
more precise than necessary. Maybe we have 2 options:

1) Current approach: decompose the range into multiple exact
   2^N-aligned blocks with no over-flush, but at the cost of
   more complex calculations and more entries.

2) Follow what Intel/AMD drivers do: find a single minimal
   2^N-aligned block that covers the entire range, but may
   over-flush.

Any preference?

@Michael, since you've also been reviewing this patch, I'd
appreciate your thoughts on the above as well. :)

Yu

^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Jork Loeser @ 2026-05-20 17:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Stanislav Kinsburskii,
	Arnd Bergmann, linux-hyperv, linux-kernel
In-Reply-To: <20260520074044.923728-1-arnd@kernel.org>

On Wed, 20 May 2026, Arnd Bergmann wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> When the vmbus driver is not part of the kernel, the mvhv_root
> driver now fails to link:
>
> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>
> Avoid this by adding an explicit Kconfig dependency. Note that
> stubbing out the hv_vmbus_exists() based on configuration would
> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/hv/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 52af086fdeb2..21193b571a80 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
> 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> 	# no particular order, making it impossible to reassemble larger pages
> 	depends on PAGE_SIZE_4KB
> +	depends on HYPERV_VMBUS
> 	select EVENTFD
> 	select VIRT_XFER_TO_GUEST_WORK
> 	select HMM_MIRROR
> -- 
> 2.39.5
>

Yes, this is the right short-term fix. We will need to solve the root case 
(no VMBUS required) with a separate SYNIC driver abstraction.

Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>

Thank you!
Jork


^ permalink raw reply

* Re: [PATCH] hyperv: move vmbus_root_device outside of the VMBus driver
From: Jork Loeser @ 2026-05-20 17:28 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-hyperv, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Stanislav Kinsburskii,
	Anirudh Rayabharam (Microsoft), Michael Kelley, Arnd Bergmann
In-Reply-To: <20260520154943.898386-1-hamzamahfooz@linux.microsoft.com>

On Wed, 20 May 2026, Hamza Mahfooz wrote:

> Conceptually, we shouldn't have to build the VMBus driver to answer the
> question "does VMBus exist?" So, move vmbus_root_device and the
> functions that access it directly to hyperv.h. Also, introduce
> hv_set_vmbus_root_device() to allow vmbus_drv to set the root device.
>
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Closes: https://lore.kernel.org/r/20260520074044.923728-1-arnd@kernel.org/
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
> drivers/hv/vmbus_drv.c | 27 ++++++---------------------
> include/linux/hyperv.h | 18 ++++++++++++++++--
> 2 files changed, 22 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 71f1b4d52f7f..b26e6b42fa49 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -47,9 +47,6 @@ struct vmbus_dynid {
> 	struct hv_vmbus_device_id id;
> };
>
> -/* VMBus Root Device */
> -static struct device  *vmbus_root_device;
> -
> static int hyperv_cpuhp_online;
>
> static DEFINE_PER_CPU(long, vmbus_evt);
> @@ -95,18 +92,6 @@ static struct resource *fb_mmio;
> static struct resource *hyperv_mmio;
> static DEFINE_MUTEX(hyperv_mmio_lock);
>
> -struct device *hv_get_vmbus_root_device(void)
> -{
> -	return vmbus_root_device;
> -}
> -EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device);
> -
> -bool hv_vmbus_exists(void)
> -{
> -	return vmbus_root_device != NULL;
> -}
> -EXPORT_SYMBOL_GPL(hv_vmbus_exists);
> -
> static u8 channel_monitor_group(const struct vmbus_channel *channel)
> {
> 	return (u8)channel->offermsg.monitorid / 32;
> @@ -903,7 +888,7 @@ static int vmbus_dma_configure(struct device *child_device)
> 	 * On x86/x64 coherence is assumed and these calls have no effect.
> 	 */
> 	hv_setup_dma_ops(child_device,
> -		device_get_dma_attr(vmbus_root_device) == DEV_DMA_COHERENT);
> +		device_get_dma_attr(hv_get_vmbus_root_device()) == DEV_DMA_COHERENT);
> 	return 0;
> }
>
> @@ -2172,7 +2157,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> 		     &child_device_obj->channel->offermsg.offer.if_instance);
>
> 	child_device_obj->device.bus = &hv_bus;
> -	child_device_obj->device.parent = vmbus_root_device;
> +	child_device_obj->device.parent = hv_get_vmbus_root_device();
> 	child_device_obj->device.release = vmbus_device_release;
>
> 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
> @@ -2561,7 +2546,7 @@ static int vmbus_acpi_add(struct platform_device *pdev)
> 	struct acpi_device *ancestor;
> 	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
>
> -	vmbus_root_device = &device->dev;
> +	hv_set_vmbus_root_device(&device->dev);
>
> 	/*
> 	 * Older versions of Hyper-V for ARM64 fail to include the _CCA
> @@ -2648,7 +2633,7 @@ static int vmbus_device_add(struct platform_device *pdev)
> 	struct device_node *np = pdev->dev.of_node;
> 	int ret;
>
> -	vmbus_root_device = &pdev->dev;
> +	hv_set_vmbus_root_device(&pdev->dev);
>
> 	ret = of_range_parser_init(&parser, np);
> 	if (ret)
> @@ -2969,7 +2954,7 @@ static int __init hv_acpi_init(void)
> 	if (ret)
> 		return ret;
>
> -	if (!vmbus_root_device) {
> +	if (!hv_vmbus_exists()) {
> 		ret = -ENODEV;
> 		goto cleanup;
> 	}
> @@ -3000,7 +2985,7 @@ static int __init hv_acpi_init(void)
>
> cleanup:
> 	platform_driver_unregister(&vmbus_platform_driver);
> -	vmbus_root_device = NULL;
> +	hv_set_vmbus_root_device(NULL);
> 	return ret;
> }
>
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 5459e776ec17..3e8ea7d9653c 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1302,9 +1302,23 @@ static inline void *hv_get_drvdata(struct hv_device *dev)
> 	return dev_get_drvdata(&dev->device);
> }
>
> -struct device *hv_get_vmbus_root_device(void);
> +/* Do *not* use this outside of hyperv.h */
> +static struct device *__vmbus_root_device __read_mostly;
>
> -bool hv_vmbus_exists(void);
> +static inline void hv_set_vmbus_root_device(struct device *dev)
> +{
> +	__vmbus_root_device = dev;
> +}
> +
> +static inline struct device *hv_get_vmbus_root_device(void)
> +{
> +	return __vmbus_root_device;
> +}
> +
> +static inline bool hv_vmbus_exists(void)
> +{
> +	return hv_get_vmbus_root_device();
> +}
>
> struct hv_ring_buffer_debug_info {
> 	u32 current_interrupt_mask;
> -- 
> 2.54.0
>

Thank you Hamza. Yes, one could do this. It still does not solve the 
problem of loading MSHV-root first, then VMBUS. The entire 
hv_vmbus_exists() test is a workaround that mostly works, but shouldn't be 
there in the first place. What really needs to happen is a SYNIC driver so 
components can load in arbitrary orders if compiled as modules. Which in 
turn is outside of my original goal to fix L1VH's kexec. The SYNIC 
abstraction needs to come no later than the root/nested capability.

Best,
Jork

^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Jason Gunthorpe @ 2026-05-20 17:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Michael Kelley, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	jacob.pan, tgopinath, easwar.hariharan
In-Reply-To: <lxmfd2ml5dafkxquuf5i45uqgh6wxl44hlqphu77kvximqrnmi@b3htoyjc6d5z>

On Thu, May 21, 2026 at 01:15:24AM +0800, Yu Zhang wrote:

> 2) Follow what Intel/AMD drivers do: find a single minimal
>    2^N-aligned block that covers the entire range, but may
>    over-flush.

If it is plugged straight into a HW invalidation I would do this..

Jason

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 17:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	pbonzini@redhat.com, kys@microsoft.com, decui@microsoft.com,
	daniel.lezcano@kernel.org, wei.liu@kernel.org,
	peterz@infradead.org, jgross@suse.com, boris.ostrovsky@oracle.com,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	mhklinux@outlook.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <949e39aec749f019b18fa41c2a42bcc9231288b9.camel@amazon.co.uk>

On Mon, May 18, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > 
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
> >  	static_call_update(pv_steal_clock, xen_steal_clock);
> >  	paravirt_set_sched_clock(xen_sched_clock);
> >  
> > -	x86_platform.calibrate_tsc = xen_tsc_khz;
> > +	tsc_register_calibration_routines(xen_tsc_khz, NULL);
> >  	x86_platform.get_wallclock = xen_get_wallclock;
> >  }
> >  
> 
> xen_tsc_khz() doesn't use CPUID but really *should*.
> 
> Care to pull in
> https://lore.kernel.org/all/20260509224824.3264567-31-dwmw2@infradead.org/
> to your next round please?
> 
> (Without the misplaced changes in kvm/x86.c that should have been in
> two different prior commits, and are now folded into those correctly in
> my kvmclock5 branch ready for the next posting of that).

Ya, will do.  What's one more patch...

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-20 17:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <7260682b21c28d1299e58400b9a2f4b8d23bd434.camel@infradead.org>

On Tue, May 19, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> > like this through the tip tree?
> > 
> > The bulk of the changes are in kvmclock and TSC, but pretty much every
> > hypervisor's guest-side code gets touched at some point.  I am reaonsably
> > confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> > v2, and while there were conflicts when rebasing, they were largely
> > superficial (and I've just jinxed myself).  For all other hypervisors, assume
> > the code is compile-tested only, but those changes are all quite small and
> > straightforward.
> > 
> > The only changes that are questionable/contentious are the last two patches,
> > which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> > AMD (that's the dubious part).  I very deliberately put them last, so that
> > they can be dropped at will (I don't care terribly if those patches land).
> > To merge them, I would want explicit Acks from Paolo and David W.
> > 
> > So, except for the last two patches, to get the stuff I really care about
> > landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> > reviews/acks?
> > 
> > The primary goal of this series is (or at least was, when I started) to
> > fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> > hypervisor is used instead of the secure/trusted TSC that is controlled by
> > trusted firmware.
> > 
> > The secondary goal is to draft off of the SNP and TDX changes to slightly
> > modernize running under KVM.  Currently, KVM guests will use TSC for
> > clocksource, but not sched_clock.  And they ignore Intel's CPUID-based TSC
> > and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> > And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> > guests can use that for the APIC timer period instead of manually calibrating
> > the frequency.
> > 
> > The tertiary goal is to clean up all of the PV clock code to deduplicate logic
> > across hypervisors, and to hopefully make it all easier to maintain going
> > forward.
> 
> I booted this in qemu with -cpu host,+invtsc,+vmware-cpuid-freq
> 
> I was expecting to see it eschew the kvmclock and use *only* the TSC.
> Is there even any need for 'tsc-early' given that it's *told* the TSC
> frequency in CPUID? Shouldn't it have detected that the TSC is known
> before init_tsc_clocksource() runs?
>
> And then it even spent some time at boot actually using the kvmclock as
> clocksource... when ideally I don't think it would even have *enabled*
> it at all?

Yeah, that's definitely the ideal state.  And I had all the same expectations and
observations as you when digging in and testing this.  But unless this series
makes things worse, I want punt on achieving the ideal state for the moment, as
it's proving to be a big lift just to get to a not-awful state.

> [    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.000000] tsc: Detected 2400.000 MHz processor
> [    0.008205] TSC deadline timer available
> [    0.008270] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [    0.159085] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> [    0.164074] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
> [    0.229087] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [    0.337095] clocksource: Switched to clocksource kvm-clock
> [    0.345246] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [    0.356201] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
> [    0.360560] clocksource: Switched to clocksource tsc
> 



^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Dipayaan Roy @ 2026-05-20 18:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260519225556.3939204-1-kuba@kernel.org>

On Tue, May 19, 2026 at 03:55:56PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Add NULL guards in teardown path to prevent panic on attach failure
>     
> When queue allocation fails partway through, the error cleanup frees
> and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
> pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
> and mana_dealloc_queues() to prevent NULL pointer dereferences.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9afc786b297a8..9e7e4bf526bf6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> >  	struct mana_rxq *rxq;
> >  	u32 rxq_idx;
> >  
> > -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > -		rxq = apc->rxqs[rxq_idx];
> > -		if (!rxq)
> > -			continue;
> > +	if (apc->rxqs) {
> >  
> > -		mana_destroy_rxq(apc, rxq, true);
> > -		apc->rxqs[rxq_idx] = NULL;
> > +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > +			rxq = apc->rxqs[rxq_idx];
> > +			if (!rxq)
> > +				continue;
> > +
> > +			mana_destroy_rxq(apc, rxq, true);
> > +			apc->rxqs[rxq_idx] = NULL;
> > +		}
> >  	}
> 
> If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
> how does the interface recover if the user subsequently brings it back up?
> 
> When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
> mana_alloc_queues() -> mana_add_rx_queues(), the code does:
> 
>     apc->rxqs[i] = rxq;
> 
> If apc->rxqs was left as NULL during the teardown, will this cause a NULL
> pointer dereference on the next interface bring-up?
> 
> [ ... ]
Hi Jakub,

The only path that recovers from this state is mana_attach(), which
calls mana_init_port() -> mana_init_port_context() to re-allocate
apc->rxqs before calling mana_alloc_queues().

When mana_open is invoked prior to it the rxqs would be already setup by
mana_probe_port -> mana_init_port -> mana_init_port_context.

> > @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> >  	 * number of queues.
> >  	 */
> >  
> > -	for (i = 0; i < apc->num_queues; i++) {
> > -		txq = &apc->tx_qp[i].txq;
> > -		tsleep = 1000;
> > -		while (atomic_read(&txq->pending_sends) > 0 &&
> > -		       time_before(jiffies, timeout)) {
> > -			usleep_range(tsleep, tsleep + 1000);
> > -			tsleep <<= 1;
> > -		}
> > -		if (atomic_read(&txq->pending_sends)) {
> > -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> > -			if (err) {
> > -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > -					   err, atomic_read(&txq->pending_sends),
> > -					   txq->gdma_txq_id);
> > +	if (apc->tx_qp) {
> 
> In the error path where queue creation fails partway through, will
> mana_destroy_txq() be called?
> 
> If mana_destroy_txq() unconditionally iterates over all apc->num_queues:
> 
>     for (i = 0; i < apc->num_queues; i++) {
>         ...
>         mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
>         ...
> 
> Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
> that were never fully initialized?
> 
> Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
> with these uninitialized handles, and could handle 0 corrupt the firmware
> state?
>
This is a known  exisiting issue in the cleaup path for partial tx
setups and was also mentioned in the a recent patch where the rx partial
init clean-up path was fixed. My colleague Aditya is already working on a
patch to fix this and all other issues in the tx cleanup path.
 
> > +		for (i = 0; i < apc->num_queues; i++) {
> > +			txq = &apc->tx_qp[i].txq;
> > +			tsleep = 1000;
> > +			while (atomic_read(&txq->pending_sends) > 0 &&
> > +			       time_before(jiffies, timeout)) {
> > +				usleep_range(tsleep, tsleep + 1000);
> > +				tsleep <<= 1;
> > +			}
> > +			if (atomic_read(&txq->pending_sends)) {
> > +				err =
> > +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));
> 
> If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
> internal GDMA state.
> 
> Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
> and mana_destroy_vport() are called. Will these functions attempt to send
> commands to the now-destroyed GDMA admin queue via mana_send_request() and
> wait for completions?
> 
> Could this lead to cascading timeouts since the hardware was just reset and
> cannot process admin commands?
> 
This is pre-existing behavior/issue the pcie_flr() path and the
subsequent calls to mana_disable_vport_rx()/mana_fence_rqs()/mana_destroy_vport()
are unchanged by this patch. My patch only adds a NULL guard for
apc->tx_qp to skip this entire TX drain block when tx_qp is already
freed. But this point seems to be valid issue, I will investigate this
further and I will send a different patch set for this as follow up work.


> > +				if (err) {
> > +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > +						   err,
> > +					    atomic_read(&txq->pending_sends),
> > +					    txq->gdma_txq_id);
> > +				}
> > +				break;
> >  			}
> > -			break;
> >  		}
> -- 
> pw-bot: cr

Regards
Dipayaan Roy


^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Dipayaan Roy @ 2026-05-20 18:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260519225559.3939235-1-kuba@kernel.org>

On Tue, May 19, 2026 at 03:55:59PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Skip redundant detach in queue reset handler if already detached
> 
> This commit adds a check for netif_device_present() in the queue reset
> handler. If the device is already detached, it skips the redundant
> mana_detach() call and proceeds directly to mana_attach().
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9e7e4bf526bf6..740f3c34db451 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
> [ ... ]
> > +	/* If already detached (indicates detach succeeded but attach failed
> > +	 * previously). Now skip mana detach and just retry mana_attach.
> > +	 */
> > +	if (!netif_device_present(ndev))
> > +		goto attach;
> > +
> >  	err = mana_detach(ndev, false);
> 
> The fix here correctly prevents mana_detach() from running on an already
> detached device. However, do identical issues remain in mana_change_mtu(),
> mana_set_channels(), mana_set_ringparam(), and mana_bpf()?
> 
> If mana_attach() previously failed (for example, due to out of memory),
> the port is left with apc->port_is_up = false, but it preserves
> apc->port_st_save = true, and calls netif_device_detach(ndev).
> 
> If an admin tries to recover the port by retrying a configuration command
> like changing the MTU or attaching XDP, these functions appear to
> unconditionally call mana_detach(ndev, false) without the
> !netif_device_present(ndev) check added here.
> 
> Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
> port_is_up is already false, port_st_save is overwritten to false,
> permanently erasing the intent that the port should be up.
> 
> The subsequent call to mana_attach() would then see port_st_save == false,
> skip mana_alloc_queues(), set port_is_up = false, and successfully call
> netif_device_attach(ndev).
> 
> Would this result in the configuration command returning 0 to userspace
> while the port queues were never allocated, leaving the interface silently
> broken until manually toggled down and up?

Hi Jakub,

I agree that mana_change_mtu() and mana_xdp_set() would also need the same
netif_device_present() guard (or alternatively, an early return
when the port is already detached). I'll investigate and address those in a
follow-up patch to keep this fix minimal, as this patch
specifically fixes the reset handler that runs without user intervention,
unlike the ohers cases mentioned above.

Regards
Dipayaan Roy




^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Jacob Pan @ 2026-05-20 18:27 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Jason Gunthorpe, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	mhklinux, tgopinath, easwar.hariharan, jacob.pan
In-Reply-To: <yqhb6vxoovscvfafgv6i6zn7uydpfxeff7hqmvbn6z7c2tjqp6@jn2vvtxqgfef>

Hi Yu,

On Wed, 20 May 2026 23:25:43 +0800
Yu Zhang <zhangyu1@linux.microsoft.com> wrote:

> > > +static const struct iommu_domain_ops
> > > hv_iommu_identity_domain_ops = {
> > > +	.attach_dev	= hv_iommu_attach_dev,
> > > +};
> > > +
> > > +static const struct iommu_domain_ops
> > > hv_iommu_blocking_domain_ops = {
> > > +	.attach_dev	= hv_iommu_attach_dev,
> > > +};  
> > 
> > Usually I would expect these to have their own attach
> > functions. blocking in particular must have an attach op that cannot
> > fail. It is used to recover the device back to a known translation
> > in case of cascading other errors.
> >   
> 
> For blocking domain, the hypercall handler of such attach essentially
> disable the translation and IOPF for the device.
I think this should disable all faults, including unrecoverable fault
reporting. right?

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: David Woodhouse @ 2026-05-20 18:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag32mwLvHpWn2vCt@google.com>

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On Wed, 2026-05-20 at 10:59 -0700, Sean Christopherson wrote:
> 
> > And then it even spent some time at boot actually using the kvmclock as
> > clocksource... when ideally I don't think it would even have *enabled*
> > it at all?
> 
> Yeah, that's definitely the ideal state.  And I had all the same expectations and
> observations as you when digging in and testing this.  But unless this series
> makes things worse, I want punt on achieving the ideal state for the moment, as
> it's proving to be a big lift just to get to a not-awful state.

Ack. Just checking my understanding was correct. Baby steps... 40 patches at a time :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: David Woodhouse @ 2026-05-20 18:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Kiryl Shutsemau,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <3cb683b4-973a-4b3e-a5d5-a8baa8a70eb0@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 907 bytes --]

On Sat, 2026-05-16 at 09:42 +0200, Paolo Bonzini wrote:
> On 5/15/26 21:19, Sean Christopherson wrote:
> > Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
> > doesn't restrict the usage to Intel CPUs.  This will allow sharing the
> > core logic with kvmclock, as (a) CPUID.0x16 may be enumerated alongside
> > kvmclock, and (b) KVM generally doesn't restrict CPUID based on vendor.
> 
> Even for native there's no real reason to restrict to Intel, I think.
> native_calibrate_tsc() only limits itself because historically (prior to 
> commit 604dc9170f24, "x86/tsc: Use CPUID.0x16 to calculate missing 
> crystal frequency", 2019-05-09) it used a hardcoded table of crystal 
> frequencies.
> 
> Of course paranoia applies, but for virtualization, if the leaf exists 
> there is no reason not to trust it.

Agreed.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 41/41] x86/kvmclock: Get CPU base frequency from CPUID when it's available
From: David Woodhouse @ 2026-05-20 18:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-42-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> If CPUID.0x16 is present and valid, use the CPU frequency provided by
> CPUID instead of assuming that the virtual CPU runs at the same
> frequency as TSC and/or kvmclock.  Back before constant TSCs were a
> thing, treating the TSC and CPU frequencies as one and the same was
> somewhat reasonable, but now it's nonsensical, especially if the
> hypervisor explicitly enumerates the CPU frequency.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 62c8ea2e6769..7607920ae386 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -190,6 +190,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
>  	}
>  }
>  
> +static unsigned long kvm_get_cpu_khz(void)
> +{
> +	unsigned int cpu_khz;
> +
> +	/*
> +	 * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> +	 * isn't necessarily the same as the kvmlock "TSC" frequency.
> +	 */
> +	if (!cpuid_get_cpu_freq(&cpu_khz))
> +		return cpu_khz;
> +
> +	return pvclock_tsc_khz(this_cpu_pvti());

I'm fine with this in principle but shouldn't the fallback be calling
kvm_get_tsc_khz() instead of directly calling pvclock_tsc_khz()?



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox