Linux-HyperV List


* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: Sean Christopherson @ 2026-05-15 23:04 UTC
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515195116.E0AFBC2BCB3@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Unintended clobbering of the global `cpu_khz` variable in `native_calibrate_tsc()` due to a missing local variable declaration.
> --
> 
> commit 36e45379d4c9f40ae24c5c323b4f0471c94c6bef
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/tsc: Add standalone helper for getting CPU frequency from CPUID
> 
> This commit extracts the logic for getting the CPU frequency via CPUID into
> a new standalone helper function. It also updates the TSC calibration code
> to use this new helper.
> 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> 
> [ ... ]
> 
> > @@ -754,13 +772,8 @@ unsigned long native_calibrate_tsc(void)
> >  	 * clock, but we can easily calculate it to a high degree of accuracy
> >  	 * by considering the crystal ratio and the CPU speed.
> >  	 */
> > -	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
> > -		unsigned int eax_base_mhz, ebx, ecx, edx;
> > -
> > -		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> > -		info.crystal_khz = eax_base_mhz * 1000 *
> > -			info.denominator / info.numerator;
> > -	}
> > +	if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
> > +		info.crystal_khz = cpu_khz * info.denominator / info.numerator;
> 
> Does this unintentionally clobber the global cpu_khz variable?

Yep.  I don't think I even realized there was such a global variable.  Probably
makes sense to use a different name for the local variables.
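
To illustrate the clobbering discussed above, here is a minimal user-space sketch. It is not the kernel code: the file-scope `cpu_khz` below only stands in for the kernel global, and `model_get_cpu_freq()`, `crystal_khz_clobbering()`, `crystal_khz_local()` and the name `base_khz` are hypothetical, chosen for illustration.

```c
/* Stand-in for the kernel's file-scope cpu_khz; the value is arbitrary. */
static unsigned int cpu_khz = 2400000;

/*
 * Models the shape of cpuid_get_cpu_freq(): stores the base frequency
 * in kHz through the pointer and returns 0 on success.
 */
static int model_get_cpu_freq(unsigned int *khz, unsigned int base_mhz)
{
	*khz = 0;
	if (!base_mhz)
		return -1;
	*khz = base_mhz * 1000;
	return 0;
}

/* Buggy shape: no local declaration, so &cpu_khz resolves to the global. */
static unsigned int crystal_khz_clobbering(unsigned int base_mhz,
					   unsigned int numerator,
					   unsigned int denominator)
{
	if (model_get_cpu_freq(&cpu_khz, base_mhz))
		return 0;
	return cpu_khz * denominator / numerator;
}

/* Fixed shape: a distinctly named local leaves the global untouched. */
static unsigned int crystal_khz_local(unsigned int base_mhz,
				      unsigned int numerator,
				      unsigned int denominator)
{
	unsigned int base_khz;

	if (model_get_cpu_freq(&base_khz, base_mhz))
		return 0;
	return base_khz * denominator / numerator;
}
```

Both functions return the same crystal frequency; only the first one overwrites the global as a side effect, which is why a distinct local name makes the mistake harder to repeat.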


* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-15 23:33 UTC
  To: Michael Kelley, Yu Zhang
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157F7758A127AA1E8096B6CD4042@SN6PR02MB4157.namprd02.prod.outlook.com>

From: Michael Kelley <mhklinux@outlook.com> Sent: Friday, May 15, 2026 11:00 AM
> 
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 9:24 AM
> >
> > On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > > >
> 
> [....]
> 
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > >
> > > I don't understand why __ffs() is correct here. I would expect
> > > __fls() so it is consistent with the calculation of size_align. But I
> > > can only surmise how the hypercall works since there's no
> > > documentation, so maybe my understanding of the hypercall is
> > > wrong.   If __ffs really is correct, a comment explaining why
> > > would help. :-)
> > >
> >
> > The use of __ffs() is intentional. Each flush entry invalidates a
> > naturally aligned 2^N page block, and the hypervisor requires the
> > page_number to be aligned to 2^page_mask_shift.
> >
> > Here __ffs() and __fls() serve different purposes:
> > - __ffs(start_pfn) is about the alignment constraint, i.e., how
> > large a block can this address support?
> > - __fls(nr_pages) is about the size constraint, i.e., how large
> > a block can the remaining range hold?
> >
> > Taking min() of both ensures each entry is both properly aligned
> > and within bounds.
> >
> > Thanks for raising this - it definitely deserves a comment. I had to
> > stare at it for a while myself to remember why. :)
> 
> Hmmm. Something about this still nags at me. I'll run some
> experiments to either convince myself that you are right, or to
> come up with a counterexample.

I have resolved what was nagging at me. My understanding of how
__ffs() and __fls() work was wrong. :-(  Your code is correct.

> 
> A related thought occurred to me. If each flush entry that is passed
> to Hyper-V describes a naturally aligned 2^N page block, I don't
> think the HV_IOMMU_MAX_FLUSH_VA_COUNT can ever
> be reached. The number of entries is limited by the number of
> bits in a PFN and the pages count, both of which are 64. And with
> 52 bit physical addressing and 4KiB pages, the actual size of
> a PFN and pages count is even smaller than 64.
> HV_IOMMU_MAX_FLUSH_VA_COUNT is the number of 8 byte
> union hv_iommu_flush_va entries that fit in a 4KiB page, so
> it's ~500.
> 
> My statement applies to a single flush range. If multiple flush
> ranges were strung together in a single hypercall, a larger count
> could be reached, but hv_flush_device_domain_list() only does
> a single range. So I don't think the overflow case in
> hv_flush_device_domain_list() can ever happen. But let me
> do my experiments, and I will also look at this aspect to confirm
> if it's right.

My experiments also convince me that the overflow case can't
happen as long as the hypercall is being populated with entries
derived from a single range.

Michael
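
For readers who want to repeat this kind of experiment, the decomposition discussed in this thread can be modeled in user space. The following is a sketch only, assuming a 64-bit `unsigned long` and a GCC/Clang compiler: `lowest_set_bit()` and `highest_set_bit()` are stand-ins for the kernel's `__ffs()` and `__fls()`, and `struct flush_entry` and `split_range()` are hypothetical names that mirror the loop in the quoted patch rather than the actual Hyper-V structures.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_ULONG ((unsigned int)(sizeof(unsigned long) * 8))

/* Stand-in for __ffs(): index of the lowest set bit; x must be nonzero. */
static unsigned int lowest_set_bit(unsigned long x)
{
	return (unsigned int)__builtin_ctzl(x);
}

/* Stand-in for __fls(): index of the highest set bit; x must be nonzero. */
static unsigned int highest_set_bit(unsigned long x)
{
	return BITS_PER_ULONG - 1 - (unsigned int)__builtin_clzl(x);
}

/* Hypothetical model of one flush entry: a naturally aligned 2^shift block. */
struct flush_entry {
	unsigned long page_number;
	unsigned int page_mask_shift;
};

/*
 * Split [start_pfn, end_pfn) into naturally aligned power-of-two blocks.
 * Returns the number of entries written, or SIZE_MAX if max_entries
 * would be exceeded.
 */
static size_t split_range(unsigned long start_pfn, unsigned long end_pfn,
			  struct flush_entry *out, size_t max_entries)
{
	unsigned long nr_pages = end_pfn - start_pfn;
	size_t count = 0;

	while (nr_pages > 0) {
		unsigned int pfn_align, size_align, order;

		if (count >= max_entries)
			return SIZE_MAX;

		/* Alignment constraint: largest block this address supports. */
		pfn_align = start_pfn ? lowest_set_bit(start_pfn)
				      : BITS_PER_ULONG - 1;
		/* Size constraint: largest block the remaining range holds. */
		size_align = highest_set_bit(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		out[count].page_number = start_pfn;
		out[count].page_mask_shift = order;
		count++;

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}

	return count;
}
```

For the range of PFNs 3 through 12 (`split_range(3, 13, ...)`), this sketch produces four blocks: PFN 3 (1 page), PFN 4 (4 pages), PFN 8 (4 pages), and PFN 12 (1 page). The block order grows while the start address is the limiting factor and shrinks once the remaining size is, which is consistent with the observation that a single range needs far fewer entries than fit in a 4KiB hypercall page.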

> 
> >
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}
> > > > +
> > > > +	return count;
> > > > +}
> > > > +
> > > > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > > > +					struct iommu_iotlb_gather *iotlb_gather)
> > > > +{
> > > > +	u64 status;
> > > > +	u16 count;
> > > > +	unsigned long flags;
> > > > +	struct hv_input_flush_device_domain_list *input;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +
> > > > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > > +	memset(input, 0, sizeof(*input));
> > > > +
> > > > +	input->device_domain = hv_domain->device_domain;
> > > > +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> > >
> > > I would suggest moving the memset() and setting the input fields down
> > > under the "else" below so that they are parallel with the flush all case.
> > >
> >
> > I agree the structure should be more symmetric. Yet I guess the memset and
> > hv_iommu_fill_iova_list() need to stay before the branch since the fill
> > writes directly into input->iova_list[]. :)
> 
> Agreed.
> 
> >
> > > > +	count = hv_iommu_fill_iova_list(input->iova_list,
> > > > +					iotlb_gather->start,
> > > > +					iotlb_gather->end);
> > > > +	if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > > > +		/*
> > > > +		 * Range exceeds hypercall page capacity. Fall back to a full
> > > > +		 * domain flush.
> > > > +		 */
> > > > +		struct hv_input_flush_device_domain *flush_all = (void *)input;
> > > > +
> > > > +		memset(flush_all, 0, sizeof(*flush_all));
> > > > +		flush_all->device_domain = hv_domain->device_domain;
> > > > +		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > > > +					flush_all, NULL);
> > > > +	} else {
> > > > +		status = hv_do_rep_hypercall(
> > > > +				HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > > > +				count, 0, input, NULL);
> > > > +	}
> > > > +
> > > > +	local_irq_restore(flags);
> > > > +
> > > > +	if (!hv_result_success(status))
> > > > +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);
> > >
> > > As Sashiko pointed out, a failure here can lead to all kinds of trouble because
> > > of leaving unflushed entries. Maybe a WARN() is more appropriate? Also, maybe
> > > a failure in the list flush should try a flush all as a fallback, with the WARN()
> > > only if the flush all fails.
> > >
> >
> > Good idea. How about we restructure this routine to something like this:
> >
> >
> > 	memset(input, 0, sizeof(*input));
> > 	count = hv_iommu_fill_iova_list(...);
> >
> > 	if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > 		input->device_domain = ...;
> > 		...
> > 		status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
> > 		if (hv_result_success(status))
> > 			goto out;
> > 	}
> >
> > 	/* overflow or list flush failed: fallback to full domain flush */
> > 	flush_all = (void *)input;
> > 	memset(flush_all, 0, sizeof(*flush_all));
> > 	flush_all->device_domain = ...;
> > 	status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
> > 	WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n", status);
> >
> > 	out:
> > 		local_irq_restore(flags);
> >
> 
> Yes, I think this works. But per my earlier comment, if I'm right that
> the overflow case never occurs, it could be simplified further to just
> do the list flush with the full flush as the error fallback. Then WARN
> if the full flush fails.
> 
> Michael
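
The simplified control flow suggested above (list flush first, full flush only as the error fallback, warn only if the full flush also fails) can be sketched in user space as follows. This is an illustration under assumptions, not driver code: `flush_fn`, `flush_with_fallback()`, the result enum, and the two stub callbacks are hypothetical, and `fprintf()` stands in for the kernel's `WARN()`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a hypercall wrapper: returns true on success. */
typedef bool (*flush_fn)(void *ctx);

enum flush_result { FLUSHED_LIST, FLUSHED_ALL, FLUSH_FAILED };

/* Try the page-selective flush; fall back to a full flush on failure. */
static enum flush_result flush_with_fallback(flush_fn list_flush,
					     flush_fn full_flush, void *ctx)
{
	if (list_flush(ctx))
		return FLUSHED_LIST;

	if (full_flush(ctx))
		return FLUSHED_ALL;

	/* Stands in for WARN(): both flush attempts failed. */
	fprintf(stderr, "IOTLB flush failed\n");
	return FLUSH_FAILED;
}

/* Stub callbacks used only to exercise the three outcomes. */
static bool stub_ok(void *ctx)
{
	(void)ctx;
	return true;
}

static bool stub_fail(void *ctx)
{
	(void)ctx;
	return false;
}
```

Calling `flush_with_fallback(stub_fail, stub_ok, NULL)` exercises the fallback path; a warning is emitted only when both callbacks report failure.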



* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Mukesh R @ 2026-05-16  0:11 UTC
  To: Yu Zhang, Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <fw2pruvjgo7yigtcxssf3xv27soibsj6hmw2ls5wj4rylfhdha@e63f32cwu2x5>

On 5/15/26 09:53, Yu Zhang wrote:
> On Fri, May 15, 2026 at 02:51:38PM +0000, Michael Kelley wrote:
>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
>>>
>>> On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
>>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
>>>>>
>>>>> Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
>>>>> This driver implements stage-1 IO translation within the guest OS.
>>>>> It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
>>>>> for:
>>>>>   - Capability discovery
>>>>>   - Domain allocation, configuration, and deallocation
>>>>>   - Device attachment and detachment
>>>>>   - IOTLB invalidation
>>>>>
>>>>> The driver constructs x86-compatible stage-1 IO page tables in the
>>>>> guest memory using consolidated IO page table helpers. This allows
>>>>> the guest to manage stage-1 translations independently of vendor-
>>>>> specific drivers (like Intel VT-d or AMD IOMMU).
>>>>>
>>>>> Hyper-V consumes this stage-1 IO page table when a device domain is
>>>>> created and configured, and nests it with the host's stage-2 IO page
>>>>> tables, therefore eliminating the VM exits for guest IOMMU mapping
>>>>> operations. For unmapping operations, VM exits to perform the IOTLB
>>>>> flush are still unavoidable.
>>>>>
>>>>> Hyper-V identifies each PCI pass-thru device by a logical device ID
>>>>> in its hypercall interface. The vPCI driver (pci-hyperv) registers the
>>>>> per-bus portion of this ID with the pvIOMMU driver during bus probe.
>>>>> The pvIOMMU driver stores this mapping and combines it with the function
>>>>> number of the endpoint PCI device to form the complete ID for hypercalls.
>>>>
>>>> As you are probably aware, Mukesh's patch series to support PCI
>>>> pass-thru devices also needs to get the logical device ID. Maybe the
>>>> registration mechanism needs to move somewhere that can be shared
>>>> with his code.
>>>>
>>>
>>> Thank you so much for the review, Michael!
>>>
>>> Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
>>> in pci-hyperv.c has the same dev_instance byte manipulation. We do need
>>> a common registration mechanism.
>>>
>>> Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
>>> natural place, but the register/lookup functions are currently only
>>> meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
>>> code also needs them, we might need a new shared Kconfig option that
>>> both can select. Open to better ideas.
>>
>> Unfortunately, I have not looked at Mukesh's series in detail yet, so
>> I don't have enough knowledge of the full situation to offer a good
>> recommendation.
>>
> 
> Sorry I forgot to Cc Mukesh in the previous reply. :(
> @Mukesh, any thoughts on sharing the logical device ID registration mechanism?

Yeah, I went round and round trying to find the best place. I almost
created a virt/hyperv/hv_utils.c file. Maybe that is the best place?

Thanks,
-Mukesh


>>>
>>> [...]
>>>
>>>>> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
>>>>> +{
>>>>> +	u64 status;
>>>>> +	unsigned long flags;
>>>>> +	struct hv_input_flush_device_domain *input;
>>>>> +
>>>>> +	local_irq_save(flags);
>>>>> +
>>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>> +	memset(input, 0, sizeof(*input));
>>>>> +	input->device_domain = hv_domain->device_domain;
>>>>
>>>> The previous version of this patch had code to set several other fields in
>>>> the input. I wanted to confirm that not setting them in this version is
>>>> intentional. Were they not needed?
>>>>
>>>
>>> Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
>>> individually. In this version, I just simplified it to a struct assignment.
>>> No functional change.
>>
>> Of course! I should have looked more closely at the details before making
>> this comment. :-(
>>
>> [...]
>>
>>>>
>>>> Previous versions of this function did hv_iommu_detach_dev(). With that call
>>>> removed from here, hv_iommu_detach_dev() is only called when attaching a
>>>> domain to a device that already has a domain attached. Is it the case that
>>>> Hyper-V doesn't require the detach as a cleanup step?
>>>>
>>>
>>> The IOMMU core attaches the device to release_domain (our blocking domain)
>>> before calling release_device(), so I believe the explicit detach in the RFC
>>> was redundant. I simply didn't realize that at the time.
>>>
>>
>> Got it. But after the IOMMU core attaches the device to the blocking
>> domain, there's the possibility that the vPCI device is rescinded by
>> Hyper-V and it goes away entirely. Or the device might be subjected
>> to an "unbind/bind" cycle in Linux. Does the detach need to be done
>> on the blocking domain in such cases? In this version of the patches, the
>> Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
>> seems a bit untidy at best, and I wonder if there are scenarios where
>> Hyper-V will complain about the lack of balance.
>>
> 
> Thank you, Michael. May I ask what "the vPCI device is rescinded by
> Hyper-V and it goes away entirely" means?
> 
> I realize it's a bit untidy. But I want to understand this issue more
> clearly first. :)
> 
> B.R.
> Yu



* Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: sashiko-bot @ 2026-05-16  4:05 UTC
  To: Long Li; +Cc: linux-hyperv
In-Reply-To: <20260515040508.491748-2-longli@microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Missing cross-port validation in `mana_ib_create_qp_rss()` allows userspace to trigger NULL pointer dereferences and corrupt other ports' queues.
- [High] Use of boolean flag `channel_changing` as an ad-hoc lock in `mana_set_channels()` violates networking subsystem guidelines.
--

commit ed2a523d0427d1b949add95925cad26c8b468ff8
Author: Long Li <longli@microsoft.com>
Date:   Fri May 15 04:05:03 2026 +0000

    net: mana: Create separate EQs for each vPort

This commit separates EQs for each vPort, removing the shared EQ mechanism
from the MANA driver to prepare for dedicated MSI-X vector assignment.
RSS QPs are updated to take a vport reference, ensuring EQs outlive QP
consumers.

> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134b..108ec4c5ce51c 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
[ ... ]
> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  
>  	qp->port = port;
>  
> +	/* Take a reference on the vport to ensure EQs outlive this QP.
> +	 * The vport must already be configured by a raw QP.
> +	 */
> +	mutex_lock(&mana_pd->vport_mutex);
> +	if (!mana_pd->vport_use_count) {
> +		mutex_unlock(&mana_pd->vport_mutex);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	mana_pd->vport_use_count++;
> +	mutex_unlock(&mana_pd->vport_mutex);
> +

Does this code need to verify that the requested port matches
mana_pd->vport_port?

While mana_ib_cfg_vport() correctly rejects cross-port PD sharing for
raw QPs, an RSS QP creation could specify a different port (e.g., Port B)
using the same PD bound to Port A.

If Port B is unconfigured, could fetching its mpc result in a NULL
pointer dereference when accessing mpc->eqs below?

If Port B is active, could this lead to erroneously destroying Port B's EQs
or underflowing its vport_use_count when the PD drops its references during
teardown?
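
As an illustration of the check being asked about, here is a minimal user-space sketch of taking a vport reference only when the requested port matches the port the PD is bound to. It is not the mana_ib code: `struct pd_model` and `take_vport_ref()` are hypothetical, the field names merely echo `vport_use_count` and `vport_port` from the review, and locking is omitted for brevity.

```c
#include <errno.h>

/* Hypothetical model of the per-PD vport state mentioned in the review. */
struct pd_model {
	unsigned int vport_use_count;	/* references held on the vport */
	unsigned int vport_port;	/* port the PD was configured for */
};

/*
 * Take a reference only if the vport is already configured and the
 * requested port is the one this PD is bound to.
 */
static int take_vport_ref(struct pd_model *pd, unsigned int port)
{
	if (!pd->vport_use_count)
		return -EINVAL;		/* vport not configured yet */

	if (pd->vport_port != port)
		return -EINVAL;		/* cross-port request rejected */

	pd->vport_use_count++;
	return 0;
}
```

Whether the driver should reject the request this way, or handle cross-port use differently, is a decision for the patch author; the sketch only shows the shape of the validation.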

>  	for (i = 0; i < ind_tbl_size; i++) {
>  		struct mana_obj_spec wq_spec = {};
[ ... ]
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e3..4633acc976f0a 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
[ ... ]
> @@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
>  	unsigned int old_count = apc->num_queues;
>  	int err;
>  
> +	/* Set channel_changing to block RDMA from grabbing the vport
> +	 * during the detach/attach window. mana_cfg_vport() checks
> +	 * this flag under vport_mutex and returns -EBUSY if set.
> +	 */
> +	mutex_lock(&apc->vport_mutex);
> +	if (!apc->port_is_up && apc->vport_use_count) {
> +		mutex_unlock(&apc->vport_mutex);
> +		return -EBUSY;
> +	}
> +	apc->channel_changing = true;
> +	mutex_unlock(&apc->vport_mutex);
> +

Is channel_changing acting as an ad-hoc lock here?

The networking subsystem guidelines discourage using boolean flags set on
entry and cleared on exit to guard code sections against concurrent access.

Since mana_detach() and mana_attach() are executed while this flag is held
to prevent mana_cfg_vport() from proceeding, could this be replaced with a
standard locking mechanism to ensure lockdep visibility?

>  	err = mana_pre_alloc_rxbufs(apc, ndev->mtu, new_count);
>  	if (err) {
>  		netdev_err(ndev, "Insufficient memory for new allocations");
> -		return err;
> +		goto clear_flag;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515040508.491748-1-longli@microsoft.com?part=1


* Re: [PATCH net-next v10 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: sashiko-bot @ 2026-05-16  4:05 UTC
  To: Long Li; +Cc: linux-hyperv
In-Reply-To: <20260515040508.491748-3-longli@microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Per-vPort queue counts are not clamped across device resume, leading to MSI-X vector exhaustion and resume failure if available interrupts shrink.
--

commit 5d1fd25db612c4024c6a414d322389f154cfe998
Author: Long Li <longli@microsoft.com>
Date:   Fri May 15 04:05:04 2026 +0000

    net: mana: Query device capabilities and configure MSI-X sharing for EQs

This commit adjusts the maximum number of queues to allow dedicated MSI-X
vectors for each vPort, calculating per-vPort limits based on hardware
capabilities. It enables MSI-X sharing dynamically when there are not
enough vectors for dedicated allocation and ensures a minimum of one port
during probe.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 18f8f653da3d4..d23b856f48f6f 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c

[ ... ]

> @@ -3426,9 +3431,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
>  	apc = netdev_priv(ndev);
>  	apc->ac = ac;
>  	apc->ndev = ndev;
> -	apc->max_queues = gc->max_num_queues;
> +	apc->max_queues = gc->max_num_queues_vport;
>  	/* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
> -	apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
> +	apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
>  	apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;

Does this code handle clamping apc->num_queues if the available MSI-X
vectors decrease during a suspend/resume cycle?

During a device resume, mana_gd_query_max_resources() might recalculate
a smaller gc->max_num_queues_vport if fewer MSI-X vectors are provided
by the system.

Since apc->num_queues is set here during the initial probe, it appears
to retain its previously configured size on resume. When mana_attach()
is called in the resume path for existing ports, could it try to allocate
the old, larger number of vectors from the newly undersized msi_bitmap,
exhausting the bitmap and failing the resume with -ENOMEM?
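
One possible shape for the clamping the question alludes to is sketched below in user space. It is an assumption-laden illustration, not the MANA driver: `struct port_model` and `clamp_queues_on_resume()` are hypothetical, and the fields only echo `num_queues` and `max_queues` from the quoted diff.

```c
/* Hypothetical model of the per-port queue configuration. */
struct port_model {
	unsigned int num_queues;	/* currently configured queue count */
	unsigned int max_queues;	/* per-vPort limit */
};

/*
 * After a resume recomputes the per-vPort limit, adopt the new limit and
 * shrink the configured count if it no longer fits.
 */
static void clamp_queues_on_resume(struct port_model *port,
				   unsigned int new_max_per_vport)
{
	port->max_queues = new_max_per_vport;
	if (port->num_queues > new_max_per_vport)
		port->num_queues = new_max_per_vport;
}
```

With a port configured for 16 queues and a recomputed limit of 4, the sketch reduces the configured count to 4; a count already within the new limit is left unchanged.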

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515040508.491748-1-longli@microsoft.com?part=2


* Re: [PATCH net-next v10 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
From: sashiko-bot @ 2026-05-16  4:05 UTC
  To: Long Li; +Cc: linux-hyperv
In-Reply-To: <20260515040508.491748-6-longli@microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The RDMA driver (mana_ib) fails to initialize because it does not allocate GIC references for its EQs.
--

commit 9f615e81497dbea890c8df0a9aa9efd643e9a6a7
Author: Long Li <longli@microsoft.com>

net: mana: Allocate interrupt context for each EQ when creating vPort

This commit updates the driver to use GIC functions to create a dedicated
interrupt context or acquire a shared interrupt context for each EQ when
setting up a vPort. The caller is now responsible for acquiring and
releasing the GIC reference across the EQ create/destroy lifecycle.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c

[ ... ]

> @@ -1684,12 +1690,22 @@ int mana_create_eq(struct mana_port_context *apc)
>  		debugfs_create_dir("EQs", apc->mana_port_debugfs);
>  
>  	for (i = 0; i < apc->num_queues; i++) {
> -		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
> +		msi = (i + 1) % gc->num_msix_usable;
> +
> +		gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);
> +		if (!gic) {
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		spec.eq.msix_index = msi;

Will this change to the initialization logic break the RDMA driver (mana_ib)?

Since the caller is now required to manually acquire a GIC reference via
mana_gd_get_gic() before creating EQs, and mana_ib hasn't been updated to
do this, won't it trigger a WARN_ON(!gic) inside mana_gd_register_irq() and
fail device initialization?

I noticed this was fixed later in the patch series by commit 'RDMA/mana_ib:
Allocate interrupt contexts on EQs', but could this break bisectability here?

> +
>  		err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260515040508.491748-1-longli@microsoft.com?part=5


* Re: [PATCH v3 40/41] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: Paolo Bonzini @ 2026-05-16  7:42 UTC
  To: Sean Christopherson, Kiryl Shutsemau, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, Thomas Gleixner,
	John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-41-seanjc@google.com>

On 5/15/26 21:19, Sean Christopherson wrote:
> Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
> doesn't restrict the usage to Intel CPUs.  This will allow sharing the
> core logic with kvmclock, as (a) CPUID.0x16 may be enumerated alongside
> kvmclock, and (b) KVM generally doesn't restrict CPUID based on vendor.

Even for native there's no real reason to restrict to Intel, I think. 
native_calibrate_tsc() only limits itself because historically (prior to 
commit 604dc9170f24, "x86/tsc: Use CPUID.0x16 to calculate missing 
crystal frequency", 2019-05-09) it used a hardcoded table of crystal 
frequencies.

Of course paranoia applies, but for virtualization, if the leaf exists 
there is no reason not to trust it.

Thanks,

Paolo

> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/tsc.h |  1 +
>   arch/x86/kernel/tsc.c      | 37 +++++++++++++++++++++++--------------
>   2 files changed, 24 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index f458be688512..c145f5707b52 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -91,6 +91,7 @@ struct cpuid_tsc_info {
>   };
>   extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
>   extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
> +extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
>   
>   extern void tsc_early_init(void);
>   extern void tsc_init(void);
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 1b569954ae5e..745fa2052c74 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -719,6 +719,24 @@ int cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
>   	return 0;
>   }
>   
> +int cpuid_get_cpu_freq(unsigned int *cpu_khz)
> +{
> +	unsigned int eax_base_mhz, ebx, ecx, edx;
> +
> +	*cpu_khz = 0;
> +
> +	if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
> +		return -ENOENT;
> +
> +	cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> +
> +	if (!eax_base_mhz)
> +		return -ENOENT;
> +
> +	*cpu_khz = eax_base_mhz * 1000;
> +	return 0;
> +}
> +
>   /**
>    * native_calibrate_tsc - determine TSC frequency
>    * Determine TSC frequency via CPUID, else return 0.
> @@ -754,13 +772,8 @@ unsigned long native_calibrate_tsc(void)
>   	 * clock, but we can easily calculate it to a high degree of accuracy
>   	 * by considering the crystal ratio and the CPU speed.
>   	 */
> -	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
> -		unsigned int eax_base_mhz, ebx, ecx, edx;
> -
> -		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> -		info.crystal_khz = eax_base_mhz * 1000 *
> -			info.denominator / info.numerator;
> -	}
> +	if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
> +		info.crystal_khz = cpu_khz * info.denominator / info.numerator;
>   
>   	if (!info.crystal_khz)
>   		return 0;
> @@ -787,19 +800,15 @@ unsigned long native_calibrate_tsc(void)
>   
>   static unsigned long cpu_khz_from_cpuid(void)
>   {
> -	unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
> +	unsigned int cpu_khz;
>   
>   	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>   		return 0;
>   
> -	if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
> +	if (cpuid_get_cpu_freq(&cpu_khz))
>   		return 0;
>   
> -	eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
> -
> -	cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
> -
> -	return eax_base_mhz * 1000;
> +	return cpu_khz;
>   }
>   
>   /*



* [PATCH 0/2] drm/hyperv: harden VMBus message parser input validation
From: Berkant Koc @ 2026-05-17 12:55 UTC
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, Wei Liu, Michael Kelley,
	Thomas Zimmermann, Maarten Lankhorst, Maxime Ripard, Deepak Rawat

The hyperv synthetic video driver parses VMBus messages from the host
without bounding two host-controlled values that feed into fixed-size
buffers. Both items are input validation, not security bugs: the
Hyper-V host sits inside the trusted compute base under the default
Hyper-V threat-model. The patches still trim the inputs the driver
accepts at face value, matching the trajectory drivers/hv/ has
followed for Confidential-VMBus work where the host is no longer
fully trusted.

Patch 1 bounds resolution_count against
supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT]; the existing
default_resolution_index check is bypassable when both values
exceed 64.

Patch 2 forwards bytes_recvd from vmbus_recvpacket() into the
sub-handler so that vid_hdr.type and feature_chg.is_dirt_needed
are only read once the host actually delivered enough bytes, and
so that the init_buf memcpy uses the received length.

Sending as a plain patch series, not a security disclosure.
Compile-tested against drm-fixes (6916d5703ddf), static-only.

Berkant Koc (2):
  drm/hyperv: validate resolution_count from host VMBus message
  drm/hyperv: validate VMBus packet size in receive callback

 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
-- 
2.47.3


* [PATCH 1/2] drm/hyperv: validate resolution_count from host VMBus message
From: Berkant Koc @ 2026-05-17 12:55 UTC
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, Wei Liu, Michael Kelley,
	Thomas Zimmermann, Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <20260517-drm-hyperv-cover@berkoc.com>

The synthetic video device receives a SYNTHVID_RESOLUTION_RESPONSE
containing a u8 resolution_count and a u8 default_resolution_index
from the host. The existing check rejects resolution_count == 0 and
rejects an index that is greater than or equal to resolution_count, but
does not bound resolution_count itself against the fixed
supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT] array. A host
that returns resolution_count > 64 together with an in-range
default_resolution_index causes the subsequent loop to read past
the array.

Reject any resolution_count that exceeds SYNTHVID_MAX_RESOLUTION_COUNT,
folded into the existing zero-check for one log entry per failure.
This matches the input-validation pattern used by other VMBus parsers
under drivers/hv/ and trims one host-controlled value from the trusted
path.

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526832..003bb118d64c 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		return -ETIMEDOUT;
 	}

-	if (msg->resolution_resp.resolution_count == 0) {
-		drm_err(dev, "No supported resolutions\n");
+	if (msg->resolution_resp.resolution_count == 0 ||
+	    msg->resolution_resp.resolution_count >
+	    SYNTHVID_MAX_RESOLUTION_COUNT) {
+		drm_err(dev, "Invalid resolution count: %d\n",
+			msg->resolution_resp.resolution_count);
 		return -ENODEV;
 	}

-- 
2.47.3

^ permalink raw reply related
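
For readers following the thread, the check added by the patch above can be
sketched in userspace roughly as follows. This is only an illustration: the
struct layout, the names toy_resolution_resp and toy_validate_resolution_resp,
and the literal value 64 are simplified assumptions taken from the commit
message, not the driver's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed array bound; the commit message speaks of "resolution_count > 64". */
#define TOY_MAX_RESOLUTION_COUNT 64

/* Hypothetical, simplified stand-in for the host's resolution response. */
struct toy_resolution_resp {
	uint8_t resolution_count;
	uint8_t default_resolution_index;
	struct {
		uint16_t width;
		uint16_t height;
	} supported[TOY_MAX_RESOLUTION_COUNT];
};

/*
 * Return 0 if both host-controlled values are safe to use as a loop
 * bound and as an array index, -1 otherwise.
 */
static int toy_validate_resolution_resp(const struct toy_resolution_resp *resp)
{
	/* Reject an empty list and a count larger than the fixed array. */
	if (resp->resolution_count == 0 ||
	    resp->resolution_count > TOY_MAX_RESOLUTION_COUNT)
		return -1;

	/* The index check alone is bypassable when both values exceed 64. */
	if (resp->default_resolution_index >= resp->resolution_count)
		return -1;

	return 0;
}

/* Usage: a count of 200 with index 100 passes the index check but not the bound. */
static void toy_validate_demo(void)
{
	struct toy_resolution_resp resp = { 0 };

	resp.resolution_count = 200;
	resp.default_resolution_index = 100;
	assert(toy_validate_resolution_resp(&resp) == -1);
}
```

The point of the sketch is the ordering: the count is bounded against the
array before it is trusted as the upper limit for the index.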

* [PATCH 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-17 12:55 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, Wei Liu, Michael Kelley,
	Thomas Zimmermann, Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <20260517-drm-hyperv-cover@berkoc.com>

hyperv_receive() reads bytes_recvd from vmbus_recvpacket() but does not
forward that value to hyperv_receive_sub(). The sub-handler reads
msg->vid_hdr.type and msg->feature_chg.is_dirt_needed without knowing
how many bytes the host actually wrote, so a short packet leaves the
parser reading bytes that the host did not write in this packet. The
unconditional memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE) on the
wait-completion path also copies the full 16 KiB recv_buf regardless
of bytes_recvd, including any residue from the prior message.

Pass bytes_recvd into hyperv_receive_sub() and reject any packet shorter
than the pipe + synthvid header. The dirt-feature branch additionally
requires the feature_change payload size before reading is_dirt_needed.
The init_buf copy now uses bytes_recvd as the length argument, which
keeps it bounded by VMBUS_MAX_PACKET_SIZE through the new upper check.

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 003bb118d64c..0d32d9944c43 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -420,26 +420,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev)
+static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
 	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
+	size_t hdr_size;
 
 	if (!hv)
 		return;
 
+	hdr_size = sizeof(struct pipe_msg_hdr) +
+		   sizeof(struct synthvid_msg_hdr);
+	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)
+		return;
+
 	msg = (struct synthvid_msg *)hv->recv_buf;
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
-		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
+		memcpy(hv->init_buf, msg, bytes_recvd);
 		complete(&hv->wait);
 		return;
 	}
 
 	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
+		if (bytes_recvd < hdr_size +
+		    sizeof(struct synthvid_feature_change))
+			return;
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
 			hyperv_hide_hw_ptr(hv->hdev);
@@ -466,7 +475,7 @@ static void hyperv_receive(void *ctx)
 				       &bytes_recvd, &req_id);
 		if (bytes_recvd > 0 &&
 		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
-			hyperv_receive_sub(hdev);
+			hyperv_receive_sub(hdev, bytes_recvd);
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH 1/2] drm/hyperv: validate resolution_count from host VMBus message
From: sashiko-bot @ 2026-05-17 13:49 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <20260517-drm-hyperv-patch1@berkoc.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Uninitialized State on Error Path Leaves Display Unusable
- [High] Missing Completion Re-Initialization Breaks Driver Synchronization
--

commit 94c5684ae582179bbf9c456d94f6a0bd0f92785f
Author: Berkant Koc <me@berkoc.com>
Subject: drm/hyperv: validate resolution_count from host VMBus message

This patch adds validation to ensure the host-provided resolution_count does
not exceed SYNTHVID_MAX_RESOLUTION_COUNT. It folds this check into the
existing zero-count check to prevent reading past the end of the supported
resolution array.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc5268321..003bb118d64c9 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		return -ETIMEDOUT;
>  	}

This isn't a bug in your patch, but looking at the wait_for_completion_timeout
call just before this hunk, does the driver reuse the hv->wait completion
structure for multiple sequential VMBus requests without calling
reinit_completion()?

If a previous request times out but the host sends a delayed response, the
receive callback might execute complete(&hv->wait). Could a subsequent
request's wait_for_completion_timeout incorrectly consume this stray
completion, return immediately, and parse stale data from hv->init_buf?

>  
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {
> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;

This is also a pre-existing issue, but I noticed that returning an error
here can leave the display unusable.

When hyperv_get_supported_resolution() fails due to this new bounds check,
the caller hyperv_connect_vsp() logs a warning to use defaults, but it
skips setting up the resolution limits. As a result, hv->screen_width_max
and hv->screen_height_max remain 0.

Later, hyperv_mode_config_init() sets dev->mode_config.max_width and
max_height to 0. Since the DRM core's drm_internal_framebuffer_create()
strictly bounds framebuffer dimensions to dev->mode_config.max_width,
would this cause all userspace attempts to create a framebuffer to fail
with -EINVAL?

>  	}
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517-drm-hyperv-cover@berkoc.com?part=1

^ permalink raw reply
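
The stray-completion question raised in the review above can be modelled with
a toy, single-threaded counter. This is a sketch under stated assumptions: the
toy_* names are hypothetical, there is no real waiting or locking, and the
"wait" here stands for a wait whose timeout has already elapsed. It only shows
why a leftover complete() satisfies the next wait unless the counter is reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a completion: just the pending-completion counter. */
struct toy_completion {
	unsigned int done;
};

static void toy_complete(struct toy_completion *c)
{
	c->done++;
}

static void toy_reinit(struct toy_completion *c)
{
	c->done = 0;
}

/* Succeeds only if a completion is already pending, and consumes it. */
static bool toy_try_wait(struct toy_completion *c)
{
	if (c->done == 0)
		return false;
	c->done--;
	return true;
}

/* Returns true if request 2's wait is satisfied by request 1's late reply. */
static bool toy_stale_wakeup(bool reinit_before_send)
{
	struct toy_completion wait = { 0 };

	/* Request 1: nothing arrives in time, so the wait gives up. */
	(void)toy_try_wait(&wait);

	/* The reply to request 1 shows up late. */
	toy_complete(&wait);

	/* Request 2 is prepared and sent; no reply to it has arrived yet. */
	if (reinit_before_send)
		toy_reinit(&wait);

	return toy_try_wait(&wait);
}

/* Usage: without the reset the stale completion leaks into request 2. */
static void toy_stale_wakeup_demo(void)
{
	assert(toy_stale_wakeup(false));
	assert(!toy_stale_wakeup(true));
}
```

The model covers only a late reply that lands before the reset; a reply that
arrives after the reset is a separate ordering problem that this sketch does
not address.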

* [PATCH v2 0/2] drm/hyperv: harden VMBus message parser input validation
From: Berkant Koc @ 2026-05-17 14:25 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <20260517-drm-hyperv-cover@berkoc.com>

v2 folds two further issues into patch 1 that the sashiko-bot review
pointed out on v1:

  1. The resolution_count bounds check in v1 returned -ENODEV, but
     hyperv_connect_vsp() only logged a warning and continued without
     setting hv->screen_width_max / height_max / preferred_*. That
     left dev->mode_config.max_width and max_height at 0, which made
     drm_internal_framebuffer_create() reject every userspace
     framebuffer with -EINVAL. v2 falls back to the WIN8 defaults on
     that error path, matching the pre-WIN10 branch.

  2. The three sequential VSP requests in hyperv_connect_vsp()
     (negotiate version, update VRAM location, get supported
     resolution) all wait on the same hv->wait completion without
     calling reinit_completion() between requests. A delayed
     complete() after a wait_for_completion_timeout() can leak into
     the next request and let it parse stale data out of
     hv->init_buf. v2 calls reinit_completion() before each send.

Patch 2 is unchanged from v1.

v1: https://lore.kernel.org/r/20260517-drm-hyperv-cover@berkoc.com

Berkant Koc (2):
  drm/hyperv: validate resolution_count and harden VSP request paths
  drm/hyperv: validate VMBus packet size in receive callback

 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 32 ++++++++++++++++++-----
 1 file changed, 26 insertions(+), 6 deletions(-)


base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
-- 
2.47.3


^ permalink raw reply

* [PATCH v2 1/2] drm/hyperv: validate resolution_count and harden VSP request paths
From: Berkant Koc @ 2026-05-17 14:25 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat, stable
In-Reply-To: <20260517-drm-hyperv-cover-v2@berkoc.com>

The synthetic video device parses a SYNTHVID_RESOLUTION_RESPONSE that
contains a u8 resolution_count and a u8 default_resolution_index. The
existing check rejects resolution_count == 0 and an index greater than
or equal to resolution_count, but does not bound resolution_count itself
against the fixed supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT]
array. A host that returns resolution_count > 64 together with an
in-range default_resolution_index causes the subsequent loop to read
past the array. Reject any resolution_count that exceeds the array
bound, folded into the existing zero-check so a single log entry
remains per failure.

When that bounds check (or any later failure in
hyperv_get_supported_resolution()) returns an error, the caller in
hyperv_connect_vsp() previously logged a warning and continued without
populating hv->screen_width_max / hv->screen_height_max / preferred_*.
hyperv_mode_config_init() then set dev->mode_config.max_width and
max_height to 0, which makes drm_internal_framebuffer_create() reject
every userspace framebuffer with -EINVAL. Populate the fields with the
WIN8 defaults that the pre-WIN10 branch already uses so a failed
resolution probe degrades to a usable display instead of disabling it.

The driver also issues three sequential VSP requests (negotiate
version, update VRAM location, get supported resolution) that share a
single hv->wait completion. None of the call sites call
reinit_completion() between requests. If wait_for_completion_timeout()
returns 0 but a delayed response later triggers complete(&hv->wait) in
the receive callback, the next request's wait can consume that stale
completion, return immediately, and parse stale data out of
hv->init_buf. Call reinit_completion() before each send so every
request waits for its own response.

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526832..3b5065fe06e4 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -223,6 +223,7 @@ static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
 	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
 		sizeof(struct synthvid_version_req);
 	msg->ver_req.version = ver;
+	reinit_completion(&hv->wait);
 	hyperv_sendpacket(hdev, msg);

 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
@@ -257,6 +258,7 @@ int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
 	msg->vram.user_ctx = vram_pp;
 	msg->vram.vram_gpa = vram_pp;
 	msg->vram.is_vram_gpa_specified = 1;
+	reinit_completion(&hv->wait);
 	hyperv_sendpacket(hdev, msg);

 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
@@ -383,6 +385,7 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		sizeof(struct synthvid_supported_resolution_req);
 	msg->resolution_req.maximum_resolution_count =
 		SYNTHVID_MAX_RESOLUTION_COUNT;
+	reinit_completion(&hv->wait);
 	hyperv_sendpacket(hdev, msg);

 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
@@ -391,8 +394,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		return -ETIMEDOUT;
 	}

-	if (msg->resolution_resp.resolution_count == 0) {
-		drm_err(dev, "No supported resolutions\n");
+	if (msg->resolution_resp.resolution_count == 0 ||
+	    msg->resolution_resp.resolution_count >
+	    SYNTHVID_MAX_RESOLUTION_COUNT) {
+		drm_err(dev, "Invalid resolution count: %d\n",
+			msg->resolution_resp.resolution_count);
 		return -ENODEV;
 	}

@@ -506,8 +512,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)

 	if (hyperv_version_ge(hv->synthvid_version, SYNTHVID_VERSION_WIN10)) {
 		ret = hyperv_get_supported_resolution(hdev);
-		if (ret)
+		if (ret) {
 			drm_err(dev, "Failed to get supported resolution from host, use default\n");
+			hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
+			hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
+			hv->preferred_width = SYNTHVID_WIDTH_WIN8;
+			hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
+		}
 	} else {
 		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
 		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
-- 
2.47.3

^ permalink raw reply related

* [PATCH v2 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-17 14:25 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat, stable
In-Reply-To: <20260517-drm-hyperv-cover-v2@berkoc.com>

hyperv_receive() reads bytes_recvd from vmbus_recvpacket() but does not
forward that value to hyperv_receive_sub(). The sub-handler reads
msg->vid_hdr.type and msg->feature_chg.is_dirt_needed without knowing
how many bytes the host actually wrote, so a short packet leaves the
parser reading bytes that the host did not write in this packet. The
unconditional memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE) on the
wait-completion path also copies the full 16 KiB recv_buf regardless
of bytes_recvd, including any residue from the prior message.

Pass bytes_recvd into hyperv_receive_sub() and reject any packet shorter
than the pipe + synthvid header. The dirt-feature branch additionally
requires the feature_change payload size before reading is_dirt_needed.
The init_buf copy now uses bytes_recvd as the length argument, which
keeps it bounded by VMBUS_MAX_PACKET_SIZE through the new upper check.

Unchanged from v1.

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 3b5065fe06e4..cdab4895dd40 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -423,26 +423,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev)
+static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
 	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
+	size_t hdr_size;
 
 	if (!hv)
 		return;
 
+	hdr_size = sizeof(struct pipe_msg_hdr) +
+		   sizeof(struct synthvid_msg_hdr);
+	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)
+		return;
+
 	msg = (struct synthvid_msg *)hv->recv_buf;
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
-		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
+		memcpy(hv->init_buf, msg, bytes_recvd);
 		complete(&hv->wait);
 		return;
 	}
 
 	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
+		if (bytes_recvd < hdr_size +
+		    sizeof(struct synthvid_feature_change))
+			return;
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
 			hyperv_hide_hw_ptr(hv->hdev);
@@ -469,7 +478,7 @@ static void hyperv_receive(void *ctx)
 				       &bytes_recvd, &req_id);
 		if (bytes_recvd > 0 &&
 		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
-			hyperv_receive_sub(hdev);
+			hyperv_receive_sub(hdev, bytes_recvd);
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: sashiko-bot @ 2026-05-17 14:17 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <20260517-drm-hyperv-patch2@berkoc.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Incomplete payload size validation causes wait-completion handlers to read stale request data, allowing response spoofing and invalid state.
--

commit ffe1d9057aa46275f644c2e152d0e7a9e8837c3b
Author: Berkant Koc <me@berkoc.com>

drm/hyperv: validate VMBus packet size in receive callback

This commit introduces bytes_recvd into hyperv_receive_sub() to validate
packet sizes. It restricts the memcpy length to bytes_recvd and checks
payload size for feature change messages.

Cc: stable@vger.kernel.org # 5.14+

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c

[ ... ]

> @@ -420,26 +420,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
>  	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
>  	struct synthvid_msg *msg;
> +	size_t hdr_size;
>  
>  	if (!hv)
>  		return;
>  
> +	hdr_size = sizeof(struct pipe_msg_hdr) +
> +		   sizeof(struct synthvid_msg_hdr);
> +	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)
> +		return;
> +
>  	msg = (struct synthvid_msg *)hv->recv_buf;
>  
>  	/* Complete the wait event */
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		memcpy(hv->init_buf, msg, bytes_recvd);

Could this change introduce a regression where a short packet leaves stale
request data in hv->init_buf?

The patch bounds the memcpy to bytes_recvd, but it doesn't appear to validate
that bytes_recvd is large enough to contain the expected response payloads
before calling complete(&hv->wait).

Because hv->init_buf is used for both sending requests and receiving responses,
a short response might only partially overwrite hv->init_buf. The payload
portion would retain the outgoing request's data or zeroes from the pre-send
initialization.

For example, when sending a VRAM location update, the driver sets
msg->vram.user_ctx = vram_pp. If a short response arrives, the payload is not
overwritten. The waiter checks if (msg->vram_ack.user_ctx != vram_pp). Since
vram and vram_ack share a union in struct synthvid_msg, this check might pass
against the driver's own request data, allowing a truncated packet to spoof a
successful ACK.

Similarly, in hyperv_get_supported_resolution(), if a truncated response
includes the resolution_count but omits the supported_resolution array, the
uncopied array memory remains zero. The driver could parse zeroed dimensions,
setting hv->preferred_width and height to 0, which could propagate to the DRM
core.

Would it be safer to have the receive handler validate the specific payload
sizes for these wait-completion response types, similar to how it validates
SYNTHVID_FEATURE_CHANGE below?

>  		complete(&hv->wait);
>  		return;
>  	}
>  
>  	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
> +		if (bytes_recvd < hdr_size +
> +		    sizeof(struct synthvid_feature_change))
> +			return;
>  		hv->dirt_needed = msg->feature_chg.is_dirt_needed;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517-drm-hyperv-cover@berkoc.com?part=2

^ permalink raw reply
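
The aliasing concern in the review above can be reproduced in a small
userspace sketch. Everything here is a simplified assumption: toy_msg is a
hypothetical stand-in for struct synthvid_msg with only one request/response
pair, the type values are made up, and check_len models the kind of
per-type length check the review asks about rather than any code from the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_VRAM_REQ = 1, TOY_VRAM_ACK = 2 };

struct toy_hdr {
	uint32_t type;
	uint32_t size;
};

/* Request and response payloads share storage, as in the driver's union. */
struct toy_msg {
	struct toy_hdr hdr;
	union {
		struct { uint64_t user_ctx; } vram;     /* outgoing request */
		struct { uint64_t user_ctx; } vram_ack; /* incoming response */
	} u;
};

/*
 * Copy a reply of reply_len bytes over the buffer that still holds the
 * request, then apply the waiter's user_ctx comparison. reply_len must not
 * exceed sizeof(struct toy_msg). Returns true if the ACK is accepted.
 */
static bool toy_short_ack_accepted(size_t reply_len, bool check_len)
{
	const uint64_t vram_pp = 0x1234000;
	struct toy_msg init_buf, reply;

	/* The request is built in the shared buffer. */
	memset(&init_buf, 0, sizeof(init_buf));
	init_buf.hdr.type = TOY_VRAM_REQ;
	init_buf.u.vram.user_ctx = vram_pp;

	/* The host's reply does NOT echo vram_pp in its payload. */
	memset(&reply, 0, sizeof(reply));
	reply.hdr.type = TOY_VRAM_ACK;

	/* Optional per-type check: header plus the full ACK payload. */
	if (check_len &&
	    reply_len < sizeof(struct toy_hdr) + sizeof(reply.u.vram_ack))
		return false;

	memcpy(&init_buf, &reply, reply_len);
	return init_buf.u.vram_ack.user_ctx == vram_pp;
}

/* Usage: a header-only reply is accepted unless the length is checked. */
static void toy_short_ack_demo(void)
{
	assert(toy_short_ack_accepted(sizeof(struct toy_hdr), false));
	assert(!toy_short_ack_accepted(sizeof(struct toy_hdr), true));
}
```

With a header-only copy the payload bytes are never overwritten, so the
comparison runs against the request's own user_ctx and succeeds.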

* Re: [PATCH v2 1/2] drm/hyperv: validate resolution_count and harden VSP request paths
From: sashiko-bot @ 2026-05-17 14:47 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <20260517-drm-hyperv-patch1-v2@berkoc.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] The fix for delayed VSP responses is incomplete, leaving a critical data race on the shared `hv->init_buf` that allows type confusion and out-of-bounds memory reads.
--

commit 3d049f39a946f7898e25e92a422ae95b066e9989
Author: Berkant Koc <me@berkoc.com>

drm/hyperv: validate resolution_count and harden VSP request paths

This commit adds bounds checking for resolution_count when parsing
SYNTHVID_RESOLUTION_RESPONSE. It degrades gracefully to default
resolutions on failure, and adds reinit_completion() before VSP requests
to prevent consuming stale completions.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
[ ... ]
> @@ -383,6 +385,7 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		sizeof(struct synthvid_supported_resolution_req);
>  	msg->resolution_req.maximum_resolution_count =
>  		SYNTHVID_MAX_RESOLUTION_COUNT;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);

Does this reinit_completion() prevent a data race on the shared hv->init_buf?

If a timed-out request's delayed response arrives while a new request is
being constructed, could hyperv_receive_sub() concurrently overwrite
hv->init_buf without holding any locks?

Since the buffer is used for both sending and receiving, if a delayed
response arrives exactly between the memset(msg, 0) and
hyperv_sendpacket(hdev, msg), could msg->vid_hdr.size be overwritten by
the host's response?

If a host provides a crafted size, could this cause an integer overflow in
totalbytes_towrite within hv_ringbuffer_write(), bypassing ring buffer
free-space checks and causing an out-of-bounds read of guest memory into
the VMBus ring buffer?

Additionally, if the delayed response triggers complete(&hv->wait), is
there a chance the driver parses a mismatched response because it does not
validate that msg->vid_hdr.type matches the expected response?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517-drm-hyperv-cover-v2@berkoc.com?part=1

^ permalink raw reply

* Re: [PATCH v2 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: sashiko-bot @ 2026-05-17 15:13 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <20260517-drm-hyperv-patch2-v2@berkoc.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Incomplete validation of synchronous response payload sizes causes state leakage and deterministic false success.
--

commit 31b0cdbb563323a78bd6df8c18d507462a03a084
Author: Berkant Koc <me@berkoc.com>

drm/hyperv: validate VMBus packet size in receive callback

This commit updates the hyperv_receive callback to validate the received
packet size before processing it. It ensures that the packet contains at
least the message headers, passes bytes_recvd to the sub-handler, and
uses it to bound the memcpy for synchronous responses.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 3b5065fe06e46..cdab4895dd404 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
[ ... ]
> @@ -423,26 +423,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
>  
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
>  	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
>  	struct synthvid_msg *msg;
> +	size_t hdr_size;
>  
>  	if (!hv)
>  		return;
>  
> +	hdr_size = sizeof(struct pipe_msg_hdr) +
> +		   sizeof(struct synthvid_msg_hdr);
> +	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)
> +		return;
> +
>  	msg = (struct synthvid_msg *)hv->recv_buf;
>  
>  	/* Complete the wait event */
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		memcpy(hv->init_buf, msg, bytes_recvd);

Is it possible for a truncated response to incorrectly report success here?

Since hv->init_buf is reused for both sending requests and receiving responses,
and their structures alias each other in the synthvid_msg union, a short packet
would leave the payload section of hv->init_buf unmodified.

For example, if a truncated SYNTHVID_VRAM_LOCATION_ACK packet contains only
headers, memcpy() will only copy the headers. The request's vram.user_ctx and
the response's vram_ack.user_ctx reside at the exact same memory offset in the
union. Because the payload in hv->init_buf is unmodified, msg->vram_ack.user_ctx
will perfectly retain the request's vram_pp value.

The waiting thread checks this with:

	if (msg->vram_ack.user_ctx != vram_pp)

Because the memory was unchanged, this check will deterministically evaluate
to false, causing the driver to proceed as if the VRAM location was
successfully acknowledged despite the malformed packet.

Should this code validate that bytes_recvd is large enough to contain the
specific response type's payload before copying and calling complete()?

>  		complete(&hv->wait);
>  		return;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517-drm-hyperv-cover-v2@berkoc.com?part=2

^ permalink raw reply
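
One possible shape for the per-type size validation suggested in the review
above is a lookup of the minimum acceptable length for each message type.
This is a sketch only: the toy_* structs, their fields, and the type values
are hypothetical and do not match the driver's wire format; the 16 KiB limit
is taken from the commit message's description of recv_buf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound, per the "16 KiB recv_buf" in the commit message. */
#define TOY_MAX_PACKET_SIZE (16 * 1024)

enum toy_type {
	TOY_VERSION_RESPONSE = 1,
	TOY_VRAM_ACK,
	TOY_FEATURE_CHANGE,
};

/* Hypothetical headers and payloads, for size arithmetic only. */
struct toy_pipe_hdr { uint32_t type; uint32_t size; };
struct toy_vid_hdr { uint32_t type; uint32_t size; };
struct toy_version_resp { uint32_t version; uint8_t is_accepted; };
struct toy_vram_ack { uint64_t user_ctx; };
struct toy_feature_change { uint8_t is_dirt_needed; };

#define TOY_HDR_SIZE \
	(sizeof(struct toy_pipe_hdr) + sizeof(struct toy_vid_hdr))

/* Minimum length for a known type, or 0 if the type is not recognised. */
static size_t toy_min_len(uint32_t type)
{
	switch (type) {
	case TOY_VERSION_RESPONSE:
		return TOY_HDR_SIZE + sizeof(struct toy_version_resp);
	case TOY_VRAM_ACK:
		return TOY_HDR_SIZE + sizeof(struct toy_vram_ack);
	case TOY_FEATURE_CHANGE:
		return TOY_HDR_SIZE + sizeof(struct toy_feature_change);
	default:
		return 0;
	}
}

/* Accept a packet only if it is long enough for its own type. */
static bool toy_packet_ok(uint32_t type, size_t bytes_recvd)
{
	size_t min_len = toy_min_len(type);

	return min_len != 0 && bytes_recvd >= min_len &&
	       bytes_recvd <= TOY_MAX_PACKET_SIZE;
}

/* Usage: a header-only VRAM ACK is rejected before any copy or complete(). */
static void toy_packet_ok_demo(void)
{
	assert(!toy_packet_ok(TOY_VRAM_ACK, TOY_HDR_SIZE));
	assert(toy_packet_ok(TOY_VRAM_ACK,
			     TOY_HDR_SIZE + sizeof(struct toy_vram_ack)));
}
```

A variable-length response such as the resolution list is left out on
purpose; its minimum size would presumably also depend on the count field
carried in the payload.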

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Erni Sri Satya Vennela @ 2026-05-18  7:04 UTC (permalink / raw)
  To: Yury Norov
  Cc: Shradha Gupta, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <af4X_52txN28b9RV@yury>

> > But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
> > NULL);" is essentially a no-op and we end up relying on the initial
> > placement from pci_alloc_irq_vectors().
> 
> Yes you are, assuming you're not binding them before in your call chain.
> 
> > Even though in these tests we
> > were not able to reproduce it, but with this distribution there is a
> > chance we end up clustering the mana queue IRQs, while other vCPUs are
> > not running any network load.
> 
> That sounds like an IRQ balancer bug which you're unable to reproduce. 
> 
> > It's because the placement depends on
> > system-wide IRQ state at allocation time.
> 
> I don't understand this point. The 
> 
>         irq_set_affinity_and_hint(*irqs++, NULL);
> 
> simply means: I trust system IRQ balancer to pick the best CPU for my
> IRQ at runtime. It doesn't refer any "IRQ state at allocation time".
>   
> > The linear approach however guarantees each queue IRQ lands on a
> > distinct vCPU regardless of system state. Even after stressing the cpus
> > using stress-ng, we did not observe any significant throughput drop.
> 
> If you just do nothing, it would lead to the same numbers, right? What
> does that "non-significant throughput drop" mean? It sounds like the
> linear approach is slightly worse.

The numbers are not worse; they are almost the same in both cases.
> 
> --
> 
> So, as you can't demonstrate solid benefit for the 'linear' IRQ placement,
> I would just stick to the no-affinity logic.

Thank you, Yury.
We are investigating more test scenarios and trying to capture
numbers with both your proposed change and the one from this
patch. We will keep you updated on the results.


- Vennela

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-18  9:38 UTC (permalink / raw)
  To: Mukesh R
  Cc: Michael Kelley, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jgg@ziepe.ca,
	jacob.pan@linux.microsoft.com, tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <53754e0b-2af8-edd2-dfc0-293fac002a52@linux.microsoft.com>

On Fri, May 15, 2026 at 05:11:19PM -0700, Mukesh R wrote:
> On 5/15/26 09:53, Yu Zhang wrote:
> > On Fri, May 15, 2026 at 02:51:38PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
> > > > 
> > > > On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> > > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > > > > > 
> > > > > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> > > > > > This driver implements stage-1 IO translation within the guest OS.
> > > > > > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> > > > > > for:
> > > > > >   - Capability discovery
> > > > > >   - Domain allocation, configuration, and deallocation
> > > > > >   - Device attachment and detachment
> > > > > >   - IOTLB invalidation
> > > > > > 
> > > > > > The driver constructs x86-compatible stage-1 IO page tables in the
> > > > > > guest memory using consolidated IO page table helpers. This allows
> > > > > > the guest to manage stage-1 translations independently of vendor-
> > > > > > specific drivers (like Intel VT-d or AMD IOMMU).
> > > > > > 
> > > > > > Hyper-V consumes this stage-1 IO page table when a device domain is
> > > > > > created and configured, and nests it with the host's stage-2 IO page
> > > > > > tables, therefore eliminating the VM exits for guest IOMMU mapping
> > > > > > operations. For unmapping operations, VM exits to perform the IOTLB
> > > > > > flush are still unavoidable.
> > > > > > 
> > > > > > Hyper-V identifies each PCI pass-thru device by a logical device ID
> > > > > > in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> > > > > > per-bus portion of this ID with the pvIOMMU driver during bus probe.
> > > > > > The pvIOMMU driver stores this mapping and combines it with the function
> > > > > > number of the endpoint PCI device to form the complete ID for hypercalls.
> > > > > 
> > > > > As you are probably aware, Mukesh's patch series to support PCI
> > > > > pass-thru devices also needs to get the logical device ID. Maybe the
> > > > > registration mechanism needs to move somewhere that can be shared
> > > > > with his code.
> > > > > 
> > > > 
> > > > Thank you so much for the review, Michael!
> > > > 
> > > > Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
> > > > in pci-hyperv.c has the same dev_instance byte manipulation. We do need
> > > > a common registration mechanism.
> > > > 
> > > > Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
> > > > natural place, but the register/lookup functions are currently only
> > > > meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
> > > > code also needs them, we might need a new shared Kconfig option that
> > > > both can select. Open to better ideas.
> > > 
> > > Unfortunately, I have not looked at Mukesh's series in detail yet, so
> > > I don't have enough knowledge of the full situation to offer a good
> > > recommendation.
> > > 
> > 
> > Sorry I forgot to Cc Mukesh in the previous reply. :(
> > @Mukesh, any thoughts on sharing the logical device ID registration mechanism?
> 
> Yeah, I went round and round trying to find the best place. I almost
> created virt/hyperv/hv_utils.c file. Maybe that is the best place?

Thanks for thinking about this, Mukesh!

I'm a bit hesitant about introducing virt/hyperv/. Currently virt/
only hosts KVM's architecture-neutral hypervisor core, and it feels
like the wrong layer for driver-level utility code. Wouldn't
drivers/hv/ be a more natural fit?

I'm also thinking about which config option should gate these new
interfaces (register/lookup etc.). I am using CONFIG_HYPERV_PVIOMMU,
and I guess you will probably propose another one for the host-side
change (or just CONFIG_MSHV_ROOT)?

B.R.
Yu

> 
> Thanks,
> -Mukesh
> 
> 
> > > > 
> > > > [...]
> > > > 
> > > > > > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> > > > > > +{
> > > > > > +	u64 status;
> > > > > > +	unsigned long flags;
> > > > > > +	struct hv_input_flush_device_domain *input;
> > > > > > +
> > > > > > +	local_irq_save(flags);
> > > > > > +
> > > > > > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > > > > +	memset(input, 0, sizeof(*input));
> > > > > > +	input->device_domain = hv_domain->device_domain;
> > > > > 
> > > > > The previous version of this patch had code to set several other fields in
> > > > > the input. I wanted to confirm that not setting them in this version is
> > > > > intentional. Were they not needed?
> > > > > 
> > > > 
> > > > Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
> > > > individually. In this version, I just simplified it to a struct assignment.
> > > > No functional change.
> > > 
> > > Of course! I should have looked more closely at the details before making
> > > this comment. :-(
> > > 
> > > [...]
> > > 
> > > > > 
> > > > > Previous versions of this function did hv_iommu_detach_dev(). With that call
> > > > > removed from here, hv_iommu_detach_dev() is only called when attaching a
> > > > > domain to a device that already has a domain attached. Is it the case that
> > > > > Hyper-V doesn't require the detach as a cleanup step?
> > > > > 
> > > > 
> > > > The IOMMU core attaches the device to release_domain (our blocking domain)
> > > > before calling release_device(), so I believe the explicit detach in the RFC
> > > > was redundant. I simply didn't realize that at the time.
> > > > 
> > > 
> > > Got it. But after the IOMMU core attaches the device to the blocking
> > > domain, there's the possibility that the vPCI device is rescinded by
> > > Hyper-V and it goes away entirely. Or the device might be subjected
> > > to an "unbind/bind" cycle in Linux. Does the detach need to be done
> > > on the blocking domain in such cases? In this version of the patches, the
> > > Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
> > > seems a bit untidy at best, and I wonder if there are scenarios where
> > > Hyper-V will complain about the lack of balance.
> > > 
> > 
> > Thank you, Michael. May I ask what "the vPCI device is rescinded by
> > Hyper-V and it goes away entirely" mean?
> > 
> > I realized it's a bit untidy. But I want to understand this issue more
> > clearly first. :)
> > 
> > B.R.
> > Yu
> 

^ permalink raw reply

* Re: [PATCH net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
From: Erni Sri Satya Vennela @ 2026-05-18 11:38 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515194427.52CACC2BCB0@smtp.kernel.org>

 
> > @@ -269,13 +268,17 @@ static void mana_hwc_rx_event_handler(void *ctx, u32 gdma_rxq_id,
> >  	rx_req = &hwc_rxq->msg_buf->reqs[rx_req_idx];
> 
> Since this patch addresses host-controlled DMA memory vulnerabilities in
> Confidential VMs, could there be another out-of-bounds access just before
> the msg_id check?
> 
> Looking at how rx_req_idx is calculated earlier in this function:
> 
> 	sge = (struct gdma_sge *)(wqe + 8 + dma_oob->inline_oob_size_div4 * 4);
> 	rq_base_addr = hwc_rxq->msg_buf->mem_info.dma_handle;
> 	rx_req_idx = (sge->address - rq_base_addr) / hwc->max_req_msg_size;
> 
> Because sge->address is read directly from the untrusted DMA queue, a malicious
> host could provide an arbitrary value, resulting in an out-of-bounds
> rx_req_idx.
> 
> Using this unvalidated index to access hwc_rxq->msg_buf->reqs[rx_req_idx]
> yields an attacker-controlled rx_req pointer. When this corrupted rx_req is
> subsequently passed to mana_hwc_post_rx_wqe(), the driver writes to req->sge
> and req->wqe_req, which could allow arbitrary writes to guest kernel memory.
> 
> Should rx_req_idx be bounds-checked against the size of the reqs array
> before it is used to index the array?
> 
> >  	resp = (struct gdma_resp_hdr *)rx_req->buf_va;
> >  
> > -	if (resp->response.hwc_msg_id >= hwc->num_inflight_msg) {
> > -		dev_err(hwc->dev, "HWC RX: wrong msg_id=%u\n",
> > -			resp->response.hwc_msg_id);
> > +	/* Read msg_id once from DMA buffer to prevent TOCTOU:
> > +	 * DMA memory is shared/unencrypted in CVMs - host can
> > +	 * modify it between reads.
> > +	 */
> > +	msg_id = READ_ONCE(resp->response.hwc_msg_id);
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260514194156.466823-1-ernis@linux.microsoft.com?part=1

This issue is already addressed in a separate patch by Aditya Garg
<gargaditya@linux.microsoft.com>, which is currently under internal
review.

- Vennela
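
For readers following the thread, here is a minimal userspace sketch of the
two checks discussed above: bounds-checking an index derived from an
untrusted address, and fetching a shared field exactly once before
validating it. All demo_* names and the struct layout are hypothetical. This
is not the driver's code and not the separate patch mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the RX message buffer; not the driver's struct. */
struct demo_rx_buf {
	uint64_t base_addr;	/* DMA base address of the request area */
	uint32_t req_size;	/* size of one request slot, in bytes */
	uint32_t num_reqs;	/* number of slots in the reqs array */
};

/*
 * Translate an untrusted address into a slot index. Returns true and
 * stores the index only if the address maps to a slot inside the buffer.
 */
static bool demo_addr_to_idx(const struct demo_rx_buf *buf,
			     uint64_t untrusted_addr, uint32_t *idx)
{
	uint64_t slot;

	assert(buf->req_size != 0);

	/* Reject addresses below the base so the subtraction cannot wrap. */
	if (untrusted_addr < buf->base_addr)
		return false;

	slot = (untrusted_addr - buf->base_addr) / buf->req_size;
	if (slot >= buf->num_reqs)
		return false;

	*idx = (uint32_t)slot;
	return true;
}

/*
 * Fetch a shared field exactly once (a userspace approximation of
 * READ_ONCE()), then validate and hand out only the local copy.
 */
static bool demo_fetch_msg_id(const volatile uint16_t *shared,
			      uint16_t num_inflight, uint16_t *msg_id)
{
	uint16_t id = *shared;	/* single fetch from shared memory */

	if (id >= num_inflight)
		return false;

	*msg_id = id;
	return true;
}
```

The point of the second helper is that the value that gets validated is the
same value that gets used, no matter what the other side writes afterwards.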

^ permalink raw reply

* Re: [PATCH v3 32/41] x86/tsc: Rejects attempts to override TSC calibration with lesser routine
From: Sean Christopherson @ 2026-05-18 19:17 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515201628.9E3C5C2BCB3@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Spurious WARN_ON stack trace on every boot for CoCo (SEV-SNP/TDX) guests under KVM.
> --
> 
> commit e57255e4d076143a67d0e6e44f914aec31aed8c7
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/tsc: Rejects attempts to override TSC calibration with lesser routine
> 
> This commit adds a sanity check when registering a TSC frequency calibration
> routine to ensure the incoming routine is as robust as the outgoing one,
> rejecting it with a warning if it is not. This prevents replacing a trusted
> calibration source with a less trusted one.
> 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 98bef1d06fa9d..7a261214fa3ef 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1319,8 +1319,13 @@ void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> >  
> >  	if (properties & TSC_FREQUENCY_KNOWN)
> >  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)))
> > +		return;
> > +
> >  	if (properties & TSC_RELIABLE)
> >  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > +	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_RELIABLE)))
> > +		return;
> 
> Will this cause a spurious warning stack trace on every boot for CoCo
> (SEV-SNP/TDX) guests under KVM?
> 
> For KVM CoCo guests, CoCo registers first via tsc_early_init(), setting both
> TSC_FREQUENCY_KNOWN and X86_FEATURE_TSC_RELIABLE.
> 
> KVM then registers second via kvmclock_init(). At this point in the patch
> series, KVM only passes TSC_FREQUENCY_KNOWN, so the properties & TSC_RELIABLE
> check is false, and it falls through to this WARN_ON. Since CoCo already set
> the capability, the condition evaluates to true.
> 
> I see this is fixed later in the series by the commit "x86/kvmclock: Mark TSC
> as reliable when it's constant and nonstop", which modifies KVM to pass
> TSC_FREQ_KNOWN_AND_RELIABLE on modern hardware, preventing this warning.
> 
> Could the patch ordering be adjusted to prevent this intermediate regression?

Ah shoot, no.  Well, not entirely.  Because when running as a CoCo guest with a
trusted TSC, the kernel needs to prevent overwriting the TSC calibration, *period*.
I.e. changing the ordering will eliminate the unwanted WARN, but it won't fix the
underlying goof that the trusted calibration routines are still being overwritten
with untrusted routines.

Hrm, the SNP secure TSC code complicates things, but I suspect it's broken.  If
it's indeed broken, then I think we can add a TSC_TRUSTED flag and then do:

	if (WARN_ON_ONCE(!calibrate_tsc))
		return;

	if (properties & TSC_FREQUENCY_KNOWN)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)))
		return;

	if (properties & TSC_RELIABLE)
		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
	else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_RELIABLE)))
		return;

	if (!cpu_has_trusted_tsc() || (properties & TSC_TRUSTED))
		x86_platform.calibrate_tsc = calibrate_tsc;

	if (calibrate_cpu)
		x86_platform.calibrate_cpu = calibrate_cpu;


Tom / Nikunj,

Isn't it completely wrong to assume the CPU frequency is the same as the TSC
frequency?  The changelog says the difference "does not apply", but that makes
no sense.

    Use the GUEST_TSC_FREQ MSR to discover the TSC frequency instead of
    relying on kvm-clock based frequency calibration.  Override both CPU and
    TSC frequency calibration callbacks with securetsc_get_tsc_khz(). Since
    the difference between CPU base and TSC frequency does not apply in this
    case, the same callback is being used.

E.g. if the host passed through APERF/MPERF, then the difference most definitely
applies.  If TSC != CPU frequency, then knowingly using bad data is even worse
(far, far worse) than hoping the untrusted host is playing nice.

If the TSC and CPU frequencies are somehow guaranteed to be the same (which I
can't possibly imagine is the case), then the above won't work because we also
want to prevent overriding calibrate_cpu().
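
To make the proposed flow easier to reason about, here is a rough userspace
model of the registration logic above. The flag values, struct demo_platform
and the demo_* helpers are assumptions made up for the model. TSC_TRUSTED
and cpu_has_trusted_tsc() are only proposals from this mail, and the WARNs
are modelled as a false return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values are invented for this model; only the names come from the thread. */
#define TSC_FREQUENCY_KNOWN	(1u << 0)
#define TSC_RELIABLE		(1u << 1)
#define TSC_TRUSTED		(1u << 2)	/* proposed flag, hypothetical */

typedef unsigned long (*calibrate_fn)(void);

struct demo_platform {
	bool known_freq;	/* models X86_FEATURE_TSC_KNOWN_FREQ */
	bool reliable;		/* models X86_FEATURE_TSC_RELIABLE */
	bool trusted_tsc;	/* models the proposed cpu_has_trusted_tsc() */
	calibrate_fn calibrate_tsc;
	calibrate_fn calibrate_cpu;
};

/* Returns false where the proposed kernel code would WARN and bail out. */
static bool demo_register(struct demo_platform *p, calibrate_fn tsc,
			  calibrate_fn cpu, unsigned int properties)
{
	assert(p != NULL);

	if (!tsc)
		return false;

	if (properties & TSC_FREQUENCY_KNOWN)
		p->known_freq = true;
	else if (p->known_freq)
		return false;

	if (properties & TSC_RELIABLE)
		p->reliable = true;
	else if (p->reliable)
		return false;

	/* With a trusted TSC, only a trusted routine may replace calibrate_tsc. */
	if (!p->trusted_tsc || (properties & TSC_TRUSTED))
		p->calibrate_tsc = tsc;

	/* calibrate_cpu is still overwritten unconditionally in this model. */
	if (cpu)
		p->calibrate_cpu = cpu;

	return true;
}

/* Dummy calibration routines standing in for the CoCo and KVM callbacks. */
static unsigned long demo_coco_khz(void) { return 2000000; }
static unsigned long demo_kvm_khz(void) { return 1999000; }
```

In this model a later untrusted registration leaves calibrate_tsc alone but
still replaces calibrate_cpu, which is the gap the last paragraph above is
worried about.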

^ permalink raw reply

* [PATCH net 0/2] net: mana: Fix NULL dereferences during teardown after attach failure.
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

When mana_attach() fails (e.g. during queue allocation), the error
cleanup frees apc->tx_qp and apc->rxqs and sets them to NULL. Multiple
subsequent teardown paths can then dereference these NULL pointers,
causing kernel panics.

Patch 1 adds NULL guards in the low-level teardown functions
(mana_fence_rqs, mana_destroy_vport, mana_dealloc_queues) so they are
safe to call regardless of queue initialization state. This covers all
callers: mana_remove(), mana_change_mtu() recovery, and internal error
paths in mana_alloc_queues().

Patch 2 addresses the queue reset work handler specifically, where an
unconditional mana_detach() on an already-detached port caused
redundant teardown. It checks netif_device_present() to skip the detach
and directly retry mana_attach().

Dipayaan Roy (2):
  net: mana: Add NULL guards in teardown path to prevent panic on attach
    failure
  net: mana: Skip redundant detach in queue reset handler if already
    detached

 drivers/net/ethernet/microsoft/mana/mana_en.c | 77 ++++++++++++-------
 1 file changed, 48 insertions(+), 29 deletions(-)

-- 
2.43.0
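
As an illustration of the failure mode described in this cover letter, here
is a small userspace sketch of a teardown helper that tolerates a queue
array that was already freed and set to NULL by a failed allocation path.
The demo_* names and structs are hypothetical. Unlike the driver, this model
also frees the array itself inside the teardown helper, purely to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_rxq {
	int id;
};

struct demo_port {
	unsigned int num_queues;
	struct demo_rxq **rxqs;	/* NULL when never allocated or already freed */
};

/* Destroy all RX queues; safe when rxqs is NULL. Returns queues destroyed. */
static unsigned int demo_destroy_rxqs(struct demo_port *port)
{
	unsigned int i, destroyed = 0;

	assert(port != NULL);

	/* The guard: nothing to do if the array was never set up. */
	if (!port->rxqs)
		return 0;

	for (i = 0; i < port->num_queues; i++) {
		if (!port->rxqs[i])
			continue;
		free(port->rxqs[i]);
		port->rxqs[i] = NULL;
		destroyed++;
	}

	free(port->rxqs);
	port->rxqs = NULL;
	return destroyed;
}

/*
 * Allocate all queues. fail_at is a test knob that simulates an allocation
 * failure at that index; on failure everything is freed and NULLed.
 */
static int demo_alloc_rxqs(struct demo_port *port, unsigned int fail_at)
{
	unsigned int i;

	port->rxqs = calloc(port->num_queues, sizeof(*port->rxqs));
	if (!port->rxqs)
		return -1;

	for (i = 0; i < port->num_queues; i++) {
		if (i == fail_at)
			goto err;
		port->rxqs[i] = malloc(sizeof(**port->rxqs));
		if (!port->rxqs[i])
			goto err;
		port->rxqs[i]->id = (int)i;
	}
	return 0;

err:
	demo_destroy_rxqs(port);
	return -1;
}
```

With the guard in place, calling the teardown helper again after a failed
allocation is a no-op instead of a NULL dereference.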

^ permalink raw reply

* [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-1-dipayanroy@linux.microsoft.com>

When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. Multiple teardown paths such as
mana_remove(), mana_change_mtu() recovery, and internal error handling
in mana_alloc_queues() can subsequently call into functions that
dereference these pointers without NULL checks:

- mana_chn_setxdp() dereferences apc->rxqs[0], causing a NULL pointer
  dereference panic (CR2: 0000000000000000 at mana_chn_setxdp+0x26).
- mana_destroy_vport() iterates apc->rxqs without a NULL check.
- mana_fence_rqs() iterates apc->rxqs without a NULL check.
- mana_dealloc_queues() iterates apc->tx_qp without a NULL check.

Add NULL guards for apc->rxqs in mana_fence_rqs(),
mana_destroy_vport(), and before the mana_chn_setxdp() call. Add a
NULL guard for apc->tx_qp in mana_dealloc_queues() to skip TX queue
draining when TX queues were never allocated or already freed.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 70 +++++++++++--------
 1 file changed, 41 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9afc786b297a..0582803907a8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1727,6 +1727,9 @@ static void mana_fence_rqs(struct mana_port_context *apc)
 	struct mana_rxq *rxq;
 	int err;
 
+	if (!apc->rxqs)
+		return;
+
 	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
 		rxq = apc->rxqs[rxq_idx];
 		err = mana_fence_rq(apc, rxq);
@@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
 	struct mana_rxq *rxq;
 	u32 rxq_idx;
 
-	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
-		rxq = apc->rxqs[rxq_idx];
-		if (!rxq)
-			continue;
+	if (apc->rxqs) {
 
-		mana_destroy_rxq(apc, rxq, true);
-		apc->rxqs[rxq_idx] = NULL;
+		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
+			rxq = apc->rxqs[rxq_idx];
+			if (!rxq)
+				continue;
+
+			mana_destroy_rxq(apc, rxq, true);
+			apc->rxqs[rxq_idx] = NULL;
+		}
 	}
 
 	mana_destroy_txq(apc);
@@ -3269,7 +3275,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	if (apc->port_is_up)
 		return -EINVAL;
 
-	mana_chn_setxdp(apc, NULL);
+	if (apc->rxqs)
+		mana_chn_setxdp(apc, NULL);
 
 	if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
 		mana_pf_deregister_filter(apc);
@@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 * number of queues.
 	 */
 
-	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
-		tsleep = 1000;
-		while (atomic_read(&txq->pending_sends) > 0 &&
-		       time_before(jiffies, timeout)) {
-			usleep_range(tsleep, tsleep + 1000);
-			tsleep <<= 1;
-		}
-		if (atomic_read(&txq->pending_sends)) {
-			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
-			if (err) {
-				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
-					   err, atomic_read(&txq->pending_sends),
-					   txq->gdma_txq_id);
+	if (apc->tx_qp) {
+		for (i = 0; i < apc->num_queues; i++) {
+			txq = &apc->tx_qp[i].txq;
+			tsleep = 1000;
+			while (atomic_read(&txq->pending_sends) > 0 &&
+			       time_before(jiffies, timeout)) {
+				usleep_range(tsleep, tsleep + 1000);
+				tsleep <<= 1;
+			}
+			if (atomic_read(&txq->pending_sends)) {
+				err =
+				    pcie_flr(to_pci_dev(gd->gdma_context->dev));
+				if (err) {
+					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
+						   err,
+					    atomic_read(&txq->pending_sends),
+					    txq->gdma_txq_id);
+				}
+				break;
 			}
-			break;
 		}
-	}
 
-	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
-		while ((skb = skb_dequeue(&txq->pending_skbs))) {
-			mana_unmap_skb(skb, apc);
-			dev_kfree_skb_any(skb);
+		for (i = 0; i < apc->num_queues; i++) {
+			txq = &apc->tx_qp[i].txq;
+			while ((skb = skb_dequeue(&txq->pending_skbs))) {
+				mana_unmap_skb(skb, apc);
+				dev_kfree_skb_any(skb);
+			}
+			atomic_set(&txq->pending_sends, 0);
 		}
-		atomic_set(&txq->pending_sends, 0);
 	}
+
 	/* We're 100% sure the queues can no longer be woken up, because
 	 * we're sure now mana_poll_tx_cq() can't be running.
 	 */
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-1-dipayanroy@linux.microsoft.com>

When mana_per_port_queue_reset_work_handler() runs, it unconditionally
calls mana_detach() which attempts to tear down queues that are already
freed, leading to NULL pointer dereferences on apc->tx_qp and
apc->rxqs.

Check netif_device_present() in the reset handler and skip
mana_detach() when the port is already in detached state. This avoids
the redundant teardown and goes directly to mana_attach() to retry
bringing the port back up.

Fixes: 3b194343c250 ("net: mana: Implement ndo_tx_timeout and serialize queue resets per port.")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 0582803907a8..907efadf6fd6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
 		goto out;
 	}
 
+	/* The port is already detached if a previous detach succeeded but
+	 * the attach failed; skip mana_detach() and just retry mana_attach().
+	 */
+	if (!netif_device_present(ndev))
+		goto attach;
+
 	err = mana_detach(ndev, false);
 	if (err) {
 		netdev_err(ndev, "mana_detach failed: %d\n", err);
 		goto dealloc_pre_rxbufs;
 	}
 
+attach:
 	err = mana_attach(ndev);
 	if (err)
 		netdev_err(ndev, "mana_attach failed: %d\n", err);
-- 
2.43.0
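
The control flow added by this patch can be sketched in userspace roughly as
follows. This is a simplified model: struct demo_netdev, its "present" flag
(standing in for netif_device_present()) and the demo_* helpers are
hypothetical, and it assumes a successful detach clears the flag while a
successful attach sets it again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_netdev {
	bool present;		/* models netif_device_present() */
	bool attach_fails;	/* test knob: force the next attach to fail */
	unsigned int detach_calls;
	unsigned int attach_calls;
};

/* Tear the port down and mark it as no longer present. */
static int demo_detach(struct demo_netdev *nd)
{
	nd->detach_calls++;
	nd->present = false;
	return 0;
}

/* Bring the port up; it only becomes present again on success. */
static int demo_attach(struct demo_netdev *nd)
{
	nd->attach_calls++;
	if (nd->attach_fails)
		return -1;
	nd->present = true;
	return 0;
}

/* Reset handler: detach only if the port is still present, then attach. */
static int demo_queue_reset(struct demo_netdev *nd)
{
	int err;

	assert(nd != NULL);

	if (nd->present) {
		err = demo_detach(nd);
		if (err)
			return err;
	}

	return demo_attach(nd);
}
```

If the first reset detaches and then fails to attach, a second reset goes
straight to the attach step and never repeats the teardown.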


^ permalink raw reply related

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Sean Christopherson @ 2026-05-18 20:28 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515200929.19F2CC2BCB3@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Bypassing `kvm_sched_clock_init()` when `CONSTANT_TSC` is present
> ignores KVM's `PVCLOCK_TSC_STABLE_BIT`, leading to unfiltered backward time
> jumps in the core scheduler.
> --
> 
> commit 57cf68835ef70733c037ca9ffa04fd39e1d70027
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
> 
> This commit updates kvmclock_init to prefer the TSC over kvmclock for
> sched_clock when the TSC is constant, non-stop, and not marked unstable.
> It does this by bypassing kvm_sched_clock_init and instead modifying the
> TSC properties so that TSC becomes the preferred clocksource.
> 
> Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
> 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -416,22 +416,22 @@ void __init kvmclock_init(void)
> [ ... ]
> >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >  	    !check_tsc_unstable()) {
> >  		kvm_clock.rating = 299;
> >  		tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > +	} else {
> > +		kvm_sched_clock_init(stable);
> >  	}
> >  
> > -	kvm_sched_clock_init(stable);
> 
> Does bypassing kvm_sched_clock_init() when CONSTANT_TSC is present
> accidentally ignore KVM's PVCLOCK_TSC_STABLE_BIT?
> 
> CONSTANT_TSC indicates the hardware TSC rate doesn't change with power
> states, but it doesn't guarantee cross-CPU synchronization, such as across
> multi-socket systems or differing vCPU offsets from hotplug and migration.
> KVM explicitly signals this cross-CPU stability via PVCLOCK_TSC_STABLE_BIT,
> which is mapped to the stable variable.
> 
> Prior to this patch, kvm_sched_clock_init() was always called, invoking
> __paravirt_set_sched_clock() with the stable flag. This allowed the system
> to call clear_sched_clock_stable() if the hypervisor could not guarantee
> synchronization, enabling crucial per-CPU monotonicity filters in
> sched_clock_cpu().
> 
> By skipping kvm_sched_clock_init() entirely and registering
> TSC_FREQ_KNOWN_AND_RELIABLE, the kernel's native cross-CPU TSC synchronization
> checks and the clocksource watchdog appear to be bypassed.

This is very much intended behavior.

> If KVM clears PVCLOCK_TSC_STABLE_BIT, could this result in the guest
> ignoring the condition and using raw, unfiltered rdtsc() values?

Yes, that's the entire point.
 
Note, PVCLOCK_TSC_STABLE_BIT is still set in pvclock_read_flags(&hv_clock_boot[0].pvti),
so anything else that directly consumes PVCLOCK_TSC_STABLE_BIT still does the
right thing.  E.g. kvm_setup_vsyscall_timeinfo() will still set kvm_clock's vDSO
mode to VDSO_CLOCKMODE_PVCLOCK, to communicate that kvm_clock can be used for the
vDSO page, *if* it's chosen.

> When tasks migrate between vCPUs with differing TSC offsets, this regression
> might cause backward time jumps in sched_clock(), which breaks rq_clock
> monotonicity and corrupts CFS vruntime and min_vruntime calculations.

Only if the host messed up and incorrectly advertised CONSTANT+NONSTOP.
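
For context on what a "monotonicity filter" means in the review comments
quoted above, here is a deliberately tiny userspace sketch of the idea: a
reading is clamped so that the reported time never goes backwards. This is
not how sched_clock_cpu() is implemented (the kernel's filtering is more
involved), and struct demo_clock_filter and demo_filtered_clock() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_clock_filter {
	uint64_t last;	/* highest value handed out so far */
};

/* Return the raw reading, or the previous value if raw went backwards. */
static uint64_t demo_filtered_clock(struct demo_clock_filter *f, uint64_t raw)
{
	assert(f != NULL);

	if (raw > f->last)
		f->last = raw;

	return f->last;
}
```

Without such a filter, a backward jump in the raw counter is passed through
as-is, which is acceptable only when the counter really is constant and
synchronized, as argued above.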

^ permalink raw reply
