Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-15 16:53 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com, Mukesh R
In-Reply-To: <SN6PR02MB415734108A86BDFB66AEE4CED4042@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, May 15, 2026 at 02:51:38PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
> > 
> > On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > > >
> > > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> > > > This driver implements stage-1 IO translation within the guest OS.
> > > > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> > > > for:
> > > >  - Capability discovery
> > > >  - Domain allocation, configuration, and deallocation
> > > >  - Device attachment and detachment
> > > >  - IOTLB invalidation
> > > >
> > > > The driver constructs x86-compatible stage-1 IO page tables in the
> > > > guest memory using consolidated IO page table helpers. This allows
> > > > the guest to manage stage-1 translations independently of vendor-
> > > > specific drivers (like Intel VT-d or AMD IOMMU).
> > > >
> > > > Hyper-V consumes this stage-1 IO page table when a device domain is
> > > > created and configured, and nests it with the host's stage-2 IO page
> > > > tables, therefore eliminating the VM exits for guest IOMMU mapping
> > > > operations. For unmapping operations, VM exits to perform the IOTLB
> > > > flush are still unavoidable.
> > > >
> > > > Hyper-V identifies each PCI pass-thru device by a logical device ID
> > > > in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> > > > per-bus portion of this ID with the pvIOMMU driver during bus probe.
> > > > The pvIOMMU driver stores this mapping and combines it with the function
> > > > number of the endpoint PCI device to form the complete ID for hypercalls.
> > >
> > > As you are probably aware, Mukesh's patch series to support PCI
> > > pass-thru devices also needs to get the logical device ID. Maybe the
> > > registration mechanism needs to move somewhere that can be shared
> > > with his code.
> > >
> > 
> > Thank you so much for the review, Michael!
> > 
> > Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
> > in pci-hyperv.c has the same dev_instance byte manipulation. We do need
> > a common registration mechanism.
> > 
> > Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
> > natural place, but the register/lookup functions are currently only
> > meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
> > code also needs them, we might need a new shared Kconfig option that
> > both can select. Open to better ideas.
> 
> Unfortunately, I have not looked at Mukesh's series in detail yet, so
> I don't have enough knowledge of the full situation to offer a good
> recommendation.
> 

Sorry I forgot to Cc Mukesh in the previous reply. :(
@Mukesh, any thoughts on sharing the logical device ID registration mechanism?

> > 
> > [...]
> > 
> > > > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> > > > +{
> > > > +	u64 status;
> > > > +	unsigned long flags;
> > > > +	struct hv_input_flush_device_domain *input;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +
> > > > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > > +	memset(input, 0, sizeof(*input));
> > > > +	input->device_domain = hv_domain->device_domain;
> > >
> > > The previous version of this patch had code to set several other fields in
> > > the input. I wanted to confirm that not setting them in this version is
> > > intentional. Were they not needed?
> > >
> > 
> > Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
> > individually. In this version, I just simplified it to a struct assignment.
> > No functional change.
> 
> Of course! I should have looked more closely at the details before making
> this comment. :-(
> 
> [...]
> 
> > >
> > > Previous versions of this function did hv_iommu_detach_dev(). With that call
> > > removed from here, hv_iommu_detach_dev() is only called when attaching a
> > > domain to a device that already has a domain attached. Is it the case that
> > > Hyper-V doesn't require the detach as a cleanup step?
> > >
> > 
> > The IOMMU core attaches the device to release_domain (our blocking domain)
> > before calling release_device(), so I believe the explicit detach in the RFC
> > was redundant. I simply didn't realize that at the time.
> > 
> 
> Got it. But after the IOMMU core attaches the device to the blocking
> domain, there's the possibility that the vPCI device is rescinded by
> Hyper-V and it goes away entirely. Or the device might be subjected
> to an "unbind/bind" cycle in Linux. Does the detach need to be done
> on the blocking domain in such cases? In this version of the patches, the
> Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
> seems a bit untidy at best, and I wonder if there are scenarios where
> Hyper-V will complain about the lack of balance.
> 

Thank you, Michael. May I ask what "the vPCI device is rescinded by
Hyper-V and it goes away entirely" mean? 

I realized it's a bit untidy. But I want to understand this issue more 
clearly first. :)

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-15 16:23 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB41577D5EEC884EAE8AF5E14ED4072@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > 
> > Add page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
> > This hypercall accepts a list of (page_number, page_mask_shift) entries,
> > enabling finer-grained IOTLB invalidation compared to the domain-wide
> > HVCALL_FLUSH_DEVICE_DOMAIN used by hv_iommu_flush_iotlb_all().
> > 
> > hv_iommu_fill_iova_list() decomposes a contiguous IOVA range into a
> > minimal set of aligned power-of-two regions that fit in a single
> > hypercall input page. When the range exceeds the page capacity, the
> > code falls back to a full domain flush automatically.
> > 
> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> > ---
> >  drivers/iommu/hyperv/iommu.c | 91 +++++++++++++++++++++++++++++++++++-
> >  include/hyperv/hvgdk_mini.h  |  1 +
> >  include/hyperv/hvhdk_mini.h  | 17 +++++++
> >  3 files changed, 108 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> > index e5fc625314b5..3bca362b7815 100644
> > --- a/drivers/iommu/hyperv/iommu.c
> > +++ b/drivers/iommu/hyperv/iommu.c
> > @@ -486,10 +486,98 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> >  	hv_flush_device_domain(to_hv_iommu_domain(domain));
> >  }
> > 
> > +/* Max number of iova_list entries in a single hypercall input page. */
> > +#define HV_IOMMU_MAX_FLUSH_VA_COUNT \
> > +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_flush_device_domain_list)) / \
> > +	 sizeof(union hv_iommu_flush_va))
> > +
> > +/* Returned by hv_iommu_fill_iova_list() when the range exceeds the capacity */
> > +#define HV_IOMMU_FLUSH_VA_OVERFLOW	U16_MAX
> > +
> > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > +					  unsigned long start,
> > +					  unsigned long end)
> > +{
> > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> 
> "end" is an inclusive end address per comment in struct iommu_iotlb_gather.
> So a page aligned value would typically have 0xFFF as the low order 12 bits,
> and PAGE_ALIGN() will do the right thing. But I don't think the value is
> *required* to be page aligned.  If the value of "end" had 0x000 as the
> low order 12 bits, the above calculation would fail to include the page
> that has the address ending in 0x000.  I think it needs to be
> PAGE_ALIGN(end + 1) in order to work correctly for this corner case. 
> 

Good catch! Will use HVPFN_DOWN(start) and HVPFN_UP(end + 1) as you
suggested in your follow-up mail.

> > +	unsigned long nr_pages = end_pfn - start_pfn;
> > +	u16 count = 0;
> > +
> > +	while (nr_pages > 0) {
> > +		unsigned long flush_pages;
> > +		int order;
> > +		unsigned long pfn_align;
> > +		unsigned long size_align;
> > +
> > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > +			break;
> > +		}
> > +
> > +		if (start_pfn)
> > +			pfn_align = __ffs(start_pfn);
> 
> I don't understand why __ffs() is correct here. I would expect
> __fls() so it is consistent with the calculation of size_align. But I
> can only surmise how the hypercall works since there's no
> documentation, so maybe my understanding of the hypercall is
> wrong.   If __ffs really is correct, a comment explaining why
> would help. :-)
> 

The use of __ffs() is intentional. Each flush entry invalidates a
naturally aligned 2^N page block, and the hypervisor requires the
page_number to be aligned to 2^page_mask_shift.

Here __ffs() and __fls() serve different purposes:
- __ffs(start_pfn) is about the alignment constraint, e.g.,  how
large a block can this address support?
- __fls(nr_pages) is about  the size constraint, e.g., how large
a block can the remaining range hold?

Taking min() of both ensures each entry is both properly aligned
and within bounds.

Thanks for raising this — it definitely deserves a comment. I had to
stare at it for a while myself to remember why. :)

> > +		else
> > +			pfn_align = BITS_PER_LONG - 1;
> > +
> > +		size_align = __fls(nr_pages);
> > +		order = min(pfn_align, size_align);
> > +		iova_list[count].page_mask_shift = order;
> > +		iova_list[count].page_number = start_pfn;
> > +
> > +		flush_pages = 1UL << order;
> > +		start_pfn += flush_pages;
> > +		nr_pages -= flush_pages;
> > +		count++;
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > +					struct iommu_iotlb_gather *iotlb_gather)
> > +{
> > +	u64 status;
> > +	u16 count;
> > +	unsigned long flags;
> > +	struct hv_input_flush_device_domain_list *input;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +
> > +	input->device_domain = hv_domain->device_domain;
> > +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> 
> I would suggest moving the memset() and setting the input fields down
> under the "else" below so that they are parallel with the flush all case.
> 

I agree the structure should be more symmetric. Yet I guess the memset and
hv_iommu_fill_iova_list() need to stay before the branch since the fill
writes directly into input->iova_list[]. :)

> > +	count = hv_iommu_fill_iova_list(input->iova_list,
> > +					iotlb_gather->start,
> > +					iotlb_gather->end);
> > +	if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > +		/*
> > +		 * Range exceeds hypercall page capacity. Fall back to a full
> > +		 * domain flush.
> > +		 */
> > +		struct hv_input_flush_device_domain *flush_all = (void *)input;
> > +
> > +		memset(flush_all, 0, sizeof(*flush_all));
> > +		flush_all->device_domain = hv_domain->device_domain;
> > +		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > +					flush_all, NULL);
> > +	} else {
> > +		status = hv_do_rep_hypercall(
> > +				HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > +				count, 0, input, NULL);
> > +	}
> > +
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);
> 
> As Sashiko pointed out, a failure here can lead to all kinds of trouble because
> of leaving unflushed entries. Maybe a WARN() is more appropriate? Also, maybe
> a failure in the list flush should try a flush all as a fallback, with the WARN()
> only if the flush all fails.
> 

Good idea. How about we restructure this routine to sth. like this:


	memset(input, 0, sizeof(*input));
	count = hv_iommu_fill_iova_list(...);
	
	if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
		input->device_domain = ...;
		...
		status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
		if (hv_result_success(status))
			goto out;
	}
	
	/* overflow or list flush failed: fallback to full domain flush */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = ...;
	status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
	WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n", status);

	out:
		local_irq_restore(flags);

B.R.
Yu

^ permalink raw reply

* Re: [PATCH 0/9] drm: Limit DRM_IOCTL_WAIT_VBLANK to vblank interrupts
From: Michel Dänzer @ 2026-05-15 15:12 UTC (permalink / raw)
  To: Thomas Zimmermann, simona, airlied, pekka.paalanen, jadahl,
	contact, maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

On 5/15/26 13:55, Thomas Zimmermann wrote:
> DRM's WAIT_VBLANK ioctl synchronizes user-space clients to display
> refresh. This is meaningless with vblank timers, which run unrelated
> to the hardware's vblank.
> 
> Disable the ioctl for simulated vblanks. Set DRM_VBLANK_FLAG_SIMULATED
> for CRTCs with simulated vblank events in all such drivers. The vblank
> timers of these devices still rate-limit the number of page-flip events
> to match the display refresh.
> 
> According to maintainers, user-space compositors do not require the ioctl
> for rate-limitting display output. Weston and Kwin rely on page-flip
> events. Mutter uses and internal timer to limit the number of display
> updates per second.

Actually mutter fundamentally relies on atomic commit completion events for that, same as Weston & KWin. Mutter uses the WAIT_VBLANK ioctl only for minimizing input → output latency (which can hide issues when completion of atomic commits isn't properly throttled).


(Just a side not on the cover letter, no objections to the patches themselves)


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

^ permalink raw reply

* RE: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Michael Kelley @ 2026-05-15 14:51 UTC (permalink / raw)
  To: Yu Zhang, Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <qeyycsdnejwrqle4zwrvkjvkvrpjifeanwxjaa7i7y2ab7rnt2@b6gvugqayarg>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
> 
> On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > >
> > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> > > This driver implements stage-1 IO translation within the guest OS.
> > > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> > > for:
> > >  - Capability discovery
> > >  - Domain allocation, configuration, and deallocation
> > >  - Device attachment and detachment
> > >  - IOTLB invalidation
> > >
> > > The driver constructs x86-compatible stage-1 IO page tables in the
> > > guest memory using consolidated IO page table helpers. This allows
> > > the guest to manage stage-1 translations independently of vendor-
> > > specific drivers (like Intel VT-d or AMD IOMMU).
> > >
> > > Hyper-V consumes this stage-1 IO page table when a device domain is
> > > created and configured, and nests it with the host's stage-2 IO page
> > > tables, therefore eliminating the VM exits for guest IOMMU mapping
> > > operations. For unmapping operations, VM exits to perform the IOTLB
> > > flush are still unavoidable.
> > >
> > > Hyper-V identifies each PCI pass-thru device by a logical device ID
> > > in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> > > per-bus portion of this ID with the pvIOMMU driver during bus probe.
> > > The pvIOMMU driver stores this mapping and combines it with the function
> > > number of the endpoint PCI device to form the complete ID for hypercalls.
> >
> > As you are probably aware, Mukesh's patch series to support PCI
> > pass-thru devices also needs to get the logical device ID. Maybe the
> > registration mechanism needs to move somewhere that can be shared
> > with his code.
> >
> 
> Thank you so much for the review, Michael!
> 
> Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
> in pci-hyperv.c has the same dev_instance byte manipulation. We do need
> a common registration mechanism.
> 
> Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
> natural place, but the register/lookup functions are currently only
> meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
> code also needs them, we might need a new shared Kconfig option that
> both can select. Open to better ideas.

Unfortunately, I have not looked at Mukesh's series in detail yet, so
I don't have enough knowledge of the full situation to offer a good
recommendation.

> 
> [...]
> 
> > > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> > > +{
> > > +	u64 status;
> > > +	unsigned long flags;
> > > +	struct hv_input_flush_device_domain *input;
> > > +
> > > +	local_irq_save(flags);
> > > +
> > > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > +	memset(input, 0, sizeof(*input));
> > > +	input->device_domain = hv_domain->device_domain;
> >
> > The previous version of this patch had code to set several other fields in
> > the input. I wanted to confirm that not setting them in this version is
> > intentional. Were they not needed?
> >
> 
> Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
> individually. In this version, I just simplified it to a struct assignment.
> No functional change.

Of course! I should have looked more closely at the details before making
this comment. :-(

[...]

> >
> > Previous versions of this function did hv_iommu_detach_dev(). With that call
> > removed from here, hv_iommu_detach_dev() is only called when attaching a
> > domain to a device that already has a domain attached. Is it the case that
> > Hyper-V doesn't require the detach as a cleanup step?
> >
> 
> The IOMMU core attaches the device to release_domain (our blocking domain)
> before calling release_device(), so I believe the explicit detach in the RFC
> was redundant. I simply didn't realize that at the time.
> 

Got it. But after the IOMMU core attaches the device to the blocking
domain, there's the possibility that the vPCI device is rescinded by
Hyper-V and it goes away entirely. Or the device might be subjected
to an "unbind/bind" cycle in Linux. Does the detach need to be done
on the blocking domain in such cases? In this version of the patches, the
Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
seems a bit untidy at best, and I wonder if there are scenarios where
Hyper-V will complain about the lack of balance.

Michael

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157FB81CC9B6347DCCC8C56D4072@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > 
> > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> > This driver implements stage-1 IO translation within the guest OS.
> > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> > for:
> >  - Capability discovery
> >  - Domain allocation, configuration, and deallocation
> >  - Device attachment and detachment
> >  - IOTLB invalidation
> > 
> > The driver constructs x86-compatible stage-1 IO page tables in the
> > guest memory using consolidated IO page table helpers. This allows
> > the guest to manage stage-1 translations independently of vendor-
> > specific drivers (like Intel VT-d or AMD IOMMU).
> > 
> > Hyper-V consumes this stage-1 IO page table when a device domain is
> > created and configured, and nests it with the host's stage-2 IO page
> > tables, therefore eliminating the VM exits for guest IOMMU mapping
> > operations. For unmapping operations, VM exits to perform the IOTLB
> > flush are still unavoidable.
> > 
> > Hyper-V identifies each PCI pass-thru device by a logical device ID
> > in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> > per-bus portion of this ID with the pvIOMMU driver during bus probe.
> > The pvIOMMU driver stores this mapping and combines it with the function
> > number of the endpoint PCI device to form the complete ID for hypercalls.
> 
> As you are probably aware, Mukesh's patch series to support PCI
> pass-thru devices also needs to get the logical device ID. Maybe the
> registration mechanism needs to move somewhere that can be shared
> with his code.
> 

Thank you so much for the review, Michael!

Yes, I looked at Mukesh's series and noticed his hv_pci_vmbus_device_id()
in pci-hyperv.c has the same dev_instance byte manipulation. We do need
a common registration mechanism.

Any suggestion on where to put it? drivers/hv/hv_common.c seems like a
natural place, but the register/lookup functions are currently only
meaningful when CONFIG_HYPERV_PVIOMMU is set. If Mukesh's pass-thru
code also needs them, we might need a new shared Kconfig option that
both can select. Open to better ideas.

Adding Mukesh to the loop. :)

> > 
> > Co-developed-by: Wei Liu <wei.liu@kernel.org>
> > Signed-off-by: Wei Liu <wei.liu@kernel.org>
> > Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> > ---
> >  arch/x86/hyperv/hv_init.c           |   4 +
> >  arch/x86/include/asm/mshyperv.h     |   4 +
> >  drivers/iommu/hyperv/Kconfig        |  17 +
> >  drivers/iommu/hyperv/Makefile       |   1 +
> >  drivers/iommu/hyperv/iommu.c        | 705 ++++++++++++++++++++++++++++
> >  drivers/iommu/hyperv/iommu.h        |  54 +++
> >  drivers/pci/controller/pci-hyperv.c |  19 +-
> >  include/asm-generic/mshyperv.h      |  12 +
> >  8 files changed, 815 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/iommu/hyperv/iommu.c
> >  create mode 100644 drivers/iommu/hyperv/iommu.h
> > 
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 323adc93f2dc..2c8ff8e06249 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -578,6 +578,10 @@ void __init hyperv_init(void)
> >  	old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
> >  	x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
> > 
> > +#ifdef CONFIG_HYPERV_PVIOMMU
> > +	x86_init.iommu.iommu_init = hv_iommu_init;
> > +#endif
> > +
> >  	hv_apic_init();
> > 
> >  	x86_init.pci.arch_init = hv_pci_init;
> > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > index f64393e853ee..20d947c2c758 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -313,6 +313,10 @@ static inline void mshv_vtl_return_hypercall(void) {}
> >  static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >  #endif
> > 
> > +#ifdef CONFIG_HYPERV_PVIOMMU
> > +int __init hv_iommu_init(void);
> > +#endif
> > +
> >  #include <asm-generic/mshyperv.h>
> > 
> >  #endif
> > diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
> > index 30f40d867036..9e658d5c9a77 100644
> > --- a/drivers/iommu/hyperv/Kconfig
> > +++ b/drivers/iommu/hyperv/Kconfig
> > @@ -8,3 +8,20 @@ config HYPERV_IOMMU
> >  	help
> >  	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
> >  	  guest and root partitions.
> > +
> > +if HYPERV_IOMMU
> > +config HYPERV_PVIOMMU
> > +	bool "Microsoft Hypervisor para-virtualized IOMMU support"
> > +	depends on X86 && HYPERV
> 
> What is the intent w.r.t. 32-bit builds? Using X86 instead of X86_64
> allows it. I did a 32-bit build and didn't get any build failures, which is
> good. But I can't run it to see if the pvIOMMU actually works in a
> 32-bit build. I don't know how building X86_64 generic PT entries
> would fare.
> 

Sorry, no intent to support 32-bit. I'll change to `depends on X86_64 && HYPERV`

[...]

> > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_flush_device_domain *input;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	input->device_domain = hv_domain->device_domain;
> 
> The previous version of this patch had code to set several other fields in
> the input. I wanted to confirm that not setting them in this version is
> intentional. Were they not needed?
> 

Oh. The RFC v1 set partition_id, owner_vtl, domain_id.type, and domain_id.id
individually. In this version, I just simplified it to a struct assignment.
No functional change.

> > +	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
> > +
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status %lld\n", status);
> > +}
> > +
> > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_detach_device_domain *input;
> > +	struct pci_dev *pdev;
> > +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > +	/* See the attach function, only PCI devices for now */
> > +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> > +		return;
> 
> Are these sanity checks necessary? The only caller is hv_iommu_attach_dev()
> and it has already done the checks.

You're right, they're redundant. 

> > +
> > +	pdev = to_pci_dev(dev);
> > +
> > +	dev_dbg(dev, "detaching from domain %d\n", hv_domain->device_domain.domain_id.id);
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	input->partition_id = HV_PARTITION_ID_SELF;
> > +	if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
> 
> As Sashiko and Jacob Pan pointed out, doing the lookup while interrupts are disabled
> is problematic. My suggestion would be to just do the lookup into a local variable
> before disabling interrupts (rather than using a raw spin lock as Jacob suggested).
> 
> Same situation occurs in hv_iommu_attach_dev() and
> hv_iommu_get_logical_device_property().
> 

Thanks! I would also prefer to look up before disabling interrupts.

> > +		local_irq_restore(flags);
> > +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on detach\n");
> > +		return;
> > +	}
> > +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> > +
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("HVCALL_DETACH_DEVICE_DOMAIN failed, status %lld\n", status);
> > +
> > +	hv_flush_device_domain(hv_domain);
> > +
> > +	vdev->hv_domain = NULL;
> > +}
> > +

[...]

> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > +					u32 code,
> > +					struct hv_output_get_logical_device_property *property)
> > +{
> > +	u64 status, lid;
> > +	unsigned long flags;
> > +	int ret;
> > +	struct hv_input_get_logical_device_property *input;
> > +	struct hv_output_get_logical_device_property *output;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input);
> 
> Nit: The other way to set output is:
> 
> 	output = input + 1;
> 
> I think this produces slightly better code because of not needing to
> reference the per-cpu variable hyperv_pcpu_input_arg a 2nd time.
> 
> 

Indeed! It's more elegant. :)

> > +	memset(input, 0, sizeof(*input));
> > +	input->partition_id = HV_PARTITION_ID_SELF;
> > +	ret = hv_iommu_lookup_logical_dev_id(to_pci_dev(dev), &lid);
> > +	if (ret) {
> > +		local_irq_restore(flags);
> > +		return ret;
> > +	}
> > +	input->logical_device_id = lid;
> > +	input->code = code;
> > +	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
> > +	*property = *output;
> > +
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed, status %lld\n", status);
> > +
> > +	return hv_result_to_errno(status);
> > +}
> > +

[...]

> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (pdev->ats_enabled)
> > +		pci_disable_ats(pdev);
> > +
> > +	dev_iommu_priv_set(dev, NULL);
> > +	set_dma_ops(dev, NULL);
> 
> Previous versions of this function did hv_iommu_detach_dev(). With that call
> removed from here, hv_iommu_detach_dev() is only called when attaching a
> domain to a device that already has a domain attached. Is it the case that
> Hyper-V doesn't require the detach as a cleanup step?
> 

The IOMMU core attaches the device to release_domain (our blocking domain)
before calling release_device(), so I believe the explicit detach in the RFC
was redundant. I simply didn't realize that at the time.

Sorry I forgot to mention this in the changelog.

[...]

> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> > +{
> > +	int ret;
> > +	struct hv_iommu_domain *hv_domain;
> > +	struct pt_iommu_x86_64_cfg cfg = {};
> > +
> > +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> > +	if (!hv_domain)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret) {
> > +		kfree(hv_domain);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> > +	hv_domain->pt_iommu.nid = dev_to_node(dev);
> > +
> > +	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> > +	cfg.common.hw_max_oasz_lg2 = 52;
> > +	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
> > +
> > +	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
> > +	if (ret) {
> > +		hv_delete_device_domain(hv_domain);
> > +		kfree(hv_domain);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	/* Constrain to page sizes the hypervisor supports */
> > +	hv_domain->domain.pgsize_bitmap &= hv_iommu_device->pgsize_bitmap;
> > +
> > +	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> > +
> > +	ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
> > +	if (ret) {
> > +		pt_iommu_deinit(&hv_domain->pt_iommu);
> > +		hv_delete_device_domain(hv_domain);
> > +		kfree(hv_domain);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	return &hv_domain->domain;
> 
> I think this function would be better if the error paths did "goto"
> a cascading set of error labels. That's the typical pattern, and it's what you
> use in hv_iommu_init(), for example.
> 

Good point. Will restructure to use goto-based error labels

> > +}
> > +
> > +static struct iommu_ops hv_iommu_ops = {
> > +	.capable		  = hv_iommu_capable,
> > +	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
> > +	.probe_device		  = hv_iommu_probe_device,
> > +	.release_device		  = hv_iommu_release_device,
> > +	.device_group		  = hv_iommu_device_group,
> > +	.get_resv_regions	  = hv_iommu_get_resv_regions,
> > +	.owner			  = THIS_MODULE,
> > +	.identity_domain	  = &hv_identity_domain.domain,
> > +	.blocked_domain		  = &hv_blocking_domain.domain,
> > +	.release_domain		  = &hv_blocking_domain.domain,
> > +};
> > +
> > +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_get_iommu_capabilities *input;
> > +	struct hv_output_get_iommu_capabilities *output;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*input);
> 
> Potentially use "output = input + 1" here as well.
> 

Yes. Thanks!

[...]

> > @@ -3857,13 +3858,25 @@ static int hv_pci_probe(struct hv_device *hdev,
> > 
> >  	hbus->state = hv_pcibus_probed;
> > 
> > -	ret = create_root_hv_pci_bus(hbus);
> > +	/* Notify pvIOMMU before any device on the bus is scanned. */
> > +	prefix = (hdev->dev_instance.b[5] << 24) |
> > +		 (hdev->dev_instance.b[4] << 16) |
> > +		 (hdev->dev_instance.b[7] <<  8) |
> > +		 (hdev->dev_instance.b[6] & 0xf8);
> 
> This assembling of the logical device id prefix duplicates the
> code in hv_irq_retarget_interrupt(). Could this code save the
> prefix in struct hv_pcibus_device, and then have
> hv_irq_retarget_interrupt() use it?  Then it would be clear
> that HVCALL_RETARGET_INTERRUPT is using exactly the same
> logical device id as the IOMMU hypercalls.
> 

Good point. I think we can do it. :)

> > +
> > +	ret = hv_iommu_register_pci_bus(dom, prefix);
> >  	if (ret)
> >  		goto free_windows;
> > 
> > +	ret = create_root_hv_pci_bus(hbus);
> > +	if (ret)
> > +		goto unregister_pviommu;
> > +
> >  	mutex_unlock(&hbus->state_lock);
> >  	return 0;
> > 
> > +unregister_pviommu:
> > +	hv_iommu_unregister_pci_bus(dom);
> >  free_windows:
> >  	hv_pci_free_bridge_windows(hbus);
> >  exit_d0:
> > @@ -3974,8 +3987,10 @@ static int hv_pci_bus_exit(struct hv_device *hdev, bool
> > keep_devs)
> >  static void hv_pci_remove(struct hv_device *hdev)
> >  {
> >  	struct hv_pcibus_device *hbus;
> > +	int dom;
> > 
> >  	hbus = hv_get_drvdata(hdev);
> > +	dom = hbus->bridge->domain_nr;
> 
> Nit: Setting "dom" here feels a little weird because the value is only needed
> under the "if" statement. The value must be read before the root bus is
> removed, but even so moving it under the "if" statement would make more
> sense to me.
> 

Sure. Thanks!

> >  	if (hbus->state == hv_pcibus_installed) {
> >  		tasklet_disable(&hdev->channel->callback_event);
> >  		hbus->state = hv_pcibus_removing;
> > @@ -3994,6 +4009,8 @@ static void hv_pci_remove(struct hv_device *hdev)
> >  		hv_pci_remove_slots(hbus);
> >  		pci_remove_root_bus(hbus->bridge->bus);
> >  		pci_unlock_rescan_remove();
> > +
> > +		hv_iommu_unregister_pci_bus(dom);
> >  	}
> > 
> >  	hv_pci_bus_exit(hdev, false);

B.R.
Yu

^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Yu Zhang @ 2026-05-15 13:58 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jacob Pan, Mukesh R, hpa@zytor.com, robin.murphy@arm.com,
	robh@kernel.org, wei.liu@kernel.org, muislam@microsoft.com,
	namjain@linux.microsoft.com, magnuskulke@linux.microsoft.com,
	anbelski@linux.microsoft.com, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	longli@microsoft.com, tglx@kernel.org, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	joro@8bytes.org, will@kernel.org, lpieralisi@kernel.org,
	kwilczynski@kernel.org, bhelgaas@google.com, arnd@arndb.de
In-Reply-To: <SN6PR02MB4157371A1A3931F582F9A2E9D4062@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 13, 2026 at 03:15:52AM +0000, Michael Kelley wrote:
> From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Tuesday, May 12, 2026 4:46 PM
> > 
> > Hi Mukesh,
> > 
> > On Mon, 11 May 2026 19:02:49 -0700
> > Mukesh R <mrathor@linux.microsoft.com> wrote:
> > 
> > > This file actually implements irq remapping, so rename to more
> > > appropriate hyperv-irq.c. A new file to implement hyperv iommu will
> > > be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
> > > but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
> > > CONFIG_IRQ_REMAP.
> > >
> > > Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > > ---
> > >  MAINTAINERS                                    | 2 +-
> > >  drivers/iommu/Makefile                         | 2 +-
> > >  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
> >
> > Given that we have multiple Hyper-V IOMMU-related files — this renamed
> > hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
> > series) and the recently posted guest pvIOMMU driver — should we create
> > a drivers/iommu/hyperv/ directory to consolidate them?
> 
> Patch 1/4 in the guest pvIOMMU driver [1] that was recently posted by
> Yu Zhang does as you suggest.
> 
> Michael
> 
> [1] https://lore.kernel.org/linux-hyperv/20260511162408.1180069-1-zhangyu1@linux.microsoft.com/
> 

Maybe we can send a standalone patch and get it merged
first (move it to drivers/iommu/hyperv/irq_remapping.c)?

The rename itself is a meaningful cleanup regardless of
either series, and in the future, both Mukesh and I can
can then build on top of it without conflicts. :)

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-15 12:38 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, jgg, mhklinux,
	tgopinath, easwar.hariharan
In-Reply-To: <20260513113952.00005b20@linux.microsoft.com>

[...]
> > diff --git a/drivers/iommu/hyperv/Kconfig
> > b/drivers/iommu/hyperv/Kconfig index 30f40d867036..9e658d5c9a77 100644
> > --- a/drivers/iommu/hyperv/Kconfig
> > +++ b/drivers/iommu/hyperv/Kconfig
> > @@ -8,3 +8,20 @@ config HYPERV_IOMMU
> >  	help
> >  	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
> >  	  guest and root partitions.
> > +
> > +if HYPERV_IOMMU
> > +config HYPERV_PVIOMMU
> > +	bool "Microsoft Hypervisor para-virtualized IOMMU support"
> > +	depends on X86 && HYPERV
> > +	select IOMMU_API
> > +	select GENERIC_PT
> > +	select IOMMU_PT
> > +	select IOMMU_PT_X86_64
> nit:
> If HYPERV_PVIOMMU is enabled on a (hypothetical) platform with
> GENERIC_ATOMIC64=y, the select would force-enable IOMMU_PT_X86_64 even
> though its depends on is unsatisfied — leading to a build failure.
> 
> In practice this can't happen today because HYPERV_PVIOMMU already
> depends on X86 && HYPERV, and x86 never sets GENERIC_ATOMIC64. But
> adding the explicit guard is more defensive.
> i.e.
>        depends on !GENERIC_ATOMIC64    # for cmpxchg64 in IOMMU_PT
> 

Good point. Will add "depends on !GENERIC_ATOMIC64".

[...]

> > +
> > +/*
> > + * Look up the logical device ID for a vPCI device. Returns 0 on
> > success
> > + * with *logical_id filled in; -ENODEV if no entry registered for
> > this
> > + * device's vPCI bus.
> > + */
> > +static int hv_iommu_lookup_logical_dev_id(struct pci_dev *pdev, u64
> > *logical_id) +{
> > +	struct hv_pci_busdata *bus;
> > +	int domain = pci_domain_nr(pdev->bus);
> > +	int ret = -ENODEV;
> > +
> > +	spin_lock(&hv_iommu_pci_bus_lock);
> this is called under local_irq_save, should use  raw_spinlock_t for RT
> kernel?
> 

Yes, this is problematic on PREEMPT_RT. Michael also suggested hoisting
the lookup before the local_irq_save() section instead of using a raw
spinlock, which I think is a great idea - all 3 call sites (detach_dev,
attach_dev, get_logical_device_property) will be changed.

> > +	list_for_each_entry(bus, &hv_iommu_pci_bus_list, list) {
> > +		if (bus->pci_domain_nr == domain) {
> > +			*logical_id =
> > (u64)bus->logical_dev_id_prefix |
> > +				      PCI_FUNC(pdev->devfn);
> > +			ret = 0;
> > +			break;
> > +		}
> > +	}
> > +	spin_unlock(&hv_iommu_pci_bus_lock);
> > +	return ret;
> > +}
> > +

[...]

> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (pdev->ats_enabled)
> > +		pci_disable_ats(pdev);
> > +
> > +	dev_iommu_priv_set(dev, NULL);
> > +	set_dma_ops(dev, NULL);
> I don't think this is necessary.
> 

Oh, yes. Thanks!

> > +
> > +	kfree(vdev);
> > +}
> > +
> > +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> > +{
> > +	if (dev_is_pci(dev))
> > +		return pci_device_group(dev);
> > +	else
> > +		return generic_device_group(dev);
> non pci device already rejected during attach, maybe we should warn
> here?
>         WARN_ON_ONCE(1);
>         return generic_device_group(dev);
> 

Good idea. Will add WARN_ON_ONCE(1).

> > +}
> > +
> > +static int hv_configure_device_domain(struct hv_iommu_domain
> > *hv_domain, u32 domain_type) +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct pt_iommu_x86_64_hw_info pt_info;
> > +	struct hv_input_configure_device_domain *input;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	input->device_domain = hv_domain->device_domain;
> > +	input->settings.flags.blocked = (domain_type ==
> > IOMMU_DOMAIN_BLOCKED);
> > +	input->settings.flags.translation_enabled = (domain_type !=
> > IOMMU_DOMAIN_IDENTITY); +
> Should this be:
> input->settings.flags.translation_enabled =
>      (domain_type & __IOMMU_DOMAIN_PAGING);
> Otherwise, blocked domain will have translation enabled. Maybe add some
> explanation of what HV expects.
> 
I do agree this is not intuitive, but current hypervisor implementation
requires "blocked == 1" to be paired with "translation_enabled = 1",
otherwise it returns HV_STATUS_INVALID_PARAMETER. But I can add some
comment at least.

Thanks for the thorough review, Jacob!

B.R.
Yu



^ permalink raw reply

* [PATCH 8/9] drm/virtgpu: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on virtgpu as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because
the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/virtio/virtgpu_display.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 44ffffec550f..558d8001c54f 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -381,7 +381,7 @@ int virtio_gpu_modeset_init(struct virtio_gpu_device *vgdev)
 	for (i = 0 ; i < vgdev->num_scanouts; ++i)
 		vgdev_output_init(vgdev, i);
 
-	ret = drm_vblank_init(vgdev->ddev, vgdev->num_scanouts);
+	ret = drmm_vblank_init(vgdev->ddev, vgdev->num_scanouts, DRM_VBLANK_FLAG_SIMULATED);
 	if (ret)
 		return ret;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 7/9] drm/qxl: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on qxl as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because
the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/qxl/qxl_display.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index a026bd35ef48..b808fdebbd89 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -1300,7 +1300,7 @@ int qxl_modeset_init(struct qxl_device *qdev)
 
 	qxl_display_read_client_monitors_config(qdev);
 
-	ret = drm_vblank_init(&qdev->ddev, qxl_num_crtc);
+	ret = drmm_vblank_init(&qdev->ddev, qxl_num_crtc, DRM_VBLANK_FLAG_SIMULATED);
 	if (ret)
 		return ret;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 9/9] drm/vkms: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on vkms as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because
the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/vkms/vkms_drv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.c b/drivers/gpu/drm/vkms/vkms_drv.c
index 5a640b531d88..c4cfa1e5ab01 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.c
+++ b/drivers/gpu/drm/vkms/vkms_drv.c
@@ -192,8 +192,8 @@ int vkms_create(struct vkms_config *config)
 		goto out_devres;
 	}
 
-	ret = drm_vblank_init(&vkms_device->drm,
-			      vkms_config_get_num_crtcs(config));
+	ret = drmm_vblank_init(&vkms_device->drm, vkms_config_get_num_crtcs(config),
+			       DRM_VBLANK_FLAG_SIMULATED);
 	if (ret) {
 		DRM_ERROR("Failed to vblank\n");
 		goto out_devres;
-- 
2.54.0


^ permalink raw reply related

* [PATCH 5/9] drm/cirrus: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on cirrus as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because
the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/tiny/cirrus-qemu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/tiny/cirrus-qemu.c b/drivers/gpu/drm/tiny/cirrus-qemu.c
index 075221b431d3..01522d1158b2 100644
--- a/drivers/gpu/drm/tiny/cirrus-qemu.c
+++ b/drivers/gpu/drm/tiny/cirrus-qemu.c
@@ -501,7 +501,7 @@ static int cirrus_pipe_init(struct cirrus_device *cirrus)
 	if (ret)
 		return ret;
 
-	ret = drm_vblank_init(dev, 1);
+	ret = drmm_vblank_init(dev, 1, DRM_VBLANK_FLAG_SIMULATED);
 	if (ret)
 		return ret;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 6/9] drm/hypervdrm: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on hypervdrm as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because the
output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
index 1bbb7de5ab49..24bed31c35e7 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
@@ -329,7 +329,7 @@ int hyperv_mode_config_init(struct hyperv_drm_device *hv)
 		return ret;
 	}
 
-	ret = drm_vblank_init(dev, 1);
+	ret = drmm_vblank_init(dev, 1, DRM_VBLANK_FLAG_SIMULATED);
 	if (ret)
 		return ret;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 4/9] drm/bochs: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on bochs as simulated, so that the WAIT_VBLANK
ioctl fails with an error. The ioctl should not be supported because
the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/tiny/bochs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/tiny/bochs.c b/drivers/gpu/drm/tiny/bochs.c
index e2d957e51505..b5955ef39e31 100644
--- a/drivers/gpu/drm/tiny/bochs.c
+++ b/drivers/gpu/drm/tiny/bochs.c
@@ -677,7 +677,7 @@ static int bochs_kms_init(struct bochs_device *bochs)
 	drm_connector_attach_edid_property(connector);
 	drm_connector_attach_encoder(connector, encoder);
 
-	ret = drm_vblank_init(dev, 1);
+	ret = drmm_vblank_init(dev, 1, DRM_VBLANK_FLAG_SIMULATED);
 	if (ret)
 		return ret;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/9] drm/amdgpu: vkms: Set DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Mark the vblank event on amdgpu's vkms as simulated, so that the
WAIT_VBLANK ioctl fails with an error. The ioctl should not be
supported because the output is not synchronized to a display refresh.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 170adaf7e76a..bc88acc819a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -413,7 +413,8 @@ static int amdgpu_vkms_sw_init(struct amdgpu_ip_block *ip_block)
 			return r;
 	}
 
-	r = drm_vblank_init(adev_to_drm(adev), adev->mode_info.num_crtc);
+	r = drmm_vblank_init(adev_to_drm(adev), adev->mode_info.num_crtc,
+			     DRM_VBLANK_FLAG_SIMULATED);
 	if (r)
 		return r;
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/9] drm/vblank: Add DRM_VBLANK_FLAG_SIMULATED
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Add DRM_VBLANK_FLAG_SIMULATED for CRTCs that do not have a hardware
vblank interrupt. Setting the flag tells DRM to not report vblank
capabilities from the WAIT_VBLANK ioctl.

DRM_IOCTL_WAIT_VBLANK queries timestamps from a vblank event or waits
for the next vblank event to occur. DRM clients use this functionality
to synchronize their output with the display's vblank phase. Hence this
is only supported for hardware implementations.

Software implementations are not synchronized to the display and merely
act as a rate limiter for page-flip events. The WAIT_VBLANK ioctl thus
should fail with an error.

Suggested-by: Simona Vetter <simona@ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 3 +++
 include/drm/drm_vblank.h     | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 21ca91b4c014..92b699a4e8be 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -1794,6 +1794,9 @@ int drm_wait_vblank_ioctl(struct drm_device *dev, void *data,
 
 	vblank = drm_vblank_crtc(dev, pipe);
 
+	if (vblank->flags & DRM_VBLANK_FLAG_SIMULATED)
+		return -EOPNOTSUPP;
+
 	/* If the counter is currently enabled and accurate, short-circuit
 	 * queries to return the cached timestamp of the last vblank.
 	 */
diff --git a/include/drm/drm_vblank.h b/include/drm/drm_vblank.h
index 39a201b83781..03fa7259b6ac 100644
--- a/include/drm/drm_vblank.h
+++ b/include/drm/drm_vblank.h
@@ -37,6 +37,11 @@ struct drm_device;
 struct drm_crtc;
 struct drm_vblank_work;
 
+/**
+ * DRM_VBLANK_FLAG_SIMULATED - vblank uses a software timer
+ */
+#define DRM_VBLANK_FLAG_SIMULATED	BIT(1)
+
 /**
  * struct drm_pending_vblank_event - pending vblank event tracking
  */
-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/9] drm/vblank: Add drmm_vblank_init() to indicate managed cleanup
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann
In-Reply-To: <20260515120916.333614-1-tzimmermann@suse.de>

Rename drm_vblank_init() to drmm_vblank_init(). As the initializer
function sets up managed cleanup, it should use the drmm prefix. Keep
the old name around until all callers have been converted.

Also add a flags argument to the function. The first use of the flags
will be to distinguish between hardware vblank interrupts and simulated
vblank timeouts.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c        | 16 +++++++++-------
 drivers/gpu/drm/drm_vblank_helper.c |  2 +-
 include/drm/drm_crtc.h              |  2 +-
 include/drm/drm_device.h            |  2 +-
 include/drm/drm_vblank.h            | 10 +++++++++-
 5 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index f90fb2d13e42..21ca91b4c014 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -117,7 +117,7 @@
  * optionally provide a hardware vertical blanking counter.
  *
  * Drivers must initialize the vertical blanking handling core with a call to
- * drm_vblank_init(). Minimally, a driver needs to implement
+ * drmm_vblank_init(). Minimally, a driver needs to implement
  * &drm_crtc_funcs.enable_vblank and &drm_crtc_funcs.disable_vblank plus call
  * drm_crtc_handle_vblank() in its vblank interrupt handler for working vblank
  * support.
@@ -146,7 +146,7 @@
  * See also DRM vblank helpers for more information.
  *
  * Drivers without support for vertical-blanking interrupts nor timers must
- * not call drm_vblank_init(). For these drivers, atomic helpers will
+ * not call drmm_vblank_init(). For these drivers, atomic helpers will
  * automatically generate fake vblank events as part of the display update.
  * This functionality also can be controlled by the driver by enabling and
  * disabling struct drm_crtc_state.no_vblank.
@@ -519,7 +519,7 @@ static void vblank_disable_fn(struct timer_list *t)
 	spin_unlock_irqrestore(&dev->vbl_lock, irqflags);
 }
 
-static void drm_vblank_init_release(struct drm_device *dev, void *ptr)
+static void drmm_vblank_init_release(struct drm_device *dev, void *ptr)
 {
 	struct drm_vblank_crtc *vblank = ptr;
 
@@ -534,9 +534,10 @@ static void drm_vblank_init_release(struct drm_device *dev, void *ptr)
 }
 
 /**
- * drm_vblank_init - initialize vblank support
+ * drmm_vblank_init - initialize vblank support
  * @dev: DRM device
  * @num_crtcs: number of CRTCs supported by @dev
+ * @flags: flags for vblank handling
  *
  * This function initializes vblank support for @num_crtcs display pipelines.
  * Cleanup is handled automatically through a cleanup function added with
@@ -545,7 +546,7 @@ static void drm_vblank_init_release(struct drm_device *dev, void *ptr)
  * Returns:
  * Zero on success or a negative error code on failure.
  */
-int drm_vblank_init(struct drm_device *dev, unsigned int num_crtcs)
+int drmm_vblank_init(struct drm_device *dev, unsigned int num_crtcs, unsigned int flags)
 {
 	int ret;
 	unsigned int i;
@@ -564,11 +565,12 @@ int drm_vblank_init(struct drm_device *dev, unsigned int num_crtcs)
 
 		vblank->dev = dev;
 		vblank->pipe = i;
+		vblank->flags = flags;
 		init_waitqueue_head(&vblank->queue);
 		timer_setup(&vblank->disable_timer, vblank_disable_fn, 0);
 		seqlock_init(&vblank->seqlock);
 
-		ret = drmm_add_action_or_reset(dev, drm_vblank_init_release,
+		ret = drmm_add_action_or_reset(dev, drmm_vblank_init_release,
 					       vblank);
 		if (ret)
 			return ret;
@@ -580,7 +582,7 @@ int drm_vblank_init(struct drm_device *dev, unsigned int num_crtcs)
 
 	return 0;
 }
-EXPORT_SYMBOL(drm_vblank_init);
+EXPORT_SYMBOL(drmm_vblank_init);
 
 /**
  * drm_dev_has_vblank - test if vblanking has been initialized for
diff --git a/drivers/gpu/drm/drm_vblank_helper.c b/drivers/gpu/drm/drm_vblank_helper.c
index d3f8147ecdc1..5b05ab72e133 100644
--- a/drivers/gpu/drm/drm_vblank_helper.c
+++ b/drivers/gpu/drm/drm_vblank_helper.c
@@ -25,7 +25,7 @@
  * for drivers without further requirements. The initializer macro
  * DRM_CRTC_HELPER_VBLANK_FUNCS sets them coveniently.
  *
- * Once the driver enables vblank support with drm_vblank_init(), each
+ * Once the driver enables vblank support with drmm_vblank_init(), each
  * CRTC's vblank timer fires according to the programmed display mode. By
  * default, the vblank timer invokes drm_crtc_handle_vblank(). Drivers with
  * more specific requirements can set their own handler function in
diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h
index c6dbe8b7db9e..f981468d9a00 100644
--- a/include/drm/drm_crtc.h
+++ b/include/drm/drm_crtc.h
@@ -163,7 +163,7 @@ struct drm_crtc_state {
 	 *
 	 * One usage is for drivers and/or hardware without support for VBLANK
 	 * interrupts. Such drivers typically do not initialize vblanking
-	 * (i.e., call drm_vblank_init() with the number of CRTCs). For CRTCs
+	 * (i.e., call drmm_vblank_init() with the number of CRTCs). For CRTCs
 	 * without initialized vblanking, this field is set to true in
 	 * drm_atomic_helper_check_modeset(), and a fake VBLANK event will be
 	 * send out on each update of the display pipeline by
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index bc78fb77cc27..381417c6b6f5 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -283,7 +283,7 @@ struct drm_device {
 	 * Array of vblank tracking structures, one per &struct drm_crtc. For
 	 * historical reasons (vblank support predates kernel modesetting) this
 	 * is free-standing and not part of &struct drm_crtc itself. It must be
-	 * initialized explicitly by calling drm_vblank_init().
+	 * initialized explicitly by calling drmm_vblank_init().
 	 */
 	struct drm_vblank_crtc *vblank;
 
diff --git a/include/drm/drm_vblank.h b/include/drm/drm_vblank.h
index 2fcef9c0f5b1..39a201b83781 100644
--- a/include/drm/drm_vblank.h
+++ b/include/drm/drm_vblank.h
@@ -282,10 +282,12 @@ struct drm_vblank_crtc {
 	 * @vblank_timer: Holds the state of the vblank timer
 	 */
 	struct drm_vblank_crtc_timer vblank_timer;
+
+	unsigned int flags;
 };
 
 struct drm_vblank_crtc *drm_crtc_vblank_crtc(struct drm_crtc *crtc);
-int drm_vblank_init(struct drm_device *dev, unsigned int num_crtcs);
+int drmm_vblank_init(struct drm_device *dev, unsigned int num_crtcs, unsigned int flags);
 bool drm_dev_has_vblank(const struct drm_device *dev);
 u64 drm_crtc_vblank_count(struct drm_crtc *crtc);
 u64 drm_crtc_vblank_count_and_time(struct drm_crtc *crtc,
@@ -321,6 +323,12 @@ int drm_crtc_vblank_start_timer(struct drm_crtc *crtc);
 void drm_crtc_vblank_cancel_timer(struct drm_crtc *crtc);
 void drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time);
 
+/* deprecated; use drmm_vblank_init() instead */
+static inline int drm_vblank_init(struct drm_device *dev, unsigned int num_crtcs)
+{
+	return drmm_vblank_init(dev, num_crtcs, 0);
+}
+
 /*
  * Helpers for struct drm_crtc_funcs
  */
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/9] drm: Limit DRM_IOCTL_WAIT_VBLANK to vblank interrupts
From: Thomas Zimmermann @ 2026-05-15 11:55 UTC (permalink / raw)
  To: simona, airlied, mdaenzer, pekka.paalanen, jadahl, contact,
	maarten.lankhorst, mripard
  Cc: amd-gfx, dri-devel, linux-hyperv, virtualization, spice-devel,
	Thomas Zimmermann

DRM's WAIT_VBLANK ioctl synchronizes user-space clients to display
refresh. This is meaningless with vblank timers, which run unrelated
to the hardware's vblank.

Disable the ioctl for simulated vblanks. Set DRM_VBLANK_FLAG_SIMULATED
for CRTCs with simulated vblank events in all such drivers. The vblank
timers of these devices still rate-limit the number of page-flip events
to match the display refresh.

According to maintainers, user-space compositors do not require the ioctl
for rate-limitting display output. Weston and Kwin rely on page-flip
events. Mutter uses and internal timer to limit the number of display
updates per second.

When testing with mutter and weston, the page-flip rate appears correct
with the patch set applied.

This change has been discussed at length on IRC recently.

https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&highlight_names=&date=2026-05-08&show_html=true
https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&highlight_names=&date=2026-05-12&show_html=true
https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&highlight_names=&date=2026-05-13&show_html=true

Thomas Zimmermann (9):
  drm/vblank: Add drmm_vblank_init() to indicate managed cleanup
  drm/vblank: Add DRM_VBLANK_FLAG_SIMULATED
  drm/amdgpu: vkms: Set DRM_VBLANK_FLAG_SIMULATED
  drm/bochs: Set DRM_VBLANK_FLAG_SIMULATED
  drm/cirrus: Set DRM_VBLANK_FLAG_SIMULATED
  drm/hypervdrm: Set DRM_VBLANK_FLAG_SIMULATED
  drm/qxl: Set DRM_VBLANK_FLAG_SIMULATED
  drm/virtgpu: Set DRM_VBLANK_FLAG_SIMULATED
  drm/vkms: Set DRM_VBLANK_FLAG_SIMULATED

 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c    |  3 ++-
 drivers/gpu/drm/drm_vblank.c                | 19 ++++++++++++-------
 drivers/gpu/drm/drm_vblank_helper.c         |  2 +-
 drivers/gpu/drm/hyperv/hyperv_drm_modeset.c |  2 +-
 drivers/gpu/drm/qxl/qxl_display.c           |  2 +-
 drivers/gpu/drm/tiny/bochs.c                |  2 +-
 drivers/gpu/drm/tiny/cirrus-qemu.c          |  2 +-
 drivers/gpu/drm/virtio/virtgpu_display.c    |  2 +-
 drivers/gpu/drm/vkms/vkms_drv.c             |  4 ++--
 include/drm/drm_crtc.h                      |  2 +-
 include/drm/drm_device.h                    |  2 +-
 include/drm/drm_vblank.h                    | 15 ++++++++++++++-
 12 files changed, 38 insertions(+), 19 deletions(-)


base-commit: 121c16f9d8c56ea07263df84ab971cc10870fe88
-- 
2.54.0


^ permalink raw reply

* [PATCH net-next v10 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.

The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits. The RDMA EQs pass
use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the
capability flag advertises that the driver supports per-vPort EQ
separation when hardware has sufficient vectors.

Populate eq.irq on all RDMA EQs for consistency with the Ethernet path.

Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its
numeric BIT(6) position among the other capability flags.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/infiniband/hw/mana/main.c | 43 +++++++++++++++++++++++++------
 include/net/mana/gdma.h           |  7 +++--
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index f8a9013f0ca3..df465a9078c1 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -764,7 +764,8 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 {
 	struct gdma_context *gc = mdev_to_gc(mdev);
 	struct gdma_queue_spec spec = {};
-	int err, i;
+	struct gdma_irq_context *gic;
+	int err, i, msi;
 
 	spec.type = GDMA_EQ;
 	spec.monitor_avl_buf = false;
@@ -772,11 +773,19 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 	spec.eq.callback = mana_ib_event_handler;
 	spec.eq.context = mdev;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
-	spec.eq.msix_index = 0;
+
+	msi = 0;
+	gic = mana_gd_get_gic(gc, false, &msi);
+	if (!gic)
+		return -ENOMEM;
+	spec.eq.msix_index = msi;
 
 	err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
-	if (err)
+	if (err) {
+		mana_gd_put_gic(gc, false, 0);
 		return err;
+	}
+	mdev->fatal_err_eq->eq.irq = gic->irq;
 
 	mdev->eqs = kzalloc_objs(struct gdma_queue *,
 				 mdev->ib_dev.num_comp_vectors);
@@ -786,32 +795,50 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 	}
 	spec.eq.callback = NULL;
 	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
-		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+		msi = (i + 1) % gc->num_msix_usable;
+
+		gic = mana_gd_get_gic(gc, false, &msi);
+		if (!gic) {
+			err = -ENOMEM;
+			goto destroy_eqs;
+		}
+		spec.eq.msix_index = msi;
+
 		err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
-		if (err)
+		if (err) {
+			mana_gd_put_gic(gc, false, msi);
 			goto destroy_eqs;
+		}
+		mdev->eqs[i]->eq.irq = gic->irq;
 	}
 
 	return 0;
 
 destroy_eqs:
-	while (i-- > 0)
+	while (i-- > 0) {
 		mana_gd_destroy_queue(gc, mdev->eqs[i]);
+		mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+	}
 	kfree(mdev->eqs);
 destroy_fatal_eq:
 	mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+	mana_gd_put_gic(gc, false, 0);
 	return err;
 }
 
 void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
 {
 	struct gdma_context *gc = mdev_to_gc(mdev);
-	int i;
+	int i, msi;
 
 	mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+	mana_gd_put_gic(gc, false, 0);
 
-	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
 		mana_gd_destroy_queue(gc, mdev->eqs[i]);
+		msi = (i + 1) % gc->num_msix_usable;
+		mana_gd_put_gic(gc, false, msi);
+	}
 
 	kfree(mdev->eqs);
 }
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6c138cc77407..d84e474309a3 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -615,6 +615,7 @@ enum {
 #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
 #define GDMA_DRV_CAP_FLAG_1_GDMA_PAGES_4MB_1GB_2GB BIT(4)
 #define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
+#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
 
 /* Driver can handle holes (zeros) in the device list */
 #define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
@@ -631,7 +632,8 @@ enum {
 /* Driver detects stalled send queues and recovers them */
 #define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
 
-#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
 
 /* Driver supports linearizing the skb when num_sge exceeds hardware limit */
 #define GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE BIT(20)
@@ -659,7 +661,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
+	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.

The caller now owns the GIC reference across the EQ create/destroy
lifecycle: mana_create_eq() calls mana_gd_get_gic() before creating
each EQ and mana_destroy_eq() calls mana_gd_put_gic() after destroying
it. The msix_index invalidation is moved from mana_gd_deregister_irq()
to the mana_gd_create_eq() error path so that mana_destroy_eq() can
read the index before teardown.

Signed-off-by: Long Li <longli@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c  | 18 +++++++++++++++++-
 include/net/mana/gdma.h                        |  1 +
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 265cf32a980f..6ab7e441bd9f 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -897,7 +897,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
 	}
 	spin_unlock_irqrestore(&gic->lock, flags);
 
-	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 	synchronize_rcu();
 }
 
@@ -1012,6 +1011,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 out:
 	dev_err(dev, "Failed to create EQ: %d\n", err);
 	mana_gd_destroy_eq(gc, false, queue);
+	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 	return err;
 }
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d23b856f48f6..8020089de50f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1626,6 +1626,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
 	struct mana_context *ac = apc->ac;
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_queue *eq;
+	unsigned int msi;
 	int i;
 
 	if (!apc->eqs)
@@ -1639,7 +1640,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
 		if (!eq)
 			continue;
 
+		msi = eq->eq.msix_index;
 		mana_gd_destroy_queue(gc, eq);
+		mana_gd_put_gic(gc, !gc->msi_sharing, msi);
 	}
 
 	kfree(apc->eqs);
@@ -1656,6 +1659,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
 	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
 	debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
 	debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+	debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
 	debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
 }
 
@@ -1664,7 +1668,9 @@ int mana_create_eq(struct mana_port_context *apc)
 	struct gdma_dev *gd = apc->ac->gdma_dev;
 	struct gdma_context *gc = gd->gdma_context;
 	struct gdma_queue_spec spec = {};
+	struct gdma_irq_context *gic;
 	int err;
+	int msi;
 	int i;
 
 	if (WARN_ON(apc->eqs))
@@ -1684,12 +1690,22 @@ int mana_create_eq(struct mana_port_context *apc)
 		debugfs_create_dir("EQs", apc->mana_port_debugfs);
 
 	for (i = 0; i < apc->num_queues; i++) {
-		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+		msi = (i + 1) % gc->num_msix_usable;
+
+		gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);
+		if (!gic) {
+			err = -ENOMEM;
+			goto out;
+		}
+		spec.eq.msix_index = msi;
+
 		err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
 		if (err) {
 			dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+			mana_gd_put_gic(gc, !gc->msi_sharing, msi);
 			goto out;
 		}
+		apc->eqs[i].eq->eq.irq = gic->irq;
 		mana_create_eq_debugfs(apc, i);
 	}
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index fbe3c1427b45..6c138cc77407 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -342,6 +342,7 @@ struct gdma_queue {
 			void *context;
 
 			unsigned int msix_index;
+			unsigned int irq;
 
 			u32 log2_throttle_limit;
 		} eq;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 4/6] net: mana: Use GIC functions to allocate global EQs
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.

This changes the per-queue interrupt names in /proc/interrupts from
mana_q0, mana_q1, ... to mana_msi1, mana_msi2, ... to reflect the
MSI-X index rather than a zero-based queue number. The HWC interrupt
name (mana_hwc) is unchanged.

Signed-off-by: Long Li <longli@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 96 +++----------------
 1 file changed, 13 insertions(+), 83 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index ffe293b561ff..265cf32a980f 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1918,7 +1918,7 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
 	bool skip_first_cpu = false;
-	int *irqs, irq, err, i;
+	int *irqs, err, i;
 
 	irqs = kmalloc_objs(int, nvec);
 	if (!irqs)
@@ -1931,30 +1931,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * further used in irq_setup()
 	 */
 	for (i = 1; i <= nvec; i++) {
-		gic = kzalloc_obj(*gic);
+		gic = mana_gd_get_gic(gc, false, &i);
 		if (!gic) {
 			err = -ENOMEM;
 			goto free_irq;
 		}
-		gic->handler = mana_gd_process_eq_events;
-		INIT_LIST_HEAD(&gic->eq_list);
-		spin_lock_init(&gic->lock);
-
-		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
-			 i - 1, pci_name(pdev));
-
-		/* one pci vector is already allocated for HWC */
-		irqs[i - 1] = pci_irq_vector(pdev, i);
-		if (irqs[i - 1] < 0) {
-			err = irqs[i - 1];
-			goto free_current_gic;
-		}
-
-		err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
-		if (err)
-			goto free_current_gic;
 
-		xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+		irqs[i - 1] = gic->irq;
 	}
 
 	/*
@@ -1976,20 +1959,9 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	kfree(irqs);
 	return 0;
 
-free_current_gic:
-	kfree(gic);
 free_irq:
-	for (i -= 1; i > 0; i--) {
-		irq = pci_irq_vector(pdev, i);
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
-			continue;
-
-		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
-	}
+	for (i -= 1; i > 0; i--)
+		mana_gd_put_gic(gc, false, i);
 	kfree(irqs);
 	return err;
 }
@@ -1998,7 +1970,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	int *irqs, *start_irqs, irq;
+	int *irqs, *start_irqs;
 	unsigned int cpu;
 	int err, i;
 
@@ -2009,34 +1981,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 	start_irqs = irqs;
 
 	for (i = 0; i < nvec; i++) {
-		gic = kzalloc_obj(*gic);
+		gic = mana_gd_get_gic(gc, false, &i);
 		if (!gic) {
 			err = -ENOMEM;
 			goto free_irq;
 		}
 
-		gic->handler = mana_gd_process_eq_events;
-		INIT_LIST_HEAD(&gic->eq_list);
-		spin_lock_init(&gic->lock);
-
-		if (!i)
-			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
-				 pci_name(pdev));
-		else
-			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
-				 i - 1, pci_name(pdev));
-
-		irqs[i] = pci_irq_vector(pdev, i);
-		if (irqs[i] < 0) {
-			err = irqs[i];
-			goto free_current_gic;
-		}
-
-		err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
-		if (err)
-			goto free_current_gic;
-
-		xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+		irqs[i] = gic->irq;
 	}
 
 	/* If number of IRQ is one extra than number of online CPUs,
@@ -2065,20 +2016,9 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 	kfree(start_irqs);
 	return 0;
 
-free_current_gic:
-	kfree(gic);
 free_irq:
-	for (i -= 1; i >= 0; i--) {
-		irq = pci_irq_vector(pdev, i);
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
-			continue;
-
-		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
-	}
+	for (i -= 1; i >= 0; i--)
+		mana_gd_put_gic(gc, false, i);
 
 	kfree(start_irqs);
 	return err;
@@ -2152,26 +2092,16 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
 static void mana_gd_remove_irqs(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
-	struct gdma_irq_context *gic;
-	int irq, i;
+	int i;
 
 	if (gc->max_num_msix < 1)
 		return;
 
 	for (i = 0; i < gc->max_num_msix; i++) {
-		irq = pci_irq_vector(pdev, i);
-		if (irq < 0)
-			continue;
-
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
+		if (!xa_load(&gc->irq_contexts, i))
 			continue;
 
-		/* Need to clear the hint before free_irq */
-		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
+		mana_gd_put_gic(gc, false, i);
 	}
 
 	pci_free_irq_vectors(pdev);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.

Signed-off-by: Long Li <longli@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 159 ++++++++++++++++++
 include/net/mana/gdma.h                       |  12 ++
 2 files changed, 171 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 13feea93d3c2..ffe293b561ff 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1658,6 +1658,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+	struct pci_dev *dev = to_pci_dev(gc->dev);
+	struct gdma_irq_context *gic;
+	struct msi_map irq_map;
+	int irq;
+
+	mutex_lock(&gc->gic_mutex);
+
+	gic = xa_load(&gc->irq_contexts, msi);
+	if (WARN_ON(!gic)) {
+		mutex_unlock(&gc->gic_mutex);
+		return;
+	}
+
+	if (use_msi_bitmap)
+		gic->bitmap_refs--;
+
+	if (use_msi_bitmap && gic->bitmap_refs == 0)
+		clear_bit(msi, gc->msi_bitmap);
+
+	if (!refcount_dec_and_test(&gic->refcount))
+		goto out;
+
+	irq = gic->irq;
+
+	irq_update_affinity_hint(irq, NULL);
+	free_irq(irq, gic);
+
+	if (gic->dyn_msix) {
+		irq_map.virq = irq;
+		irq_map.index = msi;
+		pci_msix_free_irq(dev, irq_map);
+	}
+
+	xa_erase(&gc->irq_contexts, msi);
+	kfree(gic);
+
+out:
+	mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on a MSI vector
+ * a MSI can be shared between different EQs, this function supports setting
+ * up separate MSIs using a bitmap, or directly using the MSI index
+ *
+ * @use_msi_bitmap:
+ * True if MSI is assigned by this function on available slots from bitmap.
+ * False if MSI is passed from *msi_requested
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+					 bool use_msi_bitmap,
+					 int *msi_requested)
+{
+	struct pci_dev *dev = to_pci_dev(gc->dev);
+	struct gdma_irq_context *gic;
+	struct msi_map irq_map = { };
+	int irq;
+	int msi;
+	int err;
+
+	mutex_lock(&gc->gic_mutex);
+
+	if (use_msi_bitmap) {
+		msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+		if (msi >= gc->num_msix_usable) {
+			dev_err(gc->dev, "No free MSI vectors available\n");
+			gic = NULL;
+			goto out;
+		}
+		*msi_requested = msi;
+	} else {
+		msi = *msi_requested;
+	}
+
+	gic = xa_load(&gc->irq_contexts, msi);
+	if (gic) {
+		refcount_inc(&gic->refcount);
+		if (use_msi_bitmap) {
+			gic->bitmap_refs++;
+			set_bit(msi, gc->msi_bitmap);
+		}
+		goto out;
+	}
+
+	irq = pci_irq_vector(dev, msi);
+	if (irq == -EINVAL) {
+		irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+		if (!irq_map.virq) {
+			err = irq_map.index;
+			dev_err(gc->dev,
+				"Failed to alloc irq_map msi %d err %d\n",
+				msi, err);
+			gic = NULL;
+			goto out;
+		}
+		irq = irq_map.virq;
+		msi = irq_map.index;
+	}
+
+	gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+	if (!gic) {
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	gic->handler = mana_gd_process_eq_events;
+	gic->msi = msi;
+	gic->irq = irq;
+	INIT_LIST_HEAD(&gic->eq_list);
+	spin_lock_init(&gic->lock);
+
+	if (!gic->msi)
+		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+			 pci_name(dev));
+	else
+		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+			 gic->msi, pci_name(dev));
+
+	err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+	if (err) {
+		dev_err(gc->dev, "Failed to request irq %d %s\n",
+			irq, gic->name);
+		kfree(gic);
+		gic = NULL;
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	gic->dyn_msix = !!irq_map.virq;
+	refcount_set(&gic->refcount, 1);
+	gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+	err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+	if (err) {
+		dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+			msi, err);
+		free_irq(irq, gic);
+		kfree(gic);
+		gic = NULL;
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	if (use_msi_bitmap)
+		set_bit(msi, gc->msi_bitmap);
+
+out:
+	mutex_unlock(&gc->gic_mutex);
+	return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
 int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
 {
 	r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2147,6 +2305,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto release_region;
 
 	mutex_init(&gc->eq_test_event_mutex);
+	mutex_init(&gc->gic_mutex);
 	pci_set_drvdata(pdev, gc);
 	gc->bar0_pa = pci_resource_start(pdev, 0);
 	gc->bar0_size = pci_resource_len(pdev, 0);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 9c05b1e15c3e..fbe3c1427b45 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -388,6 +388,11 @@ struct gdma_irq_context {
 	spinlock_t lock;
 	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
+	unsigned int msi;
+	unsigned int irq;
+	refcount_t refcount;
+	unsigned int bitmap_refs;
+	bool dyn_msix;
 };
 
 enum gdma_context_flags {
@@ -449,6 +454,9 @@ struct gdma_context {
 
 	unsigned long		flags;
 
+	/* Protect access to GIC context */
+	struct mutex		gic_mutex;
+
 	/* Indicate if this device is sharing MSI for EQs on MANA */
 	bool msi_sharing;
 
@@ -1026,6 +1034,10 @@ int mana_gd_resume(struct pci_dev *pdev);
 
 bool mana_need_log(struct gdma_context *gc, int err);
 
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+					 bool use_msi_bitmap,
+					 int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
 int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
 			     u32 proto_minor_ver, u32 proto_micro_ver,
 			     u16 *max_num_vports, u8 *bm_hostmode);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The per-vPort queue count is
clamped towards MANA_DEF_NUM_QUEUES but will not exceed the hardware
maximum reported by the device.

MSI-X sharing among vPorts is enabled when there are not enough MSI-X
vectors for dedicated allocation, or when the platform does not support
dynamic MSI-X allocation (in which case all vectors are pre-allocated
at probe time and sharing is always used). The msi_sharing flag is
reset at the top of mana_gd_query_max_resources() so it is recomputed
from current hardware state on each probe or resume cycle.

A device reporting zero ports now results in a fatal probe error since
the per-vPort MSI-X math requires at least one port.

Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.

Signed-off-by: Long Li <longli@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 69 ++++++++++++++++++-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 40 +++++++----
 include/net/mana/gdma.h                       | 13 +++-
 3 files changed, 103 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 3bc3fff55999..13feea93d3c2 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -179,8 +179,21 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_query_max_resources_resp resp = {};
 	struct gdma_general_req req = {};
+	unsigned int max_num_queues;
+	u8 bm_hostmode;
+	u16 num_ports;
 	int err;
 
+	/* Reset msi_sharing so it is recomputed from current hardware
+	 * state. On resume, num_online_cpus() or num_msix_usable may
+	 * have changed, making dedicated MSI-X feasible where it was
+	 * not before. Only reset on platforms that support dynamic
+	 * MSI-X allocation; on non-dyn platforms msi_sharing is
+	 * unconditionally true (set in mana_gd_setup_hwc_irqs).
+	 */
+	if (pci_msix_can_alloc_dyn(to_pci_dev(gc->dev)))
+		gc->msi_sharing = false;
+
 	mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
 			     sizeof(req), sizeof(resp));
 
@@ -227,6 +240,43 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues == 0)
 		return -ENOSPC;
 
+	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+				       MANA_MINOR_VERSION,
+				       MANA_MICRO_VERSION,
+				       &num_ports, &bm_hostmode);
+	if (err)
+		return err;
+
+	if (!num_ports)
+		return -EINVAL;
+
+	/*
+	 * Adjust the per-vPort max queue count to allow dedicated
+	 * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
+	 */
+	max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+	max_num_queues = rounddown_pow_of_two(max(max_num_queues, 1U));
+	if (max_num_queues < MANA_DEF_NUM_QUEUES)
+		max_num_queues = MANA_DEF_NUM_QUEUES;
+
+	/*
+	 * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
+	 * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1).
+	 */
+	max_num_queues = min(gc->max_num_queues, max_num_queues);
+	if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+		gc->msi_sharing = true;
+
+	/* If MSI is shared, use max allowed value */
+	if (gc->msi_sharing)
+		gc->max_num_queues_vport = min(gc->num_msix_usable - 1,
+					       gc->max_num_queues);
+	else
+		gc->max_num_queues_vport = max_num_queues;
+
+	dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+		 gc->msi_sharing, gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1889,6 +1939,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
 		/* Need 1 interrupt for HWC */
 		max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
 		min_irqs = 2;
+		gc->msi_sharing = true;
 	}
 
 	nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1967,6 +2018,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
 
 	pci_free_irq_vectors(pdev);
 
+	bitmap_free(gc->msi_bitmap);
+	gc->msi_bitmap = NULL;
 	gc->max_num_msix = 0;
 	gc->num_msix_usable = 0;
 }
@@ -2001,6 +2054,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	if (err)
 		goto destroy_hwc;
 
+	err = mana_gd_detect_devices(pdev);
+	if (err)
+		goto destroy_hwc;
+
 	err = mana_gd_query_max_resources(pdev);
 	if (err)
 		goto destroy_hwc;
@@ -2011,9 +2068,15 @@ static int mana_gd_setup(struct pci_dev *pdev)
 		goto destroy_hwc;
 	}
 
-	err = mana_gd_detect_devices(pdev);
-	if (err)
-		goto destroy_hwc;
+	if (!gc->msi_sharing) {
+		gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+		if (!gc->msi_bitmap) {
+			err = -ENOMEM;
+			goto destroy_hwc;
+		}
+		/* Set bit for HWC */
+		set_bit(0, gc->msi_bitmap);
+	}
 
 	dev_dbg(&pdev->dev, "mana gdma setup successful\n");
 	return 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 18f8f653da3d..d23b856f48f6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1007,10 +1007,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
 	return !apc->rxqs ? -ENOMEM : 0;
 }
 
-static int mana_send_request(struct mana_context *ac, void *in_buf,
-			     u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+				  u32 in_len, void *out_buf, u32 out_len)
 {
-	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_resp_hdr *resp = out_buf;
 	struct gdma_req_hdr *req = in_buf;
 	struct device *dev = gc->dev;
@@ -1044,6 +1043,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
 	return 0;
 }
 
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+			     u32 in_len, void *out_buf, u32 out_len)
+{
+	struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+	return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
 static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
 				const enum mana_command_code expected_code,
 				const u32 min_size)
@@ -1177,11 +1184,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
 			   err, resp.hdr.status);
 }
 
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
-				 u32 proto_minor_ver, u32 proto_micro_ver,
-				 u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+			     u32 proto_minor_ver, u32 proto_micro_ver,
+			     u16 *max_num_vports, u8 *bm_hostmode)
 {
-	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct mana_query_device_cfg_resp resp = {};
 	struct mana_query_device_cfg_req req = {};
 	struct device *dev = gc->dev;
@@ -1196,7 +1202,8 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
 	req.proto_minor_ver = proto_minor_ver;
 	req.proto_micro_ver = proto_micro_ver;
 
-	err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+	err = gdma_mana_send_request(gc, &req, sizeof(req),
+				     &resp, sizeof(resp));
 	if (err) {
 		dev_err(dev, "Failed to query config: %d", err);
 		return err;
@@ -1230,8 +1237,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
 	else
 		*bm_hostmode = 0;
 
-	debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
 	return 0;
 }
 
@@ -3417,7 +3422,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	int err;
 
 	ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
-				 gc->max_num_queues);
+				 gc->max_num_queues_vport);
 	if (!ndev)
 		return -ENOMEM;
 
@@ -3426,9 +3431,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc = netdev_priv(ndev);
 	apc->ac = ac;
 	apc->ndev = ndev;
-	apc->max_queues = gc->max_num_queues;
+	apc->max_queues = gc->max_num_queues_vport;
 	/* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
-	apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
+	apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
 	apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
 	apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
 	apc->port_handle = INVALID_MANA_HANDLE;
@@ -3692,13 +3697,18 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
 
-	err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
-				    MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+				       MANA_MINOR_VERSION,
+				       MANA_MICRO_VERSION,
+				       &num_ports, &bm_hostmode);
 	if (err)
 		goto out;
 
 	ac->bm_hostmode = bm_hostmode;
 
+	debugfs_create_u16("adapter-MTU", 0400,
+			   gc->mana_pci_debugfs, &gc->adapter_mtu);
+
 	if (!resuming) {
 		ac->num_ports = num_ports;
 	} else {
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6d836060976a..9c05b1e15c3e 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -399,8 +399,10 @@ struct gdma_context {
 	struct device		*dev;
 	struct dentry		*mana_pci_debugfs;
 
-	/* Per-vPort max number of queues */
+	/* Hardware max number of queues */
 	unsigned int		max_num_queues;
+	/* Per-vPort max number of queues */
+	unsigned int		max_num_queues_vport;
 	unsigned int		max_num_msix;
 	unsigned int		num_msix_usable;
 	struct xarray		irq_contexts;
@@ -446,6 +448,12 @@ struct gdma_context {
 	struct workqueue_struct *service_wq;
 
 	unsigned long		flags;
+
+	/* Indicate if this device is sharing MSI for EQs on MANA */
+	bool msi_sharing;
+
+	/* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+	unsigned long *msi_bitmap;
 };
 
 static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -1018,4 +1026,7 @@ int mana_gd_resume(struct pci_dev *pdev);
 
 bool mana_need_log(struct gdma_context *gc, int err);
 
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+			     u32 proto_minor_ver, u32 proto_micro_ver,
+			     u16 *max_num_vports, u8 *bm_hostmode);
 #endif /* _GDMA_H */
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-1-longli@microsoft.com>

To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.

Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.

RSS QPs now take a vport reference via pd->vport_use_count to ensure
EQs outlive all QP consumers. The vport must already be configured by
a raw QP before an RSS QP can be created. EQs are only destroyed when
the last QP (raw or RSS) on the PD releases its reference.

Reject cross-port PD sharing for raw QPs. Since EQs and vport
configuration are per-port, a PD is bound to the port used by its
first raw QP. Subsequent raw QPs on the same PD must use the same
port or the creation fails with -EINVAL.

Serialize mana_set_channels() against RDMA vport configuration to
prevent num_queues from changing while RDMA holds EQs sized to the
current value. When the port is down, apc->vport_mutex is held for
the entire operation since mana_detach()/mana_attach() do not take
vport_mutex in that case. When the port is up, Ethernet owns the
vport exclusively so no additional locking is needed.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/infiniband/hw/mana/main.c             |  40 ++++--
 drivers/infiniband/hw/mana/mana_ib.h          |   8 ++
 drivers/infiniband/hw/mana/qp.c               |  37 +++++-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 117 +++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  23 +++-
 include/net/mana/mana.h                       |  15 ++-
 6 files changed, 174 insertions(+), 66 deletions(-)

diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index ac5e75dd3494..f8a9013f0ca3 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
 	pd->vport_use_count--;
 	WARN_ON(pd->vport_use_count < 0);
 
-	if (!pd->vport_use_count)
+	if (!pd->vport_use_count) {
+		mana_destroy_eq(mpc);
 		mana_uncfg_vport(mpc);
+	}
 
 	mutex_unlock(&pd->vport_mutex);
 }
@@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
 
 	pd->vport_use_count++;
 	if (pd->vport_use_count > 1) {
+		/* Reject cross-port PD sharing. EQs and vport config
+		 * are per-port, so the PD must stay bound to the port
+		 * that was configured on the first raw QP creation.
+		 */
+		if (pd->vport_port != port) {
+			pd->vport_use_count--;
+			mutex_unlock(&pd->vport_mutex);
+			ibdev_dbg(&dev->ib_dev,
+				  "PD already bound to port %u\n",
+				  pd->vport_port);
+			return -EINVAL;
+		}
 		ibdev_dbg(&dev->ib_dev,
 			  "Skip as this PD is already configured vport\n");
 		mutex_unlock(&pd->vport_mutex);
 		return 0;
 	}
 
-	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
+	pd->vport_port = port;
+
+	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
 	if (err) {
 		pd->vport_use_count--;
 		mutex_unlock(&pd->vport_mutex);
@@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
 		return err;
 	}
 
-	mutex_unlock(&pd->vport_mutex);
 
-	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
-	pd->tx_vp_offset = mpc->tx_vp_offset;
+	err = mana_create_eq(mpc);
+	if (err) {
+		mana_uncfg_vport(mpc);
+		pd->vport_use_count--;
+	} else {
+		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
+		pd->tx_vp_offset = mpc->tx_vp_offset;
+	}
+
+	mutex_unlock(&pd->vport_mutex);
 
-	ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
-		  mpc->port_handle, pd->pdn, doorbell_id);
+	if (!err)
+		ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
+			  mpc->port_handle, pd->pdn, doorbell_id);
 
-	return 0;
+	return err;
 }
 
 int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index c9c94e86a72b..916f03fb96dd 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -102,6 +102,14 @@ struct mana_ib_pd {
 	struct mutex vport_mutex;
 	int vport_use_count;
 
+	/* Port bound to this PD for raw QP usage. A PD can only be
+	 * associated with a single physical port because per-port EQs
+	 * and vport configuration are tied to the PD's refcount.
+	 * Set on the first raw QP creation; subsequent QPs on the
+	 * same PD must use the same port or get -EINVAL.
+	 */
+	u32 vport_port;
+
 	bool tx_shortform_allowed;
 	u32 tx_vp_offset;
 };
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 0fbcf449c134..108ec4c5ce51 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -79,6 +79,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 				 struct ib_qp_init_attr *attr,
 				 struct ib_udata *udata)
 {
+	struct mana_ib_pd *mana_pd = container_of(pd, struct mana_ib_pd, ibpd);
 	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
 	struct mana_ib_dev *mdev =
 		container_of(pd->device, struct mana_ib_dev, ib_dev);
@@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 
 	qp->port = port;
 
+	/* Take a reference on the vport to ensure EQs outlive this QP.
+	 * The vport must already be configured by a raw QP.
+	 */
+	mutex_lock(&mana_pd->vport_mutex);
+	if (!mana_pd->vport_use_count) {
+		mutex_unlock(&mana_pd->vport_mutex);
+		ret = -EINVAL;
+		goto fail;
+	}
+	mana_pd->vport_use_count++;
+	mutex_unlock(&mana_pd->vport_mutex);
+
 	for (i = 0; i < ind_tbl_size; i++) {
 		struct mana_obj_spec wq_spec = {};
 		struct mana_obj_spec cq_spec = {};
@@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		cq_spec.gdma_region = cq->queue.gdma_region;
 		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
 		cq_spec.modr_ctx_id = 0;
-		eq = &mpc->ac->eqs[cq->comp_vector];
+		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
 		cq_spec.attached_eq = eq->eq->id;
 
 		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
 					 &wq_spec, &cq_spec, &wq->rx_object);
 		if (ret)
-			goto fail;
+			goto free_vport;
 
 		/* The GDMA regions are now owned by the WQ object */
 		wq->queue.gdma_region = GDMA_INVALID_DMA_REGION;
@@ -199,7 +212,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		ret = mana_ib_install_cq_cb(mdev, cq);
 		if (ret) {
 			mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
-			goto fail;
+			goto free_vport;
 		}
 	}
 	resp.num_entries = i;
@@ -210,7 +223,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 					 ucmd.rx_hash_key_len,
 					 ucmd.rx_hash_key);
 	if (ret)
-		goto fail;
+		goto free_vport;
 
 	ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
 	if (ret) {
@@ -226,7 +239,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 
 err_disable_vport_rx:
 	mana_disable_vport_rx(mpc);
-fail:
+free_vport:
 	while (i-- > 0) {
 		ibwq = ind_tbl->ind_tbl[i];
 		ibcq = ibwq->cq;
@@ -237,6 +250,9 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
 	}
 
+	mana_ib_uncfg_vport(mdev, mana_pd, port);
+
+fail:
 	kfree(mana_ind_table);
 
 	return ret;
@@ -321,7 +337,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
 	cq_spec.modr_ctx_id = 0;
 	eq_vec = send_cq->comp_vector;
-	eq = &mpc->ac->eqs[eq_vec];
+	if (!mpc->eqs) {
+		err = -EINVAL;
+		goto err_destroy_queue;
+	}
+	eq = &mpc->eqs[eq_vec % mpc->num_queues];
 	cq_spec.attached_eq = eq->eq->id;
 
 	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
@@ -785,14 +805,17 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
 {
 	struct mana_ib_dev *mdev =
 		container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+	struct ib_pd *ibpd = qp->ibqp.pd;
 	struct mana_port_context *mpc;
 	struct net_device *ndev;
+	struct mana_ib_pd *pd;
 	struct mana_ib_wq *wq;
 	struct ib_wq *ibwq;
 	int i;
 
 	ndev = mana_ib_get_netdev(qp->ibqp.device, qp->port);
 	mpc = netdev_priv(ndev);
+	pd = container_of(ibpd, struct mana_ib_pd, ibpd);
 
 	/* Disable vPort RX steering before destroying RX WQ objects.
 	 * Otherwise firmware still routes traffic to the destroyed queues,
@@ -817,6 +840,8 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
 		mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
 	}
 
+	mana_ib_uncfg_vport(mdev, pd, qp->port);
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b2faa7cf398f..18f8f653da3d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1295,7 +1295,7 @@ void mana_uncfg_vport(struct mana_port_context *apc)
 EXPORT_SYMBOL_NS(mana_uncfg_vport, "NET_MANA");
 
 int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
-		   u32 doorbell_pg_id)
+		   u32 doorbell_pg_id, bool check_channel_changing)
 {
 	struct mana_config_vport_resp resp = {};
 	struct mana_config_vport_req req = {};
@@ -1320,7 +1320,8 @@ int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
 	 * Ethernet usage on the same port.
 	 */
 	mutex_lock(&apc->vport_mutex);
-	if (apc->vport_use_count > 0) {
+	if (apc->vport_use_count > 0 ||
+	    (check_channel_changing && apc->channel_changing)) {
 		mutex_unlock(&apc->vport_mutex);
 		return -EBUSY;
 	}
@@ -1615,78 +1616,84 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
 }
 EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
 
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
 {
+	struct mana_context *ac = apc->ac;
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_queue *eq;
 	int i;
 
-	if (!ac->eqs)
+	if (!apc->eqs)
 		return;
 
-	debugfs_remove_recursive(ac->mana_eqs_debugfs);
-	ac->mana_eqs_debugfs = NULL;
+	debugfs_remove_recursive(apc->mana_eqs_debugfs);
+	apc->mana_eqs_debugfs = NULL;
 
-	for (i = 0; i < gc->max_num_queues; i++) {
-		eq = ac->eqs[i].eq;
+	for (i = 0; i < apc->num_queues; i++) {
+		eq = apc->eqs[i].eq;
 		if (!eq)
 			continue;
 
 		mana_gd_destroy_queue(gc, eq);
 	}
 
-	kfree(ac->eqs);
-	ac->eqs = NULL;
+	kfree(apc->eqs);
+	apc->eqs = NULL;
 }
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
 
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
 {
-	struct mana_eq eq = ac->eqs[i];
+	struct mana_eq eq = apc->eqs[i];
 	char eqnum[32];
 
 	sprintf(eqnum, "eq%d", i);
-	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
 	debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
 	debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
 	debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
 }
 
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
 {
-	struct gdma_dev *gd = ac->gdma_dev;
+	struct gdma_dev *gd = apc->ac->gdma_dev;
 	struct gdma_context *gc = gd->gdma_context;
 	struct gdma_queue_spec spec = {};
 	int err;
 	int i;
 
-	ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
-	if (!ac->eqs)
+	if (WARN_ON(apc->eqs))
+		return -EEXIST;
+	apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
+	if (!apc->eqs)
 		return -ENOMEM;
 
 	spec.type = GDMA_EQ;
 	spec.monitor_avl_buf = false;
 	spec.queue_size = EQ_SIZE;
 	spec.eq.callback = NULL;
-	spec.eq.context = ac->eqs;
+	spec.eq.context = apc->eqs;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
 
-	ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+	apc->mana_eqs_debugfs =
+		debugfs_create_dir("EQs", apc->mana_port_debugfs);
 
-	for (i = 0; i < gc->max_num_queues; i++) {
+	for (i = 0; i < apc->num_queues; i++) {
 		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
-		err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+		err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
 		if (err) {
 			dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
 			goto out;
 		}
-		mana_create_eq_debugfs(ac, i);
+		mana_create_eq_debugfs(apc, i);
 	}
 
 	return 0;
 out:
-	mana_destroy_eq(ac);
+	mana_destroy_eq(apc);
 	return err;
 }
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
 
 static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
 {
@@ -2451,7 +2458,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 		spec.monitor_avl_buf = false;
 		spec.queue_size = cq_size;
 		spec.cq.callback = mana_schedule_napi;
-		spec.cq.parent_eq = ac->eqs[i].eq;
+		spec.cq.parent_eq = apc->eqs[i].eq;
 		spec.cq.context = cq;
 		err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
 		if (err)
@@ -2844,13 +2851,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
 static int mana_add_rx_queues(struct mana_port_context *apc,
 			      struct net_device *ndev)
 {
-	struct mana_context *ac = apc->ac;
 	struct mana_rxq *rxq;
 	int err = 0;
 	int i;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+		rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
 		if (!rxq) {
 			err = -ENOMEM;
 			netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2869,9 +2875,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
 	return err;
 }
 
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
 {
-	struct gdma_dev *gd = apc->ac->gdma_dev;
 	struct mana_rxq *rxq;
 	u32 rxq_idx;
 
@@ -2883,8 +2888,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
 		mana_destroy_rxq(apc, rxq, true);
 		apc->rxqs[rxq_idx] = NULL;
 	}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+	struct gdma_dev *gd = apc->ac->gdma_dev;
 
-	mana_destroy_txq(apc);
 	mana_uncfg_vport(apc);
 
 	if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2905,11 +2914,7 @@ static int mana_create_vport(struct mana_port_context *apc,
 			return err;
 	}
 
-	err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
-	if (err)
-		return err;
-
-	return mana_create_txq(apc, net);
+	return mana_cfg_vport(apc, gd->pdid, gd->doorbell, false);
 }
 
 static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3195,21 +3200,36 @@ int mana_alloc_queues(struct net_device *ndev)
 
 	err = mana_create_vport(apc, ndev);
 	if (err) {
-		netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+		netdev_err(ndev, "Failed to create vPort %u : %d\n",
+			   apc->port_idx, err);
 		return err;
 	}
 
+	err = mana_create_eq(apc);
+	if (err) {
+		netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+			   apc->port_idx, err);
+		goto destroy_vport;
+	}
+
+	err = mana_create_txq(apc, ndev);
+	if (err) {
+		netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+			   apc->port_idx, err);
+		goto destroy_eq;
+	}
+
 	err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
 	if (err) {
 		netdev_err(ndev,
 			   "netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
 			   apc->num_queues, err);
-		goto destroy_vport;
+		goto destroy_txq;
 	}
 
 	err = mana_add_rx_queues(apc, ndev);
 	if (err)
-		goto destroy_vport;
+		goto destroy_rxq;
 
 	apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
 
@@ -3218,7 +3238,7 @@ int mana_alloc_queues(struct net_device *ndev)
 		netdev_err(ndev,
 			   "netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
 			   apc->num_queues, err);
-		goto destroy_vport;
+		goto destroy_rxq;
 	}
 
 	mana_rss_table_init(apc);
@@ -3226,19 +3246,25 @@ int mana_alloc_queues(struct net_device *ndev)
 	err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
 	if (err) {
 		netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
-		goto destroy_vport;
+		goto destroy_rxq;
 	}
 
 	if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
 		err = mana_pf_register_filter(apc);
 		if (err)
-			goto destroy_vport;
+			goto destroy_rxq;
 	}
 
 	mana_chn_setxdp(apc, mana_xdp_get(apc));
 
 	return 0;
 
+destroy_rxq:
+	mana_destroy_rxqs(apc);
+destroy_txq:
+	mana_destroy_txq(apc);
+destroy_eq:
+	mana_destroy_eq(apc);
 destroy_vport:
 	mana_destroy_vport(apc);
 	return err;
@@ -3343,6 +3369,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	mana_fence_rqs(apc);
 
 	/* Even in err case, still need to cleanup the vPort */
+	mana_destroy_rxqs(apc);
+	mana_destroy_txq(apc);
+	mana_destroy_eq(apc);
 	mana_destroy_vport(apc);
 
 	return 0;
@@ -3663,12 +3692,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
 
-	err = mana_create_eq(ac);
-	if (err) {
-		dev_err(dev, "Failed to create EQs: %d\n", err);
-		goto out;
-	}
-
 	err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
 				    MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
 	if (err)
@@ -3808,8 +3831,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		free_netdev(ndev);
 	}
 
-	mana_destroy_eq(ac);
-
 	if (ac->per_port_queue_reset_wq) {
 		destroy_workqueue(ac->per_port_queue_reset_wq);
 		ac->per_port_queue_reset_wq = NULL;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 04350973e19e..4633acc976f0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -454,6 +454,11 @@ static int mana_set_coalesce(struct net_device *ndev,
 	return err;
 }
 
+/* mana_set_channels - change the number of queues on a port
+ *
+ * Returns -EBUSY if RDMA holds the vport with EQs sized to the
+ * current num_queues.
+ */
 static int mana_set_channels(struct net_device *ndev,
 			     struct ethtool_channels *channels)
 {
@@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
 	unsigned int old_count = apc->num_queues;
 	int err;
 
+	/* Set channel_changing to block RDMA from grabbing the vport
+	 * during the detach/attach window. mana_cfg_vport() checks
+	 * this flag under vport_mutex and returns -EBUSY if set.
+	 */
+	mutex_lock(&apc->vport_mutex);
+	if (!apc->port_is_up && apc->vport_use_count) {
+		mutex_unlock(&apc->vport_mutex);
+		return -EBUSY;
+	}
+	apc->channel_changing = true;
+	mutex_unlock(&apc->vport_mutex);
+
 	err = mana_pre_alloc_rxbufs(apc, ndev->mtu, new_count);
 	if (err) {
 		netdev_err(ndev, "Insufficient memory for new allocations");
-		return err;
+		goto clear_flag;
 	}
 
 	err = mana_detach(ndev, false);
@@ -483,6 +500,10 @@ static int mana_set_channels(struct net_device *ndev,
 
 out:
 	mana_pre_dealloc_rxbufs(apc);
+clear_flag:
+	mutex_lock(&apc->vport_mutex);
+	apc->channel_changing = false;
+	mutex_unlock(&apc->vport_mutex);
 	return err;
 }
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aa90a858c8e3..e2384c5ab1df 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,8 +480,6 @@ struct mana_context {
 	u8 bm_hostmode;
 
 	struct mana_ethtool_hc_stats hc_stats;
-	struct mana_eq *eqs;
-	struct dentry *mana_eqs_debugfs;
 	struct workqueue_struct *per_port_queue_reset_wq;
 	/* Workqueue for querying hardware stats */
 	struct delayed_work gf_stats_work;
@@ -501,6 +499,9 @@ struct mana_port_context {
 
 	u8 mac_addr[ETH_ALEN];
 
+	struct mana_eq *eqs;
+	struct dentry *mana_eqs_debugfs;
+
 	enum TRI_STATE rss_state;
 
 	mana_handle_t default_rxobj;
@@ -547,6 +548,12 @@ struct mana_port_context {
 	struct mutex vport_mutex;
 	int vport_use_count;
 
+	/* Set by mana_set_channels() under vport_mutex to block RDMA
+	 * from grabbing the vport during the detach/attach window.
+	 * Checked by mana_cfg_vport() when called from the RDMA path.
+	 */
+	bool channel_changing;
+
 	/* Net shaper handle*/
 	struct net_shaper_handle handle;
 
@@ -1032,8 +1039,10 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
 			 mana_handle_t wq_obj);
 
 int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
-		   u32 doorbell_pg_id);
+		   u32 doorbell_pg_id, bool check_channel_changing);
 void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
 
 struct net_device *mana_get_primary_netdev(struct mana_context *ac,
 					   u32 port_index,
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
From: Long Li @ 2026-05-15  4:05 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui, shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel

This series moves EQ ownership from the shared mana_context to per-vPort
mana_port_context, enabling each vPort to have dedicated MSI-X vectors
when the hardware provides enough vectors. When vectors are limited, the
driver falls back to sharing MSI-X among vPorts.

The series introduces a GDMA IRQ Context (GIC) abstraction with reference
counting to manage interrupt context lifecycle. This allows both Ethernet
and RDMA EQs to dynamically acquire dedicated or shared MSI-X vectors at
vPort creation time rather than pre-allocating all vectors at probe time.

This series touches both the net and RDMA MANA drivers and is intended
to go through the net-next tree. The patches are available on a shared
branch for both netdev and RDMA maintainers to review.

The following changes since commit 73d587ae684d176fac9db94173f77d78a794ea4f:

  net: ethtool: fix missing closing paren in rings_reply_size() (2026-05-11 18:42:25 -0700)

are available in the Git repository at:

  https://github.com/longlimsft/linux.git tags/mana-eq-msi-v10

for you to fetch changes up to fb1f61f57f46f9511c64a09aa20a3a94374aebf4:

  RDMA/mana_ib: Allocate interrupt contexts on EQs
to go through the net-next tree. The patches are available on a shared
branch for both netdev and RDMA maintainers to review.

Changes in v10:
- Add channel_changing flag to block RDMA from grabbing the vport
  during mana_set_channels() detach/attach window. The flag is checked
  in mana_cfg_vport() only when called from the RDMA path via a new
  check_channel_changing parameter (patch 1)
- Bind each PD to a single physical port via pd->vport_port to prevent
  cross-port PD sharing which would cause EQ scope mismatch. Returns
  -EINVAL if a second port tries to use an already-bound PD (patch 1)
- Guard gc->msi_sharing reset with pci_msix_can_alloc_dyn() to avoid
  overwriting the non-dyn platform constraint set by
  mana_gd_setup_hwc_irqs() (patch 2)

Changes in v9:
- RSS QPs now take a vport reference via pd->vport_use_count to ensure
  EQs outlive all QP consumers. EQs are only destroyed when the last
  QP (raw or RSS) on the PD releases its reference (patch 1)
- Serialize mana_set_channels() against RDMA vport configuration via
  apc->vport_mutex when the port is down. When the port is up, Ethernet
  owns the vport exclusively so no locking is needed (patch 1)
- Change WARN_ON(apc->eqs) to bail out with -EEXIST to prevent
  leaking prior EQ array if invariant is violated (patch 1)
- Only commit pd->tx_shortform_allowed and pd->tx_vp_offset after
  mana_create_eq() succeeds (patch 1)
- Reset gc->msi_sharing at the top of mana_gd_query_max_resources()
  so it is recomputed from current hardware state on resume (patch 2)
- Fix reverse Christmas tree variable declaration ordering (patches
  1, 3, 5)

Changes in v8:
- Fix comment to reference per-vPort queue count instead of
  gc->max_num_queues (patch 2)
- Remove duplicate irq_update_affinity_hint() calls from error paths
  and mana_gd_remove_irqs(); the clearing is now centralized in
  mana_gd_put_gic() (patch 4)
- Note the IRQ name change (mana_q -> mana_msi) in the commit
  message (patch 4)
- Remove dead conditional write to spec.eq.msix_index (patch 5)
- Document GIC ownership contract and msix_index invariant change
  in commit message (patch 5)
- Populate eq.irq on RDMA EQs for consistency with the Ethernet
  path (patch 6)
- Document BIT(6) relocation and capability flag semantics in
  commit message (patch 6)
- Fix checkpatch --strict alignment and line length warnings

Changes in v7:
- Use rounddown_pow_of_two() instead of roundup_pow_of_two() when
  computing per-vPort queue count to avoid unnecessarily forcing shared
  MSI-X mode (patch 2)
- Call mana_gd_setup_remaining_irqs() unconditionally to ensure
  irq_contexts are populated in both dedicated and shared MSI-X modes,
  fixing bisectability between patches 2 and 5 (patch 2)
- Guard ibdev_dbg() in mana_ib_cfg_vport() with error check so the
  vport handle is not logged on the failure path (patch 1)
- Use cached gic->irq instead of pci_irq_vector() lookup in
  mana_gd_put_gic() for consistency with the allocation path (patch 3)
- Fix unsigned int* to int* pointer type mismatch when calling
  mana_gd_get_gic() by using a local int variable for the MSI index
  (patches 5, 6)

Changes in v6:
- Rebased on net-next/main (v7.1-rc1)

Changes in v5:
- Rebased on net-next/main

Changes in v4:
- Rebased on net-next/main 7.0-rc4
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
  max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
  pci_msix_can_alloc_dyn() on each call; improved remove_irqs iteration
  to skip unallocated entries

Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
  prevent NULL pointer dereference when RSS QP is created before a raw QP
  has configured the vport and allocated EQs

Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
  new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
  max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")

Long Li (6):
  net: mana: Create separate EQs for each vPort
  net: mana: Query device capabilities and configure MSI-X sharing for
    EQs
  net: mana: Introduce GIC context with refcounting for interrupt
    management
  net: mana: Use GIC functions to allocate global EQs
  net: mana: Allocate interrupt context for each EQ when creating vPort
  RDMA/mana_ib: Allocate interrupt contexts on EQs

 drivers/infiniband/hw/mana/main.c             |  83 ++++-
 drivers/infiniband/hw/mana/mana_ib.h          |   8 +
 drivers/infiniband/hw/mana/qp.c               |  37 +-
 .../net/ethernet/microsoft/mana/gdma_main.c   | 326 +++++++++++++-----
 drivers/net/ethernet/microsoft/mana/mana_en.c | 175 ++++++----
 .../ethernet/microsoft/mana/mana_ethtool.c    |  23 +-
 include/net/mana/gdma.h                       |  33 +-
 include/net/mana/mana.h                       |  15 +-
 8 files changed, 520 insertions(+), 180 deletions(-)

-- 
2.43.0

^ permalink raw reply

* Re: [PATCH v2] scsi: storvsc: Replace symbolic permissions with octal
From: Martin K. Petersen @ 2026-05-15  1:45 UTC (permalink / raw)
  To: Md Shofiqul Islam
  Cc: linux-scsi, linux-hyperv, linux-kernel, longli, kys, haiyangz,
	wei.liu, decui, mhklinux
In-Reply-To: <20260506004948.2172-1-shofiqtest@gmail.com>


> Symbolic permissions like S_IRUGO and S_IWUSR are not preferred by
> checkpatch. Replace with their octal equivalents:

Applied to 7.2/scsi-staging, thanks!

-- 
Martin K. Petersen

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox