Linux-HyperV List
* Re: [PATCH v5 05/51] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: Sean Christopherson @ 2026-07-01 20:07 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, kvm
In-Reply-To: <20260701195652.5DEF11F000E9@smtp.kernel.org>

On Wed, Jul 01, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> commit cdbd80de837f6d512a759452a18164ddd83b9b7b
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/sev: Mark TSC as reliable when configuring Secure TSC
> 
> This commit moves the setting of the X86_FEATURE_TSC_RELIABLE flag from
> sme_early_init() to snp_secure_tsc_init(). It consolidates the handling of
> known-frequency and reliable TSC flags when overriding the TSC calibration
> routine.
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but does the early return in
> arch/x86/kernel/tsc.c:tsc_early_init() allow an untrusted hypervisor to
> bypass Secure TSC initialization?
> 
> The tsc_early_init() function checks is_early_uv_system() before initializing
> Secure TSC. The is_early_uv_system() check relies on the presence of
> UV_SYSTEM_TABLE_GUID in the EFI configuration tables.
> 
> Because the hypervisor can provide tables to the guest firmware which may
> expose them as EFI configuration tables, an untrusted hypervisor could inject
> this specific GUID. Would this trick the kernel into returning early,
> bypassing both Secure TSC and TDX TSC initialization, and causing the guest
> to fall back to untrusted legacy timers?

I suppose so?  But that's a problem best handled in a dedicated, separate patch.

> 
> [Severity: High]
> This is also a pre-existing issue, but should the newly updated
> arch/x86/coco/sev/core.c:snp_secure_tsc_init() validate that the calculated
> frequency is non-zero before returning?
> 
> If the calculated frequency evaluates to 0, which could happen if the
> hypervisor causes MSR_AMD64_GUEST_TSC_FREQ to read as 0 or if the secrets
> factor is invalid, the function simply returns 0 instead of terminating
> the VM.

The hypervisor can't do that?  If it can, that too should be addressed separately.

> The caller tsc_early_init() treats a return value of 0 as a failure to find
> a trusted frequency and gracefully falls back to x86_init.hyper.get_tsc_khz()
> or legacy timers. In a confidential computing threat model, should this
> security-critical initialization fail closed and terminate the VM, rather
> than falling back to the hypervisor-controlled time sources that Secure TSC
> is designed to protect against?
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260701193212.749551-1-seanjc@google.com?part=5


* Re: [PATCH v5 06/51] x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
From: Sean Christopherson @ 2026-07-01 20:09 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvm, linux-hyperv
In-Reply-To: <20260701195337.2E1F01F000E9@smtp.kernel.org>

On Wed, Jul 01, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] The patch removes the `calibrate_cpu` override to allow the CPU base frequency to differ from the TSC frequency, but fails to account for `determine_cpu_tsc_frequencies()` which will forcefully clobber `cpu_khz` if it diverges by more than 10%.
> --
> 
> commit 6d3b9e81661ead5f25434666266aec6514dadaa8
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
> 
> This commit removes the override of the kernel's CPU frequency calibration
> routine when registering SNP's Secure TSC. It aims to decouple the CPU base
> frequency from the TSC frequency, as exposing APERF/MPERF requires an
> accurate CPU frequency.
> 
> > diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> > --- a/arch/x86/coco/sev/core.c
> > +++ b/arch/x86/coco/sev/core.c
> > @@ -2046,7 +2046,6 @@ void __init snp_secure_tsc_init(void)
> >  
> >  	snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
> >  
> > -	x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
> 
> [Severity: Medium]
> Will this removal be sufficient to decouple cpu_khz from tsc_khz if they
> diverge significantly?

No, but IMO it's a step in the right direction.

> Looking at determine_cpu_tsc_frequencies() in arch/x86/kernel/tsc.c, there
> is a generic sanity check that enforces a 10% limit on the difference
> between the two frequencies:
> 
> arch/x86/kernel/tsc.c:determine_cpu_tsc_frequencies() {
>     ...
>     else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
>         cpu_khz = tsc_khz;
>     ...
> }
> 
> If the true host CPU frequency diverges from the guest TSC frequency by more
> than 10%, won't this logic forcefully overwrite the correct cpu_khz back to
> tsc_khz, unintentionally undoing the fix for the APERF/MPERF calculation?

Probably, but this series is already boiling a pretty large lake, I think the
ocean can be left for a future cleanup.
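
For reference, a minimal user-space sketch of the 10% sanity check quoted
above. The function name and the use of explicit unsigned arithmetic are
assumptions made for illustration; the branch elided before the quoted
"else if" is not modeled, and this is not the kernel's code.

```c
#include <stdint.h>

/*
 * Hypothetical model of the quoted check: returns the cpu_khz value that
 * would survive it. The absolute difference is computed by hand because
 * the operands are unsigned here.
 */
static uint64_t cpu_khz_after_sanity_check(uint64_t cpu_khz, uint64_t tsc_khz)
{
	uint64_t diff = cpu_khz > tsc_khz ? cpu_khz - tsc_khz
					  : tsc_khz - cpu_khz;

	/* More than 10% away from tsc_khz: cpu_khz is overwritten. */
	if (diff * 10 > tsc_khz)
		return tsc_khz;

	return cpu_khz;
}
```

With a 3 GHz TSC, a 2 GHz CPU base frequency is replaced by the TSC
frequency, while 2.9 GHz (and exactly 10% off, 3.3 GHz) is kept.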


* Re: [PATCH v5 12/51] x86/acrn: Register TSC/CPU frequency callbacks iff frequency is actually in CPUID
From: Sean Christopherson @ 2026-07-01 20:10 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, kvm
In-Reply-To: <20260701195245.467DB1F00A3A@smtp.kernel.org>

On Wed, Jul 01, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> > index ad8f2da8003b3..dc71a6fdd4613 100644
> > --- a/arch/x86/kernel/cpu/acrn.c
> > +++ b/arch/x86/kernel/cpu/acrn.c
> [ ... ]
> > @@ -26,13 +28,19 @@ static u32 __init acrn_detect(void)
> >  
> >  static unsigned int __init acrn_get_tsc_khz(void)
> >  {
> > -	return cpuid_eax(ACRN_CPUID_TIMING_INFO);
> > +	return acrn_tsc_khz_cpuid;
> >  }
> >  
> >  static void __init acrn_init_platform(void)
> >  {
> >  	/* Install system interrupt handler for ACRN hypervisor callback */
> >  	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
> > +
> > +	acrn_tsc_khz_cpuid = cpuid_eax(ACRN_CPUID_TIMING_INFO);
> 
> [Severity: High]
> This is a pre-existing issue, but does hardcoding ACRN_CPUID_TIMING_INFO to
> 0x40000010 without dynamically applying the base offset break nested setups?
> 
> If Linux is booted as an L2 guest with an L0 hypervisor like KVM occupying
> the base 0x40000000 leaf, and L1 ACRN resides at 0x40000100, acrn_detect()
> correctly finds ACRN using cpuid_base_hypervisor() at the 0x40000100 base.
> 
> However, acrn_init_platform() hardcodes the fetch:
> 
> arch/x86/include/asm/acrn.h:
> #define ACRN_CPUID_TIMING_INFO        0x40000010
> 
> Will this bypass the dynamic base detected earlier and read the timing info
> from the wrong hypervisor's CPUID space (e.g., KVM's L0 frequency)?
> 
> If nested under a hypervisor that returns garbage for 0x40000010, could
> acrn_tsc_khz_cpuid be populated with a non-zero bogus value and corrupt
> guest timekeeping?

Yes, but it's not at all clear that ACRN supports advertising itself anywhere
but at 0x40000000.  Regardless, this can and should be addressed separately.
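
To make the concern concrete, a small sketch of deriving the timing leaf
from a detected base instead of hardcoding it. The helper and macro names
are hypothetical, and whether ACRN ever exposes its timing information at
a relocated base is an open question, as noted above.

```c
#include <stdint.h>

/* Offset of the timing leaf relative to the hypervisor CPUID base,
 * i.e. 0x40000010 - 0x40000000. */
#define TOY_TIMING_INFO_OFFSET	0x10u

/*
 * Hypothetical helper: compute the timing leaf for whatever base the
 * detection code found, rather than assuming 0x40000000.
 */
static uint32_t toy_timing_info_leaf(uint32_t detected_base)
{
	return detected_base + TOY_TIMING_INFO_OFFSET;
}
```

For a base of 0x40000000 this yields the hardcoded 0x40000010; for the
nested example's base of 0x40000100 it yields 0x40000110.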


* Re: [PATCH v5 22/51] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: Sean Christopherson @ 2026-07-01 20:13 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, kvm
In-Reply-To: <20260701200321.256EF1F000E9@smtp.kernel.org>

On Wed, Jul 01, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 909d3e5e5bcd5..1cef54e1e7d9e 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> [ ... ]
> > @@ -1040,7 +1041,16 @@ static void __init kvm_init_platform(void)
> >  		}
> >  	}
> >  
> > -	kvmclock_init();
> > +        /*
> > +         * If the TSC counts at a constant frequency across P/T states and in
> > +         * deep C-states, treat the TSC as reliable, as guaranteed by KVM.
> > +         */
> > +	tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > +			  boot_cpu_has(X86_FEATURE_NONSTOP_TSC);
> > +	if (tsc_is_reliable)
> > +		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> 
> [Severity: High]
> Does setting X86_FEATURE_TSC_RELIABLE here bypass SMP cross-CPU TSC
> synchronization checks?

No, KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is for kvmclock, not for TSC.  The
CONSTANT+NONSTOP combination is KVM's way of guaranteeing cross-CPU synchronization.

> Forcing X86_FEATURE_TSC_RELIABLE based only on CONSTANT_TSC and NONSTOP_TSC,
> without verifying if the hypervisor actually guarantees cross-CPU
> synchronization (e.g., via KVM_FEATURE_CLOCKSOURCE_STABLE_BIT), sets
> tsc_clocksource_reliable = 1.
> 
> This bypasses the synchronization check in check_tsc_sync_source():
> 
> arch/x86/kernel/tsc_sync.c:check_tsc_sync_source() {
>     ...
>     if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable)
>         return;
>     ...
> }
> 
> Could this cause time to go backwards for guests running on hosts with
> unsynchronized TSCs when threads migrate between vCPUs?


* Re: [PATCH v5 29/51] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: Sean Christopherson @ 2026-07-01 20:43 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvm, linux-hyperv
In-Reply-To: <20260701200318.B83061F000E9@smtp.kernel.org>

On Wed, Jul 01, 2026, sashiko-bot@kernel.org wrote:
> [Severity: High]
> Does this change inadvertently leave the BSP's kvmclock enabled when
> powering off the machine for ACPI S4 hibernation?
> 
> Looking at the final S4 power-off path in hibernation_platform_enter() in
> kernel/power/hibernate.c, it calls syscore_suspend() but does not call
> save_processor_state(). Because this patch removes the unconditional BSP
> kvmclock_disable() from syscore_suspend() and relies on
> kvm_save_sched_clock_state() (which only fires during image creation),
> kvmclock appears to remain fully enabled during the S4 power-off.

No?

  hibernate()
  |
  -> hibernation_snapshot()
  |  |
  |  -> create_image()
  |     |
  |     -> save_processor_state()
  |
  -> power_down()
     |
     -> hibernation_platform_enter()

I'm not familiar with the /dev/snapshot interface, but I gotta imagine userspace
needs to do SNAPSHOT_CREATE_IMAGE before SNAPSHOT_POWER_OFF if it actually wants
to restore state after S4.

> [Severity: Critical]
> This isn't a bug introduced by this patch, but is there a use-after-free
> risk with the stack-allocated waitqueue head in the async page fault
> handling in arch/x86/kernel/kvm.c?

And I thought I had a problem with scope creep during review...

> When a guest task waits for an async page fault in
> kvm_async_pf_task_wait_schedule(), it waits on a stack-allocated struct
> kvm_task_sleep_node. Concurrently, when a waker CPU executes
> apf_task_wake_one(), it unhashes the node via hlist_del_init().
> 
> The waiter checks hlist_unhashed() locklessly. If it observes the unhashed
> state before the waker is finished, the waiter can break its loop, return,
> and destroy the stack frame. The waker might then dereference the
> deallocated waitqueue head to call swake_up_one(), potentially corrupting
> stack memory.
> 
> Can this sequence occur during concurrent page fault completions?

No idea, definitely something to sort out at a different time.


* [PATCH net v3 0/2] Fix MANA RX with bounce buffering
From: Dexuan Cui @ 2026-07-02  4:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable

With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.").

This happens because, with the standard MTU=1500, the aforementioned
commit uses page pool frags with PP_FLAG_DMA_MAP, but fails to call
page_pool_dma_sync_for_cpu() to sync the received packet for CPU access
before handing the RX buffer to the stack.

Here patch #2 adds the required page_pool_dma_sync_for_cpu().

Patch #1 validates the packet length reported by the NIC. With patch #2,
page_pool_dma_sync_for_cpu() uses the packet length, so we don't want
to blindly trust the packet length, just in case.
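
As an illustration of why the sync matters with bounce buffering, here is
a toy user-space model. All names are hypothetical and nothing here is the
DMA API: the "device" writes into a bounce buffer, and the CPU-visible
buffer only sees the data after an explicit sync of a bounds-checked range.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 64

struct toy_mapping {
	unsigned char bounce[TOY_BUF_SIZE];	/* what the device writes to */
	unsigned char cpu_buf[TOY_BUF_SIZE];	/* what the driver reads from */
};

/* Model a device DMA write: it lands in the bounce buffer only. */
static int toy_device_write(struct toy_mapping *m, const void *data, size_t len)
{
	if (len > TOY_BUF_SIZE)
		return -1;
	memcpy(m->bounce, data, len);
	return 0;
}

/*
 * Model sync-for-CPU: copy [offset, offset + len) from the bounce buffer
 * to the CPU-visible buffer. The length is validated first, because it
 * ultimately comes from the device.
 */
static int toy_sync_for_cpu(struct toy_mapping *m, size_t offset, size_t len)
{
	if (offset > TOY_BUF_SIZE || len > TOY_BUF_SIZE - offset)
		return -1;
	memcpy(m->cpu_buf + offset, m->bounce + offset, len);
	return 0;
}
```

Until toy_sync_for_cpu() is called, the CPU-visible buffer still holds
stale data, which mirrors the missing-sync failure described above; an
oversized length is rejected rather than copied.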

There is no change between v2 and v3.
v3 just swaps the order of the 2 patches in v2, as suggested by Simon [3].

Please review.

Thanks,
Dexuan

References:
[1] v1: https://lore.kernel.org/netdev/20260618035029.249361-1-decui@microsoft.com/
[2] v2: https://lore.kernel.org/netdev/20260624222605.1794719-1-decui@microsoft.com/
[3] https://lore.kernel.org/netdev/20260626145048.GB1310988@horms.kernel.org/

Dexuan Cui (2):
  net: mana: Validate the packet length reported by the NIC
  net: mana: Sync page pool RX frags for CPU

 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
 include/net/mana/mana.h                       |  8 +++
 2 files changed, 58 insertions(+), 11 deletions(-)

-- 
2.34.1



* [PATCH net v3 2/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-07-02  4:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260702041237.617719-1-decui@microsoft.com>

MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.

This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.

Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
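
A rough sketch of the bookkeeping described above, in plain user-space C.
The struct, the helper names, and the 4096-byte page size are assumptions
for illustration only; the in-page range check is an extra that the patch
itself does not perform.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for this sketch */

struct toy_rx_buf {
	uint32_t dma_sync_offset;	/* fragment offset + headroom */
};

/* Record, at allocation time, where received data will start. */
static void toy_record_sync_offset(struct toy_rx_buf *buf,
				   uint32_t frag_offset, uint32_t headroom)
{
	buf->dma_sync_offset = frag_offset + headroom;
}

/* Illustrative check that [offset, offset + pktlen) stays in the page. */
static bool toy_sync_range_in_page(const struct toy_rx_buf *buf,
				   uint32_t pktlen)
{
	return buf->dma_sync_offset <= TOY_PAGE_SIZE &&
	       pktlen <= TOY_PAGE_SIZE - buf->dma_sync_offset;
}
```

For a fragment at offset 2048 with 256 bytes of headroom, the recorded
sync offset is 2304; a 1500-byte packet fits in the page, a 2000-byte one
does not.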

Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in v2.
    Add Haiyang's Reviewed-by.

Changes since v2:
    Swapped the order of the 2 patches in v2.
    No extra change.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 40 +++++++++++++++----
 include/net/mana/mana.h                       |  8 ++++
 2 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index edc504b2447a..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool *from_pool)
+			     dma_addr_t *da, bool *from_pool,
+			     struct page **pp_page, u32 *dma_sync_offset)
 {
 	struct page *page;
 	u32 offset;
 	void *va;
+
 	*from_pool = false;
+	*pp_page = NULL;
+	*dma_sync_offset = 0;
 
 	/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
 	 * per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 	va  = page_to_virt(page) + offset;
 	*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
 	*from_pool = true;
+	*pp_page = page;
+	*dma_sync_offset = offset + rxq->headroom;
 
 	return va;
 }
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf,
-			       bool *old_fp)
+			       struct mana_recv_buf_oob *rxoob, u32 pktlen,
+			       void **old_buf, bool *old_fp)
 {
+	struct page *pp_page;
+	u32 dma_sync_offset;
 	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+			     &dma_sync_offset);
 	if (!va)
 		return;
-	if (!rxoob->from_pool || rxq->frag_count == 1)
+	if (!rxoob->from_pool || rxq->frag_count == 1) {
 		dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 				 DMA_FROM_DEVICE);
+	} else {
+		/* The page pool maps the whole page and only syncs for device
+		 * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+		 * for the CPU before they are read: this is required if DMA
+		 * is incoherent or bounce buffers are used.
+		 */
+		page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+					   rxoob->dma_sync_offset, pktlen);
+	}
 	*old_buf = rxoob->buf_va;
 	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
 	rxoob->from_pool = from_pool;
+	rxoob->pp_page = pp_page;
+	rxoob->dma_sync_offset = dma_sync_offset;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2182,7 +2202,8 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 			/* Reuse the RX buffer since rxbuf_oob is unchanged. */
 		} else {
 
-			mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+			mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+					   &old_buf, &old_fp);
 
 			/* Unsuccessful refill will have old_buf == NULL.
 			 * In this case, mana_rx_skb() will drop the packet.
@@ -2579,6 +2600,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	struct page *pp_page = NULL;
+	u32 dma_sync_offset = 0;
 	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
@@ -2586,13 +2609,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+				     &dma_sync_offset);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
 	rx_oob->from_pool = from_pool;
+	rx_oob->pp_page = pp_page;
+	rx_oob->dma_sync_offset = dma_sync_offset;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
 
 	void *buf_va;
 	bool from_pool; /* allocated from a page pool */
+	/* head page of the page_pool fragment; valid only when
+	 * from_pool && frag_count > 1.
+	 */
+	struct page *pp_page;
+	/* Fragment offset plus rxq->headroom, passed to
+	 * page_pool_dma_sync_for_cpu().
+	 */
+	u32 dma_sync_offset;
 
 	/* SGL of the buffer going to be sent as part of the work request. */
 	u32 num_sge;
-- 
2.34.1



* [PATCH net v3 1/2] net: mana: Validate the packet length reported by the NIC
From: Dexuan Cui @ 2026-07-02  4:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260702041237.617719-1-decui@microsoft.com>

Validate the packet length reported in the RX CQE before passing it
to skb processing. The CQE is supplied by the NIC device and should
not be blindly trusted.
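
The validation amounts to a bounds check against the RX buffer size plus
a drop counter. A minimal user-space sketch, with hypothetical names that
stand in for the driver's structures:

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_rxq_stats {
	uint64_t rx_dropped;
};

/*
 * Return true if the device-reported length fits the RX buffer.
 * Otherwise count a drop and tell the caller to skip the packet.
 */
static bool toy_rx_len_ok(uint32_t pktlen, uint32_t datasize,
			  struct toy_rxq_stats *stats)
{
	if (pktlen > datasize) {
		stats->rx_dropped++;
		return false;
	}

	return true;
}
```

A length equal to the buffer size is accepted; anything larger is counted
as dropped and never reaches skb processing in this model.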

Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in v2.
    Add Haiyang's Reviewed-by.

Changes since v2:
    Swapped the order of the 2 patches in v2.
    No extra change.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..edc504b2447a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2170,12 +2170,25 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		if (unlikely(pktlen > rxq->datasize)) {
+			/* Increase it even if mana_rx_skb() isn't called. */
+			rxq->rx_cq.work_done++;
 
-		/* Unsuccessful refill will have old_buf == NULL.
-		 * In this case, mana_rx_skb() will drop the packet.
-		 */
-		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+			++ndev->stats.rx_dropped;
+			netdev_warn_once(ndev,
+				"Dropped oversized RX packet: len=%u, datasize=%u\n",
+				pktlen, rxq->datasize);
+
+			/* Reuse the RX buffer since rxbuf_oob is unchanged. */
+		} else {
+
+			mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+
+			/* Unsuccessful refill will have old_buf == NULL.
+			 * In this case, mana_rx_skb() will drop the packet.
+			 */
+			mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+		}
 
 		mana_move_wq_tail(rxq->gdma_rq,
 				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
-- 
2.34.1



* RE: [EXTERNAL] Re: [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-07-02  4:20 UTC (permalink / raw)
  To: Simon Horman
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
	ernis@linux.microsoft.com, dipayanroy@linux.microsoft.com,
	kees@kernel.org, jacob.e.keller@intel.com,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260626145048.GB1310988@horms.kernel.org>

> From: Simon Horman <horms@kernel.org>
> Sent: Friday, June 26, 2026 7:51 AM
>  ...
> Hi,
> 
> I'm sorry to be bothersome but I think that the order of the two patches
> that comprise this series should be reversed. Or if that is not possible,
> go back to a single patch.

Hi Simon,
Thanks for suggesting swapping the order of the 2 patches in v2! 
Totally makes sense. 

Please review v3 I just posted:
https://lore.kernel.org/netdev/20260702041237.617719-1-decui@microsoft.com/

Thanks,
Dexuan


* Re: [PATCH v1 1/4] drm/hyperv: Unregister pci driver in error path before module unload
From: Thomas Zimmermann @ 2026-07-02  6:42 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter, Deepak Rawat
  Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <4b7dbf00ce4ff664b7d5dd74b2f39d8d87c1ade9.1782925276.git.u.kleine-koenig@baylibre.com>



On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
> The pci driver must not be kept registered if the module is unloaded after
> vmbus_driver_register() fails. So check the return value of
> vmbus_driver_register() and unregister the pci driver on failure.
>
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>

Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

> ---
>   drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index 20f35c48c0b8..2e75fb793495 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -249,7 +249,11 @@ static int __init hv_drm_init(void)
>   	if (ret != 0)
>   		return ret;
>   
> -	return vmbus_driver_register(&hv_drm_hv_driver);
> +	ret = vmbus_driver_register(&hv_drm_hv_driver);
> +	if (ret)
> +		pci_unregister_driver(&hv_drm_pci_driver);
> +
> +	return ret;
>   }
>   
>   static void __exit hv_drm_exit(void)
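
The unwind pattern in the hunk above, reduced to a user-space sketch. The
toy_* functions and flags are hypothetical stand-ins for the two
registration calls, not the real PCI or VMBus APIs.

```c
/* Hypothetical stand-ins for the two registrations. */
static int toy_pci_registered;
static int toy_vmbus_register_result;	/* set by the caller to force a result */

static int toy_pci_register(void)
{
	toy_pci_registered = 1;
	return 0;
}

static void toy_pci_unregister(void)
{
	toy_pci_registered = 0;
}

static int toy_vmbus_register(void)
{
	return toy_vmbus_register_result;
}

/* Two-step init: undo step one if step two fails. */
static int toy_init(void)
{
	int ret;

	ret = toy_pci_register();
	if (ret)
		return ret;

	ret = toy_vmbus_register();
	if (ret)
		toy_pci_unregister();

	return ret;
}
```

If the second registration fails, the first one is rolled back before the
error is returned, so nothing stays registered after a failed init.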

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)




* Re: [PATCH v1 2/4] drm/hyperv: Explicitly set subvendor and subdevice for pci match array
From: Thomas Zimmermann @ 2026-07-02  6:43 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter
  Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <019450ffb519d02821364afca32b9f48bcd8d2b6.1782925276.git.u.kleine-koenig@baylibre.com>

Hi

On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
> .subvendor and .subdevice were set to 0 implicitly, so only devices with
> these two values set to 0 in hardware can probe automatically. Make this
> requirement explicit.
>
> While touching this array item, also make use of the pci macro designed
> for that case.
>
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> ---
>   drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index 2e75fb793495..e766d87b7a9d 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -51,8 +51,8 @@ static void hv_drm_pci_remove(struct pci_dev *pdev)
>   
>   static const struct pci_device_id hv_drm_pci_tbl[] = {
>   	{
> -		.vendor = PCI_VENDOR_ID_MICROSOFT,
> -		.device = PCI_DEVICE_ID_HYPERV_VIDEO,
> +		PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
> +				0, 0),

IDK, but it looks like an oversight to me.  Setting the sub-fields to 
ANY seems like the better fix.

Best regards
Thomas

>   	},
>   	{ /* end of list */ }
>   };

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)




* Re: [PATCH v1 3/4] drm/hyperv: Drop useless empty remove callback
From: Thomas Zimmermann @ 2026-07-02  6:45 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter
  Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <8a85b5f4a5ed8ec35b5a213423d4be40e34f9cb9.1782925276.git.u.kleine-koenig@baylibre.com>

Hi

On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
> Having an empty remove callback is equivalent to no remove callback.
> (The only minor difference is that with an empty remove callback
> pm_runtime_get_sync() and pm_runtime_put_noidle() are called.)
>
> Drop this useless function.
>
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> ---
>   drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 5 -----
>   1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index e766d87b7a9d..e3f41336a831 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -45,10 +45,6 @@ static int hv_drm_pci_probe(struct pci_dev *pdev,
>   	return 0;
>   }
>   
> -static void hv_drm_pci_remove(struct pci_dev *pdev)
> -{

It would be better to call drm_dev_unplug() from here.  With a bit more 
work, the driver can have hot-unplug functionality.

Best regards
Thomas

> -}
> -
>   static const struct pci_device_id hv_drm_pci_tbl[] = {
>   	{
>   		PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
> @@ -64,7 +60,6 @@ static struct pci_driver hv_drm_pci_driver = {
>   	.name =		KBUILD_MODNAME,
>   	.id_table =	hv_drm_pci_tbl,
>   	.probe =	hv_drm_pci_probe,
> -	.remove =	hv_drm_pci_remove,
>   };
>   
>   static int hv_drm_setup_vram(struct hv_drm_device *hv,

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)




* Re: [PATCH v1 4/4] drm/hyperv: Move MODULE_DEVICE_TABLE to the device_id arrays
From: Thomas Zimmermann @ 2026-07-02  6:45 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Dexuan Cui, Long Li,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter
  Cc: linux-hyperv, dri-devel, linux-kernel
In-Reply-To: <7f9d4a239c76b6bb384048ea5591a21ed87d9b0e.1782925276.git.u.kleine-koenig@baylibre.com>



On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
> It matches the usual coding style to have the MODULE_DEVICE_TABLE macro
> directly after the respective arrays.
>
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>

Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

> ---
>   drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index e3f41336a831..6a28048f687b 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -52,6 +52,7 @@ static const struct pci_device_id hv_drm_pci_tbl[] = {
>   	},
>   	{ /* end of list */ }
>   };
> +MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
>   
>   /*
>    * PCI stub to support gen1 VM.
> @@ -219,6 +220,7 @@ static const struct hv_vmbus_device_id hv_drm_vmbus_tbl[] = {
>   	{HV_SYNTHVID_GUID},
>   	{}
>   };
> +MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
>   
>   static struct hv_driver hv_drm_hv_driver = {
>   	.name = KBUILD_MODNAME,
> @@ -260,8 +262,6 @@ static void __exit hv_drm_exit(void)
>   module_init(hv_drm_init);
>   module_exit(hv_drm_exit);
>   
> -MODULE_DEVICE_TABLE(pci, hv_drm_pci_tbl);
> -MODULE_DEVICE_TABLE(vmbus, hv_drm_vmbus_tbl);
>   MODULE_LICENSE("GPL");
>   MODULE_AUTHOR("Deepak Rawat <drawat.floss@gmail.com>");
>   MODULE_DESCRIPTION("DRM driver for Hyper-V synthetic video device");

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)




* Re: [PATCH v1 2/4] drm/hyperv: Explicitly set subvendor and subdevice for pci match array
From: Uwe Kleine-König (The Capable Hub) @ 2026-07-02  8:52 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, David Airlie, Simona Vetter, linux-hyperv,
	dri-devel, linux-kernel
In-Reply-To: <7a747d47-d275-48ad-a4ea-1e4897df1d28@suse.de>


Hello Thomas,

On Thu, Jul 02, 2026 at 08:43:32AM +0200, Thomas Zimmermann wrote:
> On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
> > .subvendor and .subdevice were set to 0 implicitly, so only devices with
> > these two values set to 0 in hardware can probe automatically. Make this
> > requirement explicit.
> > 
> > While touching this array item, also make use of the pci macro designed
> > for that case.
> > 
> > Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
> > ---
> >   drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> > index 2e75fb793495..e766d87b7a9d 100644
> > --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> > +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> > @@ -51,8 +51,8 @@ static void hv_drm_pci_remove(struct pci_dev *pdev)
> >   static const struct pci_device_id hv_drm_pci_tbl[] = {
> >   	{
> > -		.vendor = PCI_VENDOR_ID_MICROSOFT,
> > -		.device = PCI_DEVICE_ID_HYPERV_VIDEO,
> > +		PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
> > +				0, 0),
> 
> IDK, but it looks like an oversight to me.  Setting the sub-fields to ANY
> seems like the better fix.

That was my initial reflex, too. However while writing the commit log
for that change I noticed that since commit d750785f305e ("Staging: hv:
fix hv_utils module to properly autoload") from 2010 (applied to
v2.6.35-rc4) the driver never worked for hardware with .subvendor != 0
or .subdevice != 0. I cannot believe that something like that is
discovered 16 years later by chance during a rework by someone who
didn't try to run that hardware. And if I understand correctly, this is
emulated hardware and so I guess used quite a lot.
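
To make the matching rule under discussion concrete, here is a minimal
userspace sketch. It only models the vendor/device/subsystem comparison
and leaves out the class mask; the struct names, helper names and the
numeric IDs are stand-ins chosen for illustration, not the kernel's own
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's PCI_ANY_ID wildcard (assumed to be all ones). */
#define SKETCH_ANY_ID 0xffffffffu

/* Hypothetical, simplified mirror of one match-table entry. */
struct sketch_id {
	uint32_t vendor, device, subvendor, subdevice;
};

/* Hypothetical, simplified mirror of the IDs a device reports. */
struct sketch_dev {
	uint32_t vendor, device, subsystem_vendor, subsystem_device;
};

/* A table field matches if it is the wildcard or equals the device value. */
static bool sketch_field_matches(uint32_t want, uint32_t have)
{
	return want == SKETCH_ANY_ID || want == have;
}

static bool sketch_entry_matches(const struct sketch_id *id,
				 const struct sketch_dev *dev)
{
	return sketch_field_matches(id->vendor, dev->vendor) &&
	       sketch_field_matches(id->device, dev->device) &&
	       sketch_field_matches(id->subvendor, dev->subsystem_vendor) &&
	       sketch_field_matches(id->subdevice, dev->subsystem_device);
}

/* Only vendor/device named: the sub-fields are implicitly 0, not "any". */
static const struct sketch_id sketch_implicit_zero = {
	.vendor = 0x1414,	/* illustrative vendor ID */
	.device = 0x5353,	/* illustrative device ID */
};

/* Sub-fields explicitly set to the wildcard. */
static const struct sketch_id sketch_wildcard_sub = {
	.vendor = 0x1414,
	.device = 0x5353,
	.subvendor = SKETCH_ANY_ID,
	.subdevice = SKETCH_ANY_ID,
};
```

With the first entry, a device reporting non-zero subsystem IDs is
rejected, whereas the second entry accepts it. That is the behavioural
difference between keeping 0 and switching to ANY.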

Best regards
Uwe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Paolo Abeni @ 2026-07-02  8:56 UTC (permalink / raw)
  To: Haiyang Zhang, linux-hyperv, netdev, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
	Breno Leitao, linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <20260629213652.11682-1-haiyangz@linux.microsoft.com>

On 6/29/26 11:36 PM, Haiyang Zhang wrote:
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 7438ea6b3f26..9391e9564605 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1591,6 +1591,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  
>  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
>  			     sizeof(req), sizeof(resp));
> +
> +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;

Double checking the above is intentional; it feels strange to me that
request and reply use different versions. Possibly a comment for future
memory would make sense.

>  	req.vport = vport;
>  	req.wq_type = wq_type;
>  	req.wq_gdma_region = wq_spec->gdma_region;
> @@ -1599,6 +1602,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  	req.cq_size = cq_spec->queue_size;
>  	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
>  	req.cq_parent_qid = cq_spec->attached_eq;
> +	req.req_cq_moderation = cq_spec->req_cq_moderation;
> +	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> +	req.cq_moderation_usec = cq_spec->cq_moderation_usec;
>  
>  	err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
>  				sizeof(resp));
> @@ -1856,6 +1862,7 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
>  	struct gdma_posted_wqe_info *wqe_info;
>  	unsigned int pkt_transmitted = 0;
>  	unsigned int wqe_unit_cnt = 0;
> +	unsigned int tx_bytes = 0;
>  	struct mana_txq *txq = cq->txq;
>  	struct mana_port_context *apc;
>  	struct netdev_queue *net_txq;
> @@ -1937,6 +1944,8 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
>  
>  		mana_unmap_skb(skb, apc);
>  
> +		tx_bytes += skb->len;
> +
>  		napi_consume_skb(skb, cq->budget);
>  
>  		pkt_transmitted++;
> @@ -1967,6 +1976,10 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
>  	if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
>  		WARN_ON_ONCE(1);
>  
> +	/* Feed DIM with the completion rate observed here, in NAPI context. */
> +	cq->tx_dim_pkts += pkt_transmitted;
> +	cq->tx_dim_bytes += tx_bytes;
> +
>  	cq->work_done = pkt_transmitted;
>  }
>  
> @@ -2318,6 +2331,119 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>  		xdp_do_flush();
>  }
>  
> +static void mana_rx_dim_work(struct work_struct *work)
> +{
> +	struct dim *dim = container_of(work, struct dim, work);
> +	struct dim_cq_moder cur_moder;
> +	struct mana_cq *cq;
> +
> +	cur_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
> +	cq = container_of(dim, struct mana_cq, dim);
> +
> +	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
> +	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
> +
> +	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
> +			 cur_moder.pkts, true);
> +
> +	dim->state = DIM_START_MEASURE;
> +}
> +
> +static void mana_tx_dim_work(struct work_struct *work)
> +{
> +	struct dim *dim = container_of(work, struct dim, work);
> +	struct dim_cq_moder cur_moder;
> +	struct mana_cq *cq;
> +
> +	cur_moder = net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
> +	cq = container_of(dim, struct mana_cq, dim);
> +
> +	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
> +	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
> +
> +	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
> +			 cur_moder.pkts, true);
> +
> +	dim->state = DIM_START_MEASURE;
> +}
> +
> +/* The caller must update apc->rx/tx_dim_enabled before disabling and
> + * after enabling. And synchronize_net() before draining the DIM work,
> + * so that NAPI cannot observe a stale flag.
> + */
> +int mana_dim_change(struct mana_cq *cq, bool enable)

This always returns 0, and the return value is not checked by the
callers; the return type should likely be changed to void.

/P


^ permalink raw reply

* Re: [PATCH v1 2/4] drm/hyperv: Explicitly set subvendor and subdevice for pci match array
From: Thomas Zimmermann @ 2026-07-02  9:15 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub)
  Cc: Dexuan Cui, Long Li, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, David Airlie, Simona Vetter, linux-hyperv,
	dri-devel, linux-kernel
In-Reply-To: <akYkWQzXIo-y3n4J@monoceros>

Hi

On 02.07.26 at 10:52, Uwe Kleine-König (The Capable Hub) wrote:
> Hello Thomas,
>
> On Thu, Jul 02, 2026 at 08:43:32AM +0200, Thomas Zimmermann wrote:
>> On 01.07.26 at 19:05, Uwe Kleine-König (The Capable Hub) wrote:
>>> .subvendor and .subdevice were set to 0 implicitly, so only devices with
>>> these two values set to 0 in hardware can probe automatically. Make this
>>> requirement explicit.
>>>
>>> While touching this array item, also make use of the pci macro designed
>>> for that case.
>>>
>>> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
>>> ---
>>>    drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
>>> index 2e75fb793495..e766d87b7a9d 100644
>>> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
>>> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
>>> @@ -51,8 +51,8 @@ static void hv_drm_pci_remove(struct pci_dev *pdev)
>>>    static const struct pci_device_id hv_drm_pci_tbl[] = {
>>>    	{
>>> -		.vendor = PCI_VENDOR_ID_MICROSOFT,
>>> -		.device = PCI_DEVICE_ID_HYPERV_VIDEO,
>>> +		PCI_VDEVICE_SUB(MICROSOFT, PCI_DEVICE_ID_HYPERV_VIDEO,
>>> +				0, 0),
>> IDK, but it looks like an oversight to me.  Setting the sub-fields to ANY
>> seems like the better fix.
> That was my initial reflex, too. However while writing the commit log
> for that change I noticed that since commit d750785f305e ("Staging: hv:
> fix hv_utils module to properly autoload") from 2010 (applied to
> v2.6.35-rc4) the driver never worked for hardware with .subvendor != 0
> or .subdevice != 0. I cannot believe that something like that is
> discovered 16 years later by chance during a rework by someone who
> didn't try to run that hardware. And if I understand correctly, this is
> emulated hardware and so I guess used quite a lot.

I wouldn't be surprised. To my knowledge, there's just one
implementation of this device, which is Windows. If they clear their
host-side structures to 0 and pass them to the guest, no one would ever
notice the issue. But let's see what the driver maintainers have to say
about it.

Best regards
Thomas

>
> Best regards
> Uwe

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* Re: [PATCH 00/32] x86/msr: Drop 32-bit MSR interfaces
From: Ingo Molnar @ 2026-07-02 10:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Arnd Bergmann, Juergen Gross, linux-kernel, linux-pm,
	linux-edac@vger.kernel.org, x86, linux-acpi, kvm, linux-coco,
	linux-pci, virtualization, linux-ide, dri-devel, linux-fbdev,
	linux-crypto, open list:GPIO SUBSYSTEM, linux-hyperv, linux-hwmon,
	linux-perf-users, linux-mtd, platform-driver-x86,
	Rafael J . Wysocki, Daniel Lezcano, Zhang Rui,
	lukasz.luba@arm.com, Jason Baron, Borislav Petkov, Tony Luck,
	Yazen Ghannam, Len Brown, Pavel Machek, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Kirill A. Shutemov, Rick Edgecombe, Pu Wen, Bjorn Helgaas,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
	Viresh Kumar, Reinette Chatre, Dave Martin, James Morse,
	Babu Moger, Tony W Wang-oc, Damien Le Moal, Niklas Cassel,
	Dave Airlie, Helge Deller, linux-geode, Olivia Mackall,
	Herbert Xu, Linus Walleij, Bartosz Golaszewski,
	Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Guenter Roeck, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Josh Poimboeuf, Pawan Gupta, Vitaly Kuznetsov,
	Andy Lutomirski, Boris Ostrovsky, Huang Rui, Mario Limonciello,
	Perry Yuan, K Prateek Nayak, srinivas.pandruvada@linux.intel.com,
	Artem Bityutskiy, Artem Bityutskiy, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, Ashok Raj, Hans de Goede,
	Ilpo Järvinen, Rajneesh Bhardwaj, David E Box, xen-devel
In-Reply-To: <akQR9YMtMHReJTfB@google.com>


* Sean Christopherson <seanjc@google.com> wrote:

> > Note that the individual patches are IMO significantly easier to review
> > through the actual 32-bit => 64-bit variable assignment changes done
> > in isolation (which sometimes include minor cleanups), while
> > the Coccinelle semantic patch:
> > 
> >    { a(b,c) => c = a(b) }
> > 
> > which changes both the function signature and the order of terms as
> > well, is just a single add-on treewide patch.
> 
> Is the plan for subsystem maintainers to pick up the relevant patches,
> and then do the treewide change one release cycle later?

I'll try to keep the patches in a single tree (tip:x86/msr)
in the hope of not prolonging the pain over two cycles - but it's
of course fine for maintainers to pick up the patches too
(most of them are standalone), we'll sort it all out in the end.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 00/32] x86/msr: Drop 32-bit MSR interfaces
From: Juergen Gross @ 2026-07-02 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Sean Christopherson
  Cc: Arnd Bergmann, linux-kernel, linux-pm, linux-edac@vger.kernel.org,
	x86, linux-acpi, kvm, linux-coco, linux-pci, virtualization,
	linux-ide, dri-devel, linux-fbdev, linux-crypto,
	open list:GPIO SUBSYSTEM, linux-hyperv, linux-hwmon,
	linux-perf-users, linux-mtd, platform-driver-x86,
	Rafael J . Wysocki, Daniel Lezcano, Zhang Rui,
	lukasz.luba@arm.com, Jason Baron, Borislav Petkov, Tony Luck,
	Yazen Ghannam, Len Brown, Pavel Machek, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Kirill A. Shutemov, Rick Edgecombe, Pu Wen, Bjorn Helgaas,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
	Viresh Kumar, Reinette Chatre, Dave Martin, James Morse,
	Babu Moger, Tony W Wang-oc, Damien Le Moal, Niklas Cassel,
	Dave Airlie, Helge Deller, linux-geode, Olivia Mackall,
	Herbert Xu, Linus Walleij, Bartosz Golaszewski,
	Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Guenter Roeck, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Josh Poimboeuf, Pawan Gupta, Vitaly Kuznetsov,
	Andy Lutomirski, Boris Ostrovsky, Huang Rui, Mario Limonciello,
	Perry Yuan, K Prateek Nayak, srinivas.pandruvada@linux.intel.com,
	Artem Bityutskiy, Artem Bityutskiy, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, Ashok Raj, Hans de Goede,
	Ilpo Järvinen, Rajneesh Bhardwaj, David E Box, xen-devel
In-Reply-To: <akY4U0jUZm4HOGZ_@gmail.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 1367 bytes --]

On 02.07.26 12:07, Ingo Molnar wrote:
> 
> * Sean Christopherson <seanjc@google.com> wrote:
> 
>>> Note that the individual patches are IMO significantly easier to review
>>> through the actual 32-bit => 64-bit variable assignment changes done
>>> in isolation (which sometimes include minor cleanups), while
>>> the Coccinelle semantic patch:
>>>
>>>     { a(b,c) => c = a(b) }
>>>
>>> which changes both the function signature and the order of terms as
>>> well, is just a single add-on treewide patch.
>>
>> Is the plan for subsystem maintainers to pick up the relevant patches,
>> and then do the treewide change one release cycle later?
> 
> I'll try to keep the patches in a single tree (tip:x86/msr)
> in the hope of not prolonging the pain two cycles - but it's
> of course fine for maintainers to pick up the patches too
> (most of them are standalone), we'll sort it all out in the end.

Ingo, would you be fine with me posting patch updates just as replies to the
original patch emails? This would speed things up, as I wouldn't need to wait
for more review input of all the patches before sending out new versions.

As the patches are (mostly) standalone, this should not cause any weird
problems.

The last two patches might need updates, but those can be applied only after
the rest has been accepted anyway.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v11 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: sashiko-bot @ 2026-07-02 14:18 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: bpf, linux-hyperv
In-Reply-To: <20260701141808.461554-3-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [High] Forcing `port_is_up = true` in the `mana_attach()` error path leaves the port state inconsistent, leading to a double-teardown of hardware queues and a NULL pointer dereference in stats gathering.
- [High] The device restart sequence in `mana_set_priv_flags()` lacks `vport_mutex` synchronization and does not set `apc->channel_changing`, creating a race condition with concurrent RDMA operations.

Pre-existing issues:
- [High] Bulk pre-allocation of RX buffers uses GFP_ATOMIC and emergency memory reserves in a process context (ethtool), which can lead to memory depletion and spurious failures.
- [High] Pre-allocated RX buffers with `from_pool = false` are leaked when dropped in a queue configured with `frag_count > 1`.
--

commit 40ed2f3da9bc06c10116aa35575da8c740d52002
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: force full-page RX buffers via ethtool private flag
    
This patch adds an ethtool private flag "full-page-rx" to force one RX
buffer per page on ARM64 platforms. The single-buffer-per-page logic is
consolidated into a new helper function. The driver is detached and
reattached when the flag is modified to apply the new buffer configurations.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 3531abfdaf433..24f6ce63b031b 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
[ ... ]
> @@ -611,6 +630,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
>  	return 0;
>  }
>  
> +static u32 mana_get_priv_flags(struct net_device *ndev)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +
> +	return apc->priv_flags;
> +}
> +
> +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u32 changed = apc->priv_flags ^ priv_flags;
> +	u32 old_priv_flags = apc->priv_flags;
> +	bool schedule_port_reset = false;
> +	int err = 0;
> +
> +	if (!changed)
> +		return 0;
> +
> +	/* Reject unknown bits */
> +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> +		return -EINVAL;
> +
> +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> +		apc->priv_flags = priv_flags;
> +
> +		if (!apc->port_is_up) {
> +			/* Port is down, flag updated to apply on next up
> +			 * so just return.
> +			 */
> +			return 0;
> +		}
> +
> +		/* Pre-allocate buffers to prevent failure in mana_attach
> +		 * later
> +		 */
> +		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);

[Severity: High]
This is a pre-existing issue, but does calling mana_pre_alloc_rxbufs() from
ethtool context cause memory allocation problems?

mana_pre_alloc_rxbufs() calls dev_alloc_pages() in a bulk loop, which uses
GFP_ATOMIC and emergency memory reserves. Since this is now triggered from
process context via ethtool, could this deplete system emergency reserves
intended for atomic receive paths, leading to spurious failures or network
drops under memory pressure?

[Severity: High]
This isn't a bug introduced by this patch, but does toggling this flag
dynamically create conditions that permanently leak pre-allocated buffers?

When a packet is dropped, mana_rx_skb() checks if from_pool is false (which
it is for pre-allocated buffers) and unconditionally saves the buffer to
rxq->xdp_save_va. However, mana_get_rxfrag() only reuses xdp_save_va if
rxq->frag_count == 1. 

If this flag causes the queue to operate with frag_count > 1 after buffers
are pre-allocated (e.g. by disabling the flag), xdp_save_va is ignored.
Subsequent drops will trigger WARN_ON_ONCE(rxq->xdp_save_va) and permanently
leak the old buffer.

> +		if (err) {
> +			netdev_err(ndev,
> +				   "Insufficient memory for new allocations\n");
> +			apc->priv_flags = old_priv_flags;
> +			return err;
> +		}
> +
> +		err = mana_detach(ndev, false);

[Severity: High]
Does this device restart sequence lack vport_mutex synchronization with
concurrent RDMA operations?

In mana_set_channels(), detach and re-attach is protected by asserting
apc->channel_changing under vport_mutex. Here, mana_detach() tears down
the vport (dropping vport_use_count to 0) without setting
apc->channel_changing. 

Can an RDMA process concurrently call mana_cfg_vport(), observe a zero use
count, and successfully claim the vport before mana_attach() blindly
attempts to configure the exact same vport, resulting in conflicting
hardware programming?

> +		if (err) {
> +			netdev_err(ndev, "mana_detach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Port is in an inconsistent state. Restore
> +			 * 'port_is_up' so that queue reset work handler
> +			 * can properly detach and re-attach.
> +			 */
> +			apc->port_is_up = true;
> +			schedule_port_reset = true;
> +			goto out;
> +		}
> +
> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Restore 'port_is_up' so the reset work handler
> +			 * can properly detach/attach. Without this,
> +			 * the handler sees port_is_up=false and skips
> +			 * queue allocation, leaving the port dead.
> +			 */
> +			apc->port_is_up = true;

[Severity: High]
Does forcing port_is_up to true in this error path lead to a NULL pointer
dereference in stats gathering?

The internal failure path of mana_attach() calls
mana_cleanup_port_context(), which frees apc->rxqs and sets it to NULL. 

By forcing apc->port_is_up = true here, if userspace queries stats via
ethtool -S or ip -s link, mana_get_stats64() and mana_get_ethtool_stats()
will bypass their safety checks and blindly dereference the NULL
apc->rxqs[q] array.

> +			schedule_port_reset = true;
> +		}
> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260701141808.461554-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply

* [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests
From: Yu Zhang @ 2026-07-02 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

This patch series introduces a para-virtualized IOMMU driver for
Linux guests running on Microsoft Hyper-V. The driver enables two
primary use cases:
  1) In-kernel DMA protection for devices assigned to the guest.
  2) Device assignment to guest user space (e.g., via VFIO).

The driver implements the following core functionality:
*   Hypercall-based Enumeration
    Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS),
    this driver enumerates the Hyper-V IOMMU capabilities directly
    via hypercalls. This approach allows the guest to discover
    IOMMU presence and features without requiring specific virtual
    firmware extensions or modifications.

*   Domain Management
    The driver manages IOMMU domains through a new set of Hyper-V
    hypercall interfaces, handling domain allocation and attachment
    for endpoint devices.

*   Nested Translation Support
    This implementation leverages guest-managed stage-1 I/O page
    tables nested with host stage-2 translations. It is built
    upon the consolidated IOMMU page table framework (IOMMU_PT).
    This design eliminates the need for emulating map operations.
    Both Intel VT-d and AMD IOMMU platforms are supported.

*   IOTLB Invalidation
    IOTLB invalidation requests are marshaled and issued to the
    hypervisor through the same hypercall mechanism. Both domain-
    selective and page-selective flushes are supported.

Implementation Notes:
*   Platform Support
    The current implementation targets x86 platforms with Intel
    VT-d and AMD IOMMU hardware.

*   MSI Region Handling
    The hardware MSI region is hard-coded to the standard x86
    interrupt range (0xfee00000 - 0xfeefffff). Future updates may
    allow this configuration to be queried via hypercalls if new
    hardware platforms are to be supported.

*   Reserved Regions (RMRR)
    There is currently no requirement to support assigned devices with
    ACPI RMRR limitations. Consequently, this patch series does not
    specify or query reserved memory regions.
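
As a rough illustration of the hard-coded MSI window mentioned above,
the following userspace sketch checks whether an IOVA range overlaps
0xfee00000 - 0xfeefffff. The macro and function names are hypothetical
and are not taken from the series; only the two addresses come from the
text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 interrupt range quoted in the cover letter (end inclusive). */
#define SKETCH_MSI_START 0xfee00000ULL
#define SKETCH_MSI_END   0xfeefffffULL

/*
 * Return true if [iova, iova + size - 1] intersects the MSI window.
 * An empty range never overlaps; a range that wraps around the top of
 * the address space is clamped to the highest address.
 */
static bool sketch_overlaps_msi(uint64_t iova, uint64_t size)
{
	uint64_t last;

	if (size == 0)
		return false;

	last = iova + size - 1;
	if (last < iova)
		last = UINT64_MAX;

	return iova <= SKETCH_MSI_END && last >= SKETCH_MSI_START;
}
```

Treating both ends as inclusive avoids an off-by-one for a mapping that
ends exactly at 0xfedfffff or starts exactly at 0xfef00000.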

Testing:
This series has been validated with the following configurations:
- Intel DSA devices assigned to the guest, tested with dmatest.
- NVMe devices assigned to the guest on AMD platforms, tested
  with fio.
- dma_map_benchmark for DMA mapping performance evaluation.

Changelog:

v1[1] -> v2:
- Dropped the "move to subdirectory" patch; the directory now exists
  upstream.

- hv: logical device ID registry:
  - Moved the registry to hv_common.c so it can be shared, and derived
    the prefix via a shared helper instead of caching it in pci-hyperv's
    private struct.
  - Moved the lookup out of the irq-disabled region (PREEMPT_RT).

- iommu/hyperv: para-virtualized IOMMU:
  - Removed the unused detach_dev op.
  - Rejected a hypervisor not advertising x86 page sizes instead of
    masking and warning.
  - Statically initialized the identity and blocking domains.
  - Gave the blocking domain its own attach op, which returns the hypercall
    status and WARNs on failure.

- iommu/hyperv: page-selective IOTLB flush:
  - Used a single descriptor covering a slightly larger power-of-two
    range, instead of splitting the range into multiple descriptors.
  - Fixed the inclusive-end corner case in the flush range calculation.
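
The single-descriptor approach can be sketched in userspace as follows.
This is not the code from the patch: the struct, the function names and
the 4 KiB page size are assumptions made for illustration. It computes
the smallest naturally aligned power-of-two block of pages that contains
an inclusive page range.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical descriptor: one aligned power-of-two block of pages. */
struct sketch_flush_desc {
	uint64_t base_pfn;	/* first page covered, aligned to 1 << order */
	unsigned int order;	/* log2 of the number of pages covered */
};

/*
 * Find the smallest order at which first_pfn and last_pfn fall into the
 * same aligned block. Both ends are inclusive. Assumes page frame
 * numbers below 2^63, so the loop settles at order 63 or earlier.
 */
static struct sketch_flush_desc sketch_cover_pfns(uint64_t first_pfn,
						  uint64_t last_pfn)
{
	struct sketch_flush_desc d;
	unsigned int order = 0;

	assert(first_pfn <= last_pfn);

	while (order < 63 && (first_pfn >> order) != (last_pfn >> order))
		order++;

	d.order = order;
	d.base_pfn = (first_pfn >> order) << order;
	return d;
}

/* Convert a byte range to an inclusive page range before covering it. */
static struct sketch_flush_desc sketch_cover_bytes(uint64_t iova,
						   uint64_t size)
{
	uint64_t last;

	assert(size > 0);
	last = iova + size - 1;	/* inclusive last byte, not one past the end */

	return sketch_cover_pfns(iova >> SKETCH_PAGE_SHIFT,
				 last >> SKETCH_PAGE_SHIFT);
}
```

For pages 5..6 this yields the block 4..7 (order 2), slightly larger than
requested but expressible in one descriptor. Using the inclusive last
byte matters: for an exactly aligned range such as pages 4..7, taking the
exclusive end would pull in page 8 and inflate the result to 16 pages.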

RFC v1[2] -> v1[1]:
- Scoped platform support to x86 only (Intel VT-d and AMD IOMMU);
  initialization now uses x86_init.iommu.iommu_init
- Added page-selective IOTLB flush support
- Disable device ATS in hv_iommu_release_device()
- Addressed review comments from Michael Kelley:
  - Reversed dependency: pvIOMMU exports registration API for
    pci-hyperv to call, instead of pci-hyperv exporting
    hv_build_logical_dev_id()
  - Dropped separate output page allocation patch; hypercall input
    and output now share the same per-CPU page
  - Cleaned up Kconfig (removed PCI_HYPERV dependency, unnecessary
    selects)
  - Removed dev_list, per-domain spinlock, and syscore_ops
  - Removed forward declarations by reordering functions
  - Fixed typos, cleaned up Kconfig selects, improved pr_info
    messages, etc.

[1] https://lore.kernel.org/linux-hyperv/20260511162408.1180069-1-zhangyu1@linux.microsoft.com/
[2] https://lore.kernel.org/linux-hyperv/20251209051128.76913-1-zhangyu1@linux.microsoft.com/


Easwar Hariharan (1):
  Drivers: hv: Add logical device ID registry for vPCI devices

Wei Liu (1):
  hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU

Yu Zhang (2):
  iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
  iommu/hyperv: Add page-selective IOTLB flush support

 arch/x86/hyperv/hv_init.c           |   4 +
 arch/x86/include/asm/mshyperv.h     |   4 +
 drivers/hv/hv_common.c              |  95 ++++
 drivers/iommu/Kconfig               |   1 +
 drivers/iommu/hyperv/Kconfig        |  16 +
 drivers/iommu/hyperv/Makefile       |   1 +
 drivers/iommu/hyperv/iommu.c        | 686 ++++++++++++++++++++++++++++
 drivers/iommu/hyperv/iommu.h        |  51 +++
 drivers/pci/controller/pci-hyperv.c |  21 +-
 include/asm-generic/mshyperv.h      |  13 +
 include/hyperv/hvgdk_mini.h         |   9 +
 include/hyperv/hvhdk_mini.h         | 141 ++++++
 include/linux/hyperv.h              |   8 +
 13 files changed, 1045 insertions(+), 5 deletions(-)
 create mode 100644 drivers/iommu/hyperv/Kconfig
 create mode 100644 drivers/iommu/hyperv/iommu.c
 create mode 100644 drivers/iommu/hyperv/iommu.h

-- 
2.52.0


^ permalink raw reply

* [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: Yu Zhang @ 2026-07-02 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor
In-Reply-To: <20260702160518.311234-1-zhangyu1@linux.microsoft.com>

From: Wei Liu <wei.liu@kernel.org>

Hyper-V guest IOMMU is a para-virtualized IOMMU based on hypercalls.
Introduce the hypercalls used by the child partition to interact with
this facility.

These hypercalls fall into the following categories:
- Detection and capability: HVCALL_GET_IOMMU_CAPABILITIES is used to
  detect the existence and capabilities of the guest IOMMU.

- Device management: HVCALL_GET_LOGICAL_DEVICE_PROPERTY is used to
  check whether an endpoint device is managed by the guest IOMMU.

- Domain management: A set of hypercalls is provided to handle the
  creation, configuration, and deletion of guest domains, as well as
  the attachment/detachment of endpoint devices to/from those domains.

- IOTLB flushing: HVCALL_FLUSH_DEVICE_DOMAIN is used to ask Hyper-V
  for a domain-selective IOTLB flush (which in its handler may flush
  the device TLB as well).

Signed-off-by: Wei Liu <wei.liu@kernel.org>
Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 include/hyperv/hvgdk_mini.h |   8 +++
 include/hyperv/hvhdk_mini.h | 124 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)

diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 6a4e8b9d570f..5bdbb44da112 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -486,10 +486,16 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_VP_INDEX_FROM_APIC_ID		0x009a
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
+#define HVCALL_CREATE_DEVICE_DOMAIN			0x00b1
+#define HVCALL_ATTACH_DEVICE_DOMAIN			0x00b2
 #define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
 #define HVCALL_POST_MESSAGE_DIRECT			0x00c1
 #define HVCALL_DISPATCH_VP				0x00c2
+#define HVCALL_DETACH_DEVICE_DOMAIN			0x00c4
+#define HVCALL_DELETE_DEVICE_DOMAIN			0x00c5
 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
+#define HVCALL_CONFIGURE_DEVICE_DOMAIN			0x00ce
+#define HVCALL_FLUSH_DEVICE_DOMAIN			0x00d0
 #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
 #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
@@ -502,6 +508,8 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_MMIO_READ				0x0106
 #define HVCALL_MMIO_WRITE				0x0107
 #define HVCALL_DISABLE_HYP_EX                           0x010f
+#define HVCALL_GET_IOMMU_CAPABILITIES			0x0125
+#define HVCALL_GET_LOGICAL_DEVICE_PROPERTY		0x0127
 #define HVCALL_MAP_STATS_PAGE2				0x0131
 
 /* HV_HYPERCALL_INPUT */
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index b4cb2fa26e9b..493608e791b4 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -547,4 +547,128 @@ union hv_device_id {		/* HV_DEVICE_ID */
 	} acpi;
 } __packed;
 
+/* Device domain types */
+#define HV_DEVICE_DOMAIN_TYPE_S1	1 /* Stage 1 domain */
+
+/* ID for default domain and NULL domain */
+#define HV_DEVICE_DOMAIN_ID_DEFAULT 0
+#define HV_DEVICE_DOMAIN_ID_NULL    0xFFFFFFFFULL
+
+union hv_device_domain_id {
+	u64 as_uint64;
+	struct {
+		u32 type: 4;
+		u32 reserved: 28;
+		u32 id;
+	} __packed;
+};
+
+struct hv_input_device_domain {
+	u64 partition_id;
+	union hv_input_vtl owner_vtl;
+	u8 padding[7];
+	union hv_device_domain_id domain_id;
+} __packed;
+
+union hv_create_device_domain_flags {
+	u32 as_uint32;
+	struct {
+		u32 forward_progress_required: 1;
+		u32 inherit_owning_vtl: 1;
+		u32 reserved: 30;
+	} __packed;
+};
+
+struct hv_input_create_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_create_device_domain_flags create_device_domain_flags;
+} __packed;
+
+struct hv_input_delete_device_domain {
+	struct hv_input_device_domain device_domain;
+} __packed;
+
+struct hv_input_attach_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_input_detach_device_domain {
+	u64 partition_id;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_device_domain_settings {
+	struct {
+		/*
+		 * Enable translations. If not enabled, all transactions bypass
+		 * S1 translations.
+		 */
+		u64 translation_enabled: 1;
+		u64 blocked: 1;
+		/*
+		 * First stage address translation paging mode:
+		 * 0: 4-level paging (default)
+		 * 1: 5-level paging
+		 */
+		u64 first_stage_paging_mode: 1;
+		u64 reserved: 61;
+	} flags;
+
+	/* Address of translation table */
+	u64 page_table_root;
+} __packed;
+
+struct hv_input_configure_device_domain {
+	struct hv_input_device_domain device_domain;
+	struct hv_device_domain_settings settings;
+} __packed;
+
+struct hv_input_get_iommu_capabilities {
+	u64 partition_id;
+	u64 reserved;
+} __packed;
+
+struct hv_output_get_iommu_capabilities {
+	u32 size;
+	u16 reserved;
+	u8  max_iova_width;
+	u8  max_pasid_width;
+
+#define HV_IOMMU_CAP_PRESENT (1ULL << 0)
+#define HV_IOMMU_CAP_S2 (1ULL << 1)
+#define HV_IOMMU_CAP_S1 (1ULL << 2)
+#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3)
+#define HV_IOMMU_CAP_PASID (1ULL << 4)
+#define HV_IOMMU_CAP_ATS (1ULL << 5)
+#define HV_IOMMU_CAP_PRI (1ULL << 6)
+
+	u64 iommu_cap;
+	u64 pgsize_bitmap;
+} __packed;
+
+enum hv_logical_device_property_code {
+	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
+};
+
+struct hv_input_get_logical_device_property {
+	u64 partition_id;
+	u64 logical_device_id;
+	/* Takes values from enum hv_logical_device_property_code. */
+	u32 code;
+	u32 reserved;
+} __packed;
+
+struct hv_output_get_logical_device_property {
+#define HV_DEVICE_IOMMU_ENABLED (1ULL << 0)
+	u64 device_iommu;
+	u64 reserved;
+} __packed;
+
+struct hv_input_flush_device_domain {
+	struct hv_input_device_domain device_domain;
+	u32 flags;
+	u32 reserved;
+} __packed;
+
 #endif /* _HV_HVHDK_MINI_H */
-- 
2.52.0


* [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor
In-Reply-To: <20260702160518.311234-1-zhangyu1@linux.microsoft.com>

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>

Hyper-V identifies each PCI pass-thru device by a logical device ID in
its hypercall interface. This ID consists of a per-bus prefix, derived
from the VMBus device instance GUID, combined with the PCI function
number of the endpoint device.

Add a small registry in hv_common.c that maps a PCI domain number to its
logical device ID prefix. The vPCI bus driver (pci-hyperv) registers the
prefix when a bus is probed and unregisters it when the bus is removed.
Consumers such as the para-virtualized IOMMU driver look up the prefix
by PCI domain number and combine it with the function number to form the
complete logical device ID for hypercalls.
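
As a rough illustration of the registry semantics described above, here is a
minimal userspace sketch. It is not the kernel code: the demo_* names are
hypothetical, a plain singly linked list stands in for the kernel's list_head,
and locking is omitted (the patch serializes access with a spinlock). The
return values mirror what the diff below implements: re-registering the same
prefix succeeds, a conflicting prefix yields -EEXIST, and a lookup for an
unknown PCI domain yields -ENODEV.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for one registry entry. */
struct demo_bus {
	int domain_nr;          /* PCI domain number (lookup key) */
	uint32_t prefix;        /* per-bus logical device ID prefix */
	struct demo_bus *next;
};

static struct demo_bus *demo_bus_list;

/* Register a prefix; idempotent for an identical re-registration. */
static int demo_register(int domain_nr, uint32_t prefix)
{
	struct demo_bus *bus;

	for (bus = demo_bus_list; bus; bus = bus->next) {
		if (bus->domain_nr != domain_nr)
			continue;
		/* Same domain: accept only the prefix already recorded. */
		return bus->prefix == prefix ? 0 : -EEXIST;
	}

	bus = calloc(1, sizeof(*bus));
	if (!bus)
		return -ENOMEM;

	bus->domain_nr = domain_nr;
	bus->prefix = prefix;
	bus->next = demo_bus_list;
	demo_bus_list = bus;
	return 0;
}

/* Drop the entry for a domain, if one exists. */
static void demo_unregister(int domain_nr)
{
	struct demo_bus **pp;

	for (pp = &demo_bus_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->domain_nr == domain_nr) {
			struct demo_bus *victim = *pp;

			*pp = victim->next;
			free(victim);
			return;
		}
	}
}

/* Fill *prefix and return 0, or return -ENODEV if nothing is registered. */
static int demo_lookup(int domain_nr, uint32_t *prefix)
{
	struct demo_bus *bus;

	assert(prefix != NULL);
	for (bus = demo_bus_list; bus; bus = bus->next) {
		if (bus->domain_nr == domain_nr) {
			*prefix = bus->prefix;
			return 0;
		}
	}
	return -ENODEV;
}
```

A consumer would call demo_lookup() with the device's PCI domain number and OR
the PCI function number into the returned prefix.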

The prefix construction is shared via hv_build_logical_dev_id_prefix() so
that pci-hyperv's interrupt retargeting path and the registry use exactly
the same byte layout. It is derived on demand from the constant hv_device
instance GUID rather than cached in struct hv_pcibus_device, which is
private to the pci-hyperv module; this keeps the interface narrow and
avoids depending on pci-hyperv internals.
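
For readers who want to see the byte layout in isolation, the following
userspace sketch reproduces the arithmetic of hv_build_logical_dev_id_prefix()
from the diff below. The demo_guid type and the demo_* function names are
hypothetical stand-ins for the kernel's GUID type and helper; only the bit
operations are taken from the patch. The low three bits of GUID byte 6 are
masked off, which leaves room for the PCI function number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-byte VMBus device instance GUID. */
struct demo_guid {
	uint8_t b[16];
};

/* Build the per-bus prefix from GUID bytes 5, 4, 7 and 6 (low 3 bits cleared). */
static uint32_t demo_build_prefix(const struct demo_guid *guid)
{
	assert(guid != NULL);
	return ((uint32_t)guid->b[5] << 24) |
	       ((uint32_t)guid->b[4] << 16) |
	       ((uint32_t)guid->b[7] << 8) |
	       (guid->b[6] & 0xf8u);
}

/* Combine the prefix with the PCI function number (low 3 bits of devfn). */
static uint32_t demo_logical_dev_id(uint32_t prefix, uint8_t devfn)
{
	return prefix | (devfn & 0x07u);
}
```

With GUID bytes b[4..7] = 34 12 ff 56 (hex), the prefix works out to
0x123456f8, and function 2 of a device on that bus gets the logical device ID
0x123456fa.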

Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
---
 drivers/hv/hv_common.c              | 95 +++++++++++++++++++++++++++++
 drivers/pci/controller/pci-hyperv.c | 21 +++++--
 include/asm-generic/mshyperv.h      | 13 ++++
 include/linux/hyperv.h              |  8 +++
 4 files changed, 132 insertions(+), 5 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac616789..53493f8d14dc 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -26,6 +26,8 @@
 #include <linux/kmsg_dump.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
 #include <linux/dma-map-ops.h>
 #include <linux/set_memory.h>
 #include <hyperv/hvhdk.h>
@@ -863,3 +865,96 @@ const char *hv_result_to_string(u64 status)
 	return "Unknown";
 }
 EXPORT_SYMBOL_GPL(hv_result_to_string);
+
+#ifdef CONFIG_HYPERV_PVIOMMU
+/*
+ * Logical device ID registry shared between the vPCI bus driver
+ * (pci-hyperv) and the para-virtualized IOMMU driver. The vPCI driver
+ * registers the per-bus logical device ID prefix at bus probe time, and
+ * the pvIOMMU driver looks it up to build the full logical device ID used
+ * in IOMMU hypercalls.
+ */
+struct hv_pci_busdata {
+	int		 pci_domain_nr;
+	u32		 logical_dev_id_prefix;
+	struct list_head list;
+};
+
+static LIST_HEAD(hv_pci_bus_list);
+static DEFINE_SPINLOCK(hv_pci_bus_lock);
+
+int hv_iommu_register_pci_bus(int pci_domain_nr, u32 logical_dev_id_prefix)
+{
+	struct hv_pci_busdata *bus, *new;
+	int ret = 0;
+
+	new = kzalloc_obj(*new, GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry(bus, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr != pci_domain_nr)
+			continue;
+
+		if (bus->logical_dev_id_prefix != logical_dev_id_prefix) {
+			pr_err("stale registration for PCI domain %d (old prefix 0x%08x, new 0x%08x)\n",
+			       pci_domain_nr, bus->logical_dev_id_prefix,
+			       logical_dev_id_prefix);
+			ret = -EEXIST;
+		}
+
+		goto out_free;
+	}
+
+	new->pci_domain_nr = pci_domain_nr;
+	new->logical_dev_id_prefix = logical_dev_id_prefix;
+	list_add(&new->list, &hv_pci_bus_list);
+	spin_unlock(&hv_pci_bus_lock);
+	return 0;
+
+out_free:
+	spin_unlock(&hv_pci_bus_lock);
+	kfree(new);
+	return ret;
+}
+EXPORT_SYMBOL_FOR_MODULES(hv_iommu_register_pci_bus, "pci-hyperv");
+
+void hv_iommu_unregister_pci_bus(int pci_domain_nr)
+{
+	struct hv_pci_busdata *bus, *tmp;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry_safe(bus, tmp, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr == pci_domain_nr) {
+			list_del(&bus->list);
+			kfree(bus);
+			break;
+		}
+	}
+	spin_unlock(&hv_pci_bus_lock);
+}
+EXPORT_SYMBOL_FOR_MODULES(hv_iommu_unregister_pci_bus, "pci-hyperv");
+
+/*
+ * Look up the logical device ID prefix registered for @pci_domain_nr.
+ * Returns 0 on success with *prefix filled in; -ENODEV if no entry is
+ * registered for that PCI domain.
+ */
+int hv_iommu_lookup_logical_dev_id(int pci_domain_nr, u32 *prefix)
+{
+	struct hv_pci_busdata *bus;
+	int ret = -ENODEV;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry(bus, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr == pci_domain_nr) {
+			*prefix = bus->logical_dev_id_prefix;
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock(&hv_pci_bus_lock);
+	return ret;
+}
+#endif /* CONFIG_HYPERV_PVIOMMU */
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index cfc8fa403dad..58ca2c95bd10 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -641,10 +641,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
 	params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
 	params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
 	params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
-	params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
-			   (hbus->hdev->dev_instance.b[4] << 16) |
-			   (hbus->hdev->dev_instance.b[7] << 8) |
-			   (hbus->hdev->dev_instance.b[6] & 0xf8) |
+	params->device_id = hv_build_logical_dev_id_prefix(hbus->hdev) |
 			   PCI_FUNC(pdev->devfn);
 	params->int_target.vector = hv_msi_get_int_vector(data);
 
@@ -3715,6 +3712,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 	struct hv_pcibus_device *hbus;
 	int ret, dom;
 	u16 dom_req;
+	u32 prefix;
 	char *name;
 
 	bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
@@ -3857,13 +3855,22 @@ static int hv_pci_probe(struct hv_device *hdev,
 
 	hbus->state = hv_pcibus_probed;
 
-	ret = create_root_hv_pci_bus(hbus);
+	/* Notify pvIOMMU before any device on the bus is scanned. */
+	prefix = hv_build_logical_dev_id_prefix(hdev);
+
+	ret = hv_iommu_register_pci_bus(dom, prefix);
 	if (ret)
 		goto free_windows;
 
+	ret = create_root_hv_pci_bus(hbus);
+	if (ret)
+		goto unregister_pviommu;
+
 	mutex_unlock(&hbus->state_lock);
 	return 0;
 
+unregister_pviommu:
+	hv_iommu_unregister_pci_bus(dom);
 free_windows:
 	hv_pci_free_bridge_windows(hbus);
 exit_d0:
@@ -3977,6 +3984,8 @@ static void hv_pci_remove(struct hv_device *hdev)
 
 	hbus = hv_get_drvdata(hdev);
 	if (hbus->state == hv_pcibus_installed) {
+		int dom = hbus->bridge->domain_nr;
+
 		tasklet_disable(&hdev->channel->callback_event);
 		hbus->state = hv_pcibus_removing;
 		tasklet_enable(&hdev->channel->callback_event);
@@ -3994,6 +4003,8 @@ static void hv_pci_remove(struct hv_device *hdev)
 		hv_pci_remove_slots(hbus);
 		pci_remove_root_bus(hbus->bridge->bus);
 		pci_unlock_rescan_remove();
+
+		hv_iommu_unregister_pci_bus(dom);
 	}
 
 	hv_pci_bus_exit(hdev, false);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb..f65344f2bb81 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -73,6 +73,19 @@ extern enum hv_partition_type hv_curr_partition_type;
 extern void * __percpu *hyperv_pcpu_input_arg;
 extern void * __percpu *hyperv_pcpu_output_arg;
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+int  hv_iommu_register_pci_bus(int pci_domain_nr, u32 logical_dev_id_prefix);
+void hv_iommu_unregister_pci_bus(int pci_domain_nr);
+int  hv_iommu_lookup_logical_dev_id(int pci_domain_nr, u32 *prefix);
+#else
+static inline int hv_iommu_register_pci_bus(int pci_domain_nr,
+					    u32 logical_dev_id_prefix)
+{
+	return 0;
+}
+static inline void hv_iommu_unregister_pci_bus(int pci_domain_nr) { }
+#endif
+
 u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
 u64 hv_do_fast_hypercall8(u16 control, u64 input8);
 u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 9de2c8d6037a..10ee2c462d7c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1287,6 +1287,14 @@ struct hv_device {
 #define device_to_hv_device(d)	container_of_const(d, struct hv_device, device)
 #define drv_to_hv_drv(d)	container_of_const(d, struct hv_driver, driver)
 
+static inline u32 hv_build_logical_dev_id_prefix(struct hv_device *hdev)
+{
+	return ((u32)hdev->dev_instance.b[5] << 24) |
+	       ((u32)hdev->dev_instance.b[4] << 16) |
+	       ((u32)hdev->dev_instance.b[7] << 8) |
+	       (hdev->dev_instance.b[6] & 0xf8u);
+}
+
 static inline void hv_set_drvdata(struct hv_device *dev, void *data)
 {
 	dev_set_drvdata(&dev->device, data);
-- 
2.52.0


* [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor
In-Reply-To: <20260702160518.311234-1-zhangyu1@linux.microsoft.com>

Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
This driver implements stage-1 IO translation within the guest OS.
It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
for:
 - Capability discovery
 - Domain allocation, configuration, and deallocation
 - Device attachment and detachment
 - IOTLB invalidation

The driver constructs x86-compatible stage-1 IO page tables in the
guest memory using consolidated IO page table helpers. This allows
the guest to manage stage-1 translations independently of vendor-
specific drivers (like Intel VT-d or AMD IOMMU).
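
As a rough sketch of how the paging depth follows from the hypervisor's
advertised capabilities, the userspace snippet below mirrors the checks in
hv_init_iommu_device() and hv_configure_device_domain() from the diff below.
The demo_* names are hypothetical, and the capability bit value is copied from
HV_IOMMU_CAP_S1_5LVL in the header patch of this series. The width is clamped
to 48 bits when 5-level paging is not advertised, and any width above 48 bits
selects 5-level tables.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit position as HV_IOMMU_CAP_S1_5LVL in this series. */
#define DEMO_CAP_S1_5LVL (1ULL << 3)

/* Clamp the IOVA width to 48 bits unless 5-level paging is advertised. */
static uint8_t demo_effective_iova_width(uint64_t cap, uint8_t max_iova_width)
{
	if (!(cap & DEMO_CAP_S1_5LVL) && max_iova_width > 48)
		return 48;
	return max_iova_width;
}

/* Widths above 48 bits need 5-level tables; anything else uses 4 levels. */
static unsigned int demo_paging_levels(uint8_t iova_width)
{
	assert(iova_width > 0);
	return iova_width > 48 ? 5 : 4;
}
```

A hypervisor that reports a 57-bit IOVA width without the 5-level capability
therefore ends up with a 48-bit aperture and 4-level tables.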

Hyper-V consumes this stage-1 IO page table when a device domain is
created and configured, and nests it with the host's stage-2 IO page
tables, thereby eliminating VM exits for guest IOMMU mapping
operations. Unmapping operations still require a VM exit to perform the
IOTLB flush.

To identify a device in its hypercall interface, the driver looks up the
logical device ID prefix registered for the device's PCI domain (see the
logical device ID registry in hv_common.c) and combines it with the PCI
function number of the endpoint device.

Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 arch/x86/hyperv/hv_init.c       |   4 +
 arch/x86/include/asm/mshyperv.h |   4 +
 drivers/iommu/Kconfig           |   1 +
 drivers/iommu/hyperv/Kconfig    |  16 +
 drivers/iommu/hyperv/Makefile   |   1 +
 drivers/iommu/hyperv/iommu.c    | 620 ++++++++++++++++++++++++++++++++
 drivers/iommu/hyperv/iommu.h    |  51 +++
 7 files changed, 697 insertions(+)
 create mode 100644 drivers/iommu/hyperv/Kconfig
 create mode 100644 drivers/iommu/hyperv/iommu.c
 create mode 100644 drivers/iommu/hyperv/iommu.h

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 55a8b6de2865..094f9f7ddb72 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -578,6 +578,10 @@ void __init hyperv_init(void)
 	old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
 	x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+	x86_init.iommu.iommu_init = hv_iommu_init;
+#endif
+
 	hv_apic_init();
 
 	x86_init.pci.arch_init = hv_pci_init;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f64393e853ee..20d947c2c758 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -313,6 +313,10 @@ static inline void mshv_vtl_return_hypercall(void) {}
 static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
 #endif
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+int __init hv_iommu_init(void);
+#endif
+
 #include <asm-generic/mshyperv.h>
 
 #endif
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 6e07bd69467a..0d128f377929 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -195,6 +195,7 @@ config MSM_IOMMU
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/arm/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/hyperv/Kconfig"
 source "drivers/iommu/iommufd/Kconfig"
 source "drivers/iommu/riscv/Kconfig"
 
diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
new file mode 100644
index 000000000000..8b6abbaaf9b8
--- /dev/null
+++ b/drivers/iommu/hyperv/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# Hyper-V para-virtualized IOMMU support
+config HYPERV_PVIOMMU
+	bool "Microsoft Hypervisor para-virtualized IOMMU support"
+	depends on X86_64 && HYPERV
+	select IOMMU_API
+	select GENERIC_PT
+	select IOMMU_PT
+	select IOMMU_PT_X86_64
+	select IOMMU_IOVA
+	default HYPERV
+	help
+	  Para-virtualized IOMMU driver for Linux guests running on
+	  Microsoft Hyper-V. Provides DMA remapping and IOTLB
+	  flush support to enable DMA isolation for devices
+	  assigned to the guest.
diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile
index 6ef0ef97f3dd..fefb409d976b 100644
--- a/drivers/iommu/hyperv/Makefile
+++ b/drivers/iommu/hyperv/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_IRQ_REMAP) += hv-irq-remap-x86.o
+obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
new file mode 100644
index 000000000000..254136946404
--- /dev/null
+++ b/drivers/iommu/hyperv/iommu.c
@@ -0,0 +1,620 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Hyper-V IOMMU driver.
+ *
+ * Copyright (C) 2019, 2024-2026 Microsoft, Inc.
+ */
+
+#define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
+#define dev_fmt(fmt) pr_fmt(fmt)
+
+#include <linux/iommu.h>
+#include <linux/pci.h>
+#include <linux/dma-map-ops.h>
+#include <linux/generic_pt/iommu.h>
+#include <linux/pci-ats.h>
+
+#include <asm/iommu.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
+
+#include "iommu.h"
+#include "../iommu-pages.h"
+
+struct hv_iommu_dev *hv_iommu_device;
+
+/*
+ * Identity and blocking domains are static singletons: identity is a 1:1
+ * passthrough with no page table, blocking rejects all DMA. Neither holds
+ * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
+ */
+static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
+static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
+static struct iommu_ops hv_iommu_ops;
+
+static struct hv_iommu_domain hv_identity_domain = {
+	.domain = {
+		.type	= IOMMU_DOMAIN_IDENTITY,
+		.ops	= &hv_iommu_identity_domain_ops,
+		.owner	= &hv_iommu_ops,
+	},
+};
+static struct hv_iommu_domain hv_blocking_domain = {
+	.domain = {
+		.type	= IOMMU_DOMAIN_BLOCKED,
+		.ops	= &hv_iommu_blocking_domain_ops,
+		.owner	= &hv_iommu_ops,
+	},
+};
+
+static inline bool hv_iommu_present(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_PRESENT;
+}
+
+static inline bool hv_iommu_s1_domain_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_S1;
+}
+
+static inline bool hv_iommu_5lvl_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_S1_5LVL;
+}
+
+static inline bool hv_iommu_ats_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_ATS;
+}
+
+static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage)
+{
+	int ret;
+	u64 status;
+	unsigned long flags;
+	struct hv_input_create_device_domain *input;
+
+	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
+			hv_iommu_device->first_domain, hv_iommu_device->last_domain,
+			GFP_KERNEL);
+	if (ret < 0)
+		return ret;
+
+	hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+	hv_domain->device_domain.domain_id.type = domain_stage;
+	hv_domain->device_domain.domain_id.id = ret;
+	hv_domain->hv_iommu = hv_iommu_device;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->create_device_domain_flags.forward_progress_required = 1;
+	input->create_device_domain_flags.inherit_owning_vtl = 0;
+	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("HVCALL_CREATE_DEVICE_DOMAIN failed, status %lld\n", status);
+		ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id);
+	}
+
+	return hv_result_to_errno(status);
+}
+
+static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_delete_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_DELETE_DEVICE_DOMAIN failed, status %lld\n", status);
+
+	ida_free(&hv_domain->hv_iommu->domain_ids, hv_domain->device_domain.domain_id.id);
+}
+
+static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	case IOMMU_CAP_DEFERRED_FLUSH:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_flush_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status %lld\n", status);
+}
+
+static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev,
+			       struct iommu_domain *old)
+{
+	u64 status;
+	u32 prefix;
+	unsigned long flags;
+	struct pci_dev *pdev;
+	struct hv_input_attach_device_domain *input;
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
+	int ret;
+
+	if (vdev->hv_domain == hv_domain)
+		return 0;
+
+	pdev = to_pci_dev(dev);
+	dev_dbg(dev, "attaching to domain %d\n",
+		hv_domain->device_domain.domain_id.id);
+
+	ret = hv_iommu_lookup_logical_dev_id(pci_domain_nr(pdev->bus), &prefix);
+	if (ret) {
+		dev_err(&pdev->dev, "no IOMMU registration for vPCI bus\n");
+		return ret;
+	}
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->device_id.as_uint64 = (u64)prefix | PCI_FUNC(pdev->devfn);
+	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_ATTACH_DEVICE_DOMAIN failed, status %lld\n", status);
+	else
+		vdev->hv_domain = hv_domain;
+
+	return hv_result_to_errno(status);
+}
+
+static int hv_iommu_blocking_attach_dev(struct iommu_domain *domain,
+					struct device *dev,
+					struct iommu_domain *old)
+{
+	int ret = hv_iommu_attach_dev(domain, dev, old);
+
+	/*
+	 * Attaching to the blocking domain only asks the hypervisor to
+	 * disable translation and IOPF for the device, so it cannot fail
+	 * unless there is a driver or hypervisor bug. Return the hypercall
+	 * status rather than 0 so that a failure on the DMA ownership claim
+	 * path (VFIO/iommufd) fails the claim instead of leaving the device
+	 * unblocked. WARN since such a failure indicates a bug.
+	 */
+	WARN_ON(ret);
+	return ret;
+}
+
+static int hv_iommu_get_logical_device_property(struct device *dev,
+					u32 code,
+					struct hv_output_get_logical_device_property *property)
+{
+	u64 status;
+	u32 prefix;
+	unsigned long flags;
+	int ret;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hv_input_get_logical_device_property *input;
+	struct hv_output_get_logical_device_property *output;
+
+	ret = hv_iommu_lookup_logical_dev_id(pci_domain_nr(pdev->bus), &prefix);
+	if (ret)
+		return ret;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = (struct hv_output_get_logical_device_property *)(input + 1);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = HV_PARTITION_ID_SELF;
+	input->logical_device_id = (u64)prefix | PCI_FUNC(pdev->devfn);
+	input->code = code;
+	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
+	*property = *output;
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static struct iommu_device *hv_iommu_probe_device(struct device *dev)
+{
+	struct pci_dev *pdev;
+	struct hv_iommu_endpoint *vdev;
+	struct hv_output_get_logical_device_property device_iommu_property = {0};
+
+	if (!dev_is_pci(dev))
+		return ERR_PTR(-ENODEV);
+
+	pdev = to_pci_dev(dev);
+
+	if (hv_iommu_get_logical_device_property(dev,
+						 HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
+						 &device_iommu_property) ||
+	    !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED))
+		return ERR_PTR(-ENODEV);
+
+	vdev = kzalloc_obj(*vdev, GFP_KERNEL);
+	if (!vdev)
+		return ERR_PTR(-ENOMEM);
+
+	vdev->dev = dev;
+	vdev->hv_iommu = hv_iommu_device;
+	dev_iommu_priv_set(dev, vdev);
+
+	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
+	    pci_ats_supported(pdev))
+		pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap));
+
+	return &vdev->hv_iommu->iommu;
+}
+
+static void hv_iommu_release_device(struct device *dev)
+{
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (pdev->ats_enabled)
+		pci_disable_ats(pdev);
+
+	dev_iommu_priv_set(dev, NULL);
+
+	kfree(vdev);
+}
+
+static struct iommu_group *hv_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+
+	WARN_ON_ONCE(1);
+	return generic_device_group(dev);
+}
+
+static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type)
+{
+	u64 status;
+	unsigned long flags;
+	struct pt_iommu_x86_64_hw_info pt_info;
+	struct hv_input_configure_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED);
+	/*
+	 * Clearing translation_enabled bypasses translation (DMA uses the GPA
+	 * directly), which only suits identity. The hypervisor requires paging
+	 * and blocked domains to keep it set.
+	 */
+	input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY);
+
+	if (domain_type & __IOMMU_DOMAIN_PAGING) {
+		pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info);
+		input->settings.page_table_root = pt_info.gcr3_pt;
+		input->settings.flags.first_stage_paging_mode =
+			pt_info.levels == 5;
+	}
+	status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_CONFIGURE_DEVICE_DOMAIN failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static int __init hv_initialize_static_domains(void)
+{
+	int ret;
+	struct hv_iommu_domain *hv_domain;
+
+	/* Default stage-1 identity domain */
+	hv_domain = &hv_identity_domain;
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		return ret;
+
+	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
+	if (ret)
+		goto delete_identity_domain;
+
+	/* Default stage-1 blocked domain */
+	hv_domain = &hv_blocking_domain;
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		goto delete_identity_domain;
+
+	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
+	if (ret)
+		goto delete_blocked_domain;
+
+	return 0;
+
+delete_blocked_domain:
+	hv_delete_device_domain(&hv_blocking_domain);
+delete_identity_domain:
+	hv_delete_device_domain(&hv_identity_domain);
+	return ret;
+}
+
+/* x86 architectural MSI address range */
+#define INTERRUPT_RANGE_START	(0xfee00000)
+#define INTERRUPT_RANGE_END	(0xfeefffff)
+static void hv_iommu_get_resv_regions(struct device *dev,
+		struct list_head *head)
+{
+	struct iommu_resv_region *region;
+
+	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
+				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
+				      0, IOMMU_RESV_MSI, GFP_KERNEL);
+	if (!region)
+		return;
+
+	list_add_tail(&region->list, head);
+}
+
+static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
+{
+	hv_flush_device_domain(to_hv_iommu_domain(domain));
+}
+
+static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
+				struct iommu_iotlb_gather *iotlb_gather)
+{
+	hv_flush_device_domain(to_hv_iommu_domain(domain));
+
+	iommu_put_pages_list(&iotlb_gather->freelist);
+}
+
+static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
+{
+	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
+
+	/* Free all remaining mappings */
+	pt_iommu_deinit(&hv_domain->pt_iommu);
+
+	hv_delete_device_domain(hv_domain);
+
+	kfree(hv_domain);
+}
+
+static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
+	.attach_dev	= hv_iommu_attach_dev,
+};
+
+static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
+	.attach_dev	= hv_iommu_blocking_attach_dev,
+};
+
+static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
+	.attach_dev	= hv_iommu_attach_dev,
+	IOMMU_PT_DOMAIN_OPS(x86_64),
+	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
+	.iotlb_sync = hv_iommu_iotlb_sync,
+	.free = hv_iommu_paging_domain_free,
+};
+
+static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
+{
+	int ret;
+	struct hv_iommu_domain *hv_domain;
+	struct pt_iommu_x86_64_cfg cfg = {};
+
+	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
+	if (!hv_domain)
+		return ERR_PTR(-ENOMEM);
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		goto err_free;
+
+	hv_domain->pt_iommu.nid = dev_to_node(dev);
+
+	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
+	cfg.common.hw_max_oasz_lg2 = 52;
+	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
+
+	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
+	if (ret)
+		goto err_delete_domain;
+
+	/* Constrain to page sizes the hypervisor supports */
+	hv_domain->domain.pgsize_bitmap &= hv_iommu_device->pgsize_bitmap;
+
+	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
+
+	ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
+	if (ret)
+		goto err_pt_deinit;
+
+	return &hv_domain->domain;
+
+err_pt_deinit:
+	pt_iommu_deinit(&hv_domain->pt_iommu);
+err_delete_domain:
+	hv_delete_device_domain(hv_domain);
+err_free:
+	kfree(hv_domain);
+	return ERR_PTR(ret);
+}
+
+static struct iommu_ops hv_iommu_ops = {
+	.capable		  = hv_iommu_capable,
+	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
+	.probe_device		  = hv_iommu_probe_device,
+	.release_device		  = hv_iommu_release_device,
+	.device_group		  = hv_iommu_device_group,
+	.get_resv_regions	  = hv_iommu_get_resv_regions,
+	.owner			  = THIS_MODULE,
+	.identity_domain	  = &hv_identity_domain.domain,
+	.blocked_domain		  = &hv_blocking_domain.domain,
+	.release_domain		  = &hv_blocking_domain.domain,
+};
+
+static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_get_iommu_capabilities *input;
+	struct hv_output_get_iommu_capabilities *output;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = (struct hv_output_get_iommu_capabilities *)(input + 1);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = HV_PARTITION_ID_SELF;
+	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
+	*hv_iommu_cap = *output;
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
+			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
+{
+	ida_init(&hv_iommu->domain_ids);
+
+	hv_iommu->cap = hv_iommu_cap->iommu_cap;
+	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
+	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
+	    hv_iommu->max_iova_width > 48) {
+		pr_info("5-level paging not supported, limiting iova width to 48.\n");
+		hv_iommu->max_iova_width = 48;
+	}
+
+	hv_iommu->geometry = (struct iommu_domain_geometry) {
+		.aperture_start = 0,
+		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
+		.force_aperture = true,
+	};
+
+	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
+	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
+	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap;
+	hv_iommu_device = hv_iommu;
+}
+
+int __init hv_iommu_init(void)
+{
+	int ret = 0;
+	struct hv_iommu_dev *hv_iommu = NULL;
+	struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
+
+	if (no_iommu || iommu_detected)
+		return -ENODEV;
+
+	if (!hv_is_hyperv_initialized())
+		return -ENODEV;
+
+	ret = hv_iommu_detect(&hv_iommu_cap);
+	if (ret) {
+		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed: %d\n", ret);
+		return -ENODEV;
+	}
+
+	if (!hv_iommu_present(hv_iommu_cap.iommu_cap) ||
+	    !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) {
+		pr_err("IOMMU capabilities not sufficient: cap=0x%llx\n",
+		       hv_iommu_cap.iommu_cap);
+		return -ENODEV;
+	}
+
+	/*
+	 * The page table code only maps x86 page sizes (4K/2M/1G); require the
+	 * hypervisor to advertise a non-empty subset of exactly those.
+	 */
+	if (!hv_iommu_cap.pgsize_bitmap ||
+	    (hv_iommu_cap.pgsize_bitmap & ~(u64)(SZ_4K | SZ_2M | SZ_1G))) {
+		pr_err("unsupported page sizes: pgsize_bitmap=0x%llx\n",
+		       hv_iommu_cap.pgsize_bitmap);
+		return -ENODEV;
+	}
+
+	iommu_detected = 1;
+	pci_request_acs();
+
+	hv_iommu = kzalloc_obj(*hv_iommu, GFP_KERNEL);
+	if (!hv_iommu)
+		return -ENOMEM;
+
+	hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
+
+	ret = hv_initialize_static_domains();
+	if (ret) {
+		pr_err("static domains init failed: %d\n", ret);
+		goto err_free;
+	}
+
+	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
+	if (ret) {
+		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
+		goto err_delete_static_domains;
+	}
+
+	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
+	if (ret) {
+		pr_err("iommu_device_register failed: %d\n", ret);
+		goto err_sysfs_remove;
+	}
+
+	pr_info("successfully initialized\n");
+	return 0;
+
+err_sysfs_remove:
+	iommu_device_sysfs_remove(&hv_iommu->iommu);
+err_delete_static_domains:
+	hv_delete_device_domain(&hv_blocking_domain);
+	hv_delete_device_domain(&hv_identity_domain);
+err_free:
+	kfree(hv_iommu);
+	return ret;
+}
diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
new file mode 100644
index 000000000000..3a9f40fa2403
--- /dev/null
+++ b/drivers/iommu/hyperv/iommu.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Hyper-V IOMMU driver.
+ *
+ * Copyright (C) 2024-2025, Microsoft, Inc.
+ *
+ */
+
+#ifndef _HYPERV_IOMMU_H
+#define _HYPERV_IOMMU_H
+
+struct hv_iommu_dev {
+	struct iommu_device iommu;
+	struct ida domain_ids;
+
+	/* Device configuration */
+	u8  max_iova_width;
+	u8  max_pasid_width;
+	u64 cap;
+	u64 pgsize_bitmap;
+
+	struct iommu_domain_geometry geometry;
+	u64 first_domain;
+	u64 last_domain;
+};
+
+struct hv_iommu_domain {
+	union {
+		struct iommu_domain    domain;
+		struct pt_iommu        pt_iommu;
+		struct pt_iommu_x86_64 pt_iommu_x86_64;
+	};
+	struct hv_iommu_dev *hv_iommu;
+	struct hv_input_device_domain device_domain;
+	u64		pgsize_bitmap;
+};
+
+PT_IOMMU_CHECK_DOMAIN(struct hv_iommu_domain, pt_iommu, domain);
+PT_IOMMU_CHECK_DOMAIN(struct hv_iommu_domain, pt_iommu_x86_64.iommu, domain);
+
+struct hv_iommu_endpoint {
+	struct device *dev;
+	struct hv_iommu_dev *hv_iommu;
+	struct hv_iommu_domain *hv_domain;
+};
+
+#define to_hv_iommu_domain(d) \
+	container_of(d, struct hv_iommu_domain, domain)
+
+#endif /* _HYPERV_IOMMU_H */
-- 
2.52.0



* [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-07-02 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor
In-Reply-To: <20260702160518.311234-1-zhangyu1@linux.microsoft.com>

Add page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
This hypercall accepts a list of (page_number, page_mask_shift) entries,
enabling finer-grained IOTLB invalidation compared to the domain-wide
HVCALL_FLUSH_DEVICE_DOMAIN used by hv_iommu_flush_iotlb_all().

hv_iommu_calc_flush_range() computes the smallest power-of-two aligned
range that covers the target IOVA region, producing a single flush
descriptor. This may over-flush when the range is not naturally aligned,
matching the approach used by Intel VT-d PSI. If the page-selective
flush fails, the code falls back to a full domain flush.
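
To illustrate the alignment computation described above, here is a minimal
userspace sketch (not the driver code). The names flush_desc, highest_bit()
and calc_flush_range() are hypothetical, and the sketch assumes 4 KiB Hyper-V
pages, so that HVPFN_UP(end + 1) - 1 reduces to end >> 12.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB Hyper-V page size */

/* Hypothetical stand-in for the (page_number, page_mask_shift) descriptor. */
struct flush_desc {
	uint64_t page_number;
	unsigned int page_mask_shift;
};

/* Index of the highest set bit; x must be non-zero (same role as __fls()). */
static unsigned int highest_bit(uint64_t x)
{
	unsigned int bit = 0;

	while (x >>= 1)
		bit++;
	return bit;
}

/* Smallest naturally aligned power-of-two PFN range covering [start, end]. */
static struct flush_desc calc_flush_range(uint64_t start, uint64_t end)
{
	struct flush_desc d;
	uint64_t start_pfn = start >> DEMO_PAGE_SHIFT;
	uint64_t last_pfn = end >> DEMO_PAGE_SHIFT;	/* end is inclusive */

	assert(start <= end);

	/* One above the highest differing bit covers both endpoints. */
	d.page_mask_shift = (start_pfn == last_pfn) ?
			    0 : highest_bit(start_pfn ^ last_pfn) + 1;
	/* Align the base PFN down to the size of the covering range. */
	d.page_number = start_pfn & ~((UINT64_C(1) << d.page_mask_shift) - 1);
	return d;
}
```

For example, [0x3000, 0x4fff] spans PFNs 3 and 4, whose highest differing bit
is bit 2, so the descriptor becomes page_number 0 with page_mask_shift 3:
eight pages are flushed to cover two. The naturally aligned [0x2000, 0x3fff]
yields page_number 2 with page_mask_shift 1 and no over-flush.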

Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 drivers/iommu/hyperv/iommu.c | 68 +++++++++++++++++++++++++++++++++++-
 include/hyperv/hvgdk_mini.h  |  1 +
 include/hyperv/hvhdk_mini.h  | 17 +++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
index 254136946404..e9b104a322fd 100644
--- a/drivers/iommu/hyperv/iommu.c
+++ b/drivers/iommu/hyperv/iommu.c
@@ -9,6 +9,7 @@
 #define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
 #define dev_fmt(fmt) pr_fmt(fmt)
 
+#include <linux/hyperv.h>
 #include <linux/iommu.h>
 #include <linux/pci.h>
 #include <linux/dma-map-ops.h>
@@ -401,10 +402,74 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
 	hv_flush_device_domain(to_hv_iommu_domain(domain));
 }
 
+/*
+ * Calculate the minimal power-of-two aligned range that covers [start, end]
+ * (end is inclusive). Returns a single (page_number, page_mask_shift)
+ * descriptor that may over-flush when the range is not naturally aligned.
+ */
+static void hv_iommu_calc_flush_range(unsigned long start, unsigned long end,
+				       union hv_iommu_flush_va *va)
+{
+	unsigned long start_pfn = HVPFN_DOWN(start);
+	unsigned long last_pfn = HVPFN_UP(end + 1) - 1;
+	unsigned long mask_shift, aligned_pfn;
+
+	if (start_pfn == last_pfn) {
+		mask_shift = 0;
+	} else {
+		/*
+		 * Find the highest bit position where start_pfn and last_pfn
+		 * differ.  A range aligned to one above that bit is the
+		 * smallest power-of-two region that covers both endpoints.
+		 */
+		mask_shift = __fls(start_pfn ^ last_pfn) + 1;
+	}
+
+	aligned_pfn = ALIGN_DOWN(start_pfn, 1UL << mask_shift);
+	va->page_number = aligned_pfn;
+	va->page_mask_shift = mask_shift;
+}
+
+static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
+					struct iommu_iotlb_gather *iotlb_gather)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_flush_device_domain_list *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = hv_domain->device_domain;
+	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
+	hv_iommu_calc_flush_range(iotlb_gather->start, iotlb_gather->end,
+				  &input->iova_list[0]);
+
+	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
+				     1, 0, input, NULL);
+
+	if (!hv_result_success(status)) {
+		/* Page-selective flush failed, fall back to full flush. */
+		struct hv_input_flush_device_domain *flush_all = (void *)input;
+
+		memset(flush_all, 0, sizeof(*flush_all));
+		flush_all->device_domain = hv_domain->device_domain;
+		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
+					flush_all, NULL);
+		WARN(!hv_result_success(status),
+		     "HVCALL_FLUSH_DEVICE_DOMAIN fallback also failed: %lld\n",
+		     status);
+	}
+
+	local_irq_restore(flags);
+}
+
 static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *iotlb_gather)
 {
-	hv_flush_device_domain(to_hv_iommu_domain(domain));
+	hv_flush_device_domain_list(to_hv_iommu_domain(domain), iotlb_gather);
 
 	iommu_put_pages_list(&iotlb_gather->freelist);
 }
@@ -455,6 +520,7 @@ static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
 
 	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
 	cfg.common.hw_max_oasz_lg2 = 52;
+	cfg.common.features |= BIT(PT_FEAT_FLUSH_RANGE);
 	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
 
 	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 5bdbb44da112..eaaf87171478 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -496,6 +496,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
 #define HVCALL_CONFIGURE_DEVICE_DOMAIN			0x00ce
 #define HVCALL_FLUSH_DEVICE_DOMAIN			0x00d0
+#define HVCALL_FLUSH_DEVICE_DOMAIN_LIST			0x00d1
 #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
 #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 493608e791b4..f51d5d9467f1 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -671,4 +671,21 @@ struct hv_input_flush_device_domain {
 	u32 reserved;
 } __packed;
 
+union hv_iommu_flush_va {
+	u64 iova;
+	struct {
+		u64 page_mask_shift : 12;
+		u64 page_number : 52;
+	};
+} __packed;
+
+
+struct hv_input_flush_device_domain_list {
+	struct hv_input_device_domain device_domain;
+#define HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT (1 << 0)
+	u32 flags;
+	u32 reserved;
+	union hv_iommu_flush_va iova_list[];
+} __packed;
+
 #endif /* _HV_HVHDK_MINI_H */
-- 
2.52.0



* Re: [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: sashiko-bot @ 2026-07-02 16:36 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260702160518.311234-2-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] `struct hv_input_create_device_domain` is missing a 4-byte padding field, resulting in a non-standard 28-byte size.
--

commit 267ecd2e71d0d7f097fc4d970780fef12801efc5
Author: Wei Liu <wei.liu@kernel.org>

hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU

This commit introduces new hypercall definitions to allow a child partition
to interact with the para-virtualized Hyper-V guest IOMMU. It includes
interfaces for capability detection, device/domain management, and IOTLB
flushing.

[ ... ]

> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index b4cb2fa26e9bc..493608e791b45 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h

[ ... ]

> +union hv_create_device_domain_flags {
> +	u32 as_uint32;
> +	struct {
> +		u32 forward_progress_required: 1;
> +		u32 inherit_owning_vtl: 1;
> +		u32 reserved: 30;
> +	} __packed;
> +};
> +
> +struct hv_input_create_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	union hv_create_device_domain_flags create_device_domain_flags;
> +} __packed;

[Severity: Medium]
Does this structure need an explicit 4-byte padding field at the end to
ensure its size is a multiple of 8 bytes?

Because the struct is packed and lacks tail padding, its size is exactly
28 bytes. These structures are often initialized on a reused per-CPU page
like this:

hv_create_device_domain() {
        ...
        memset(input, 0, sizeof(*input));
        ...
}

If the hypervisor expects an 8-byte aligned, 32-byte input for this hypercall,
the memset will clear only the first 28 bytes. That leaves 4 bytes of stale
data from previous hypercalls at the tail of the buffer, which might cause
domain creation to fail if the hypervisor strictly validates that padding
fields are zero.
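
The size arithmetic can be checked with a small userspace sketch. Everything
named demo_* below is hypothetical: the real layout of struct
hv_input_device_domain is not shown here, so a 24-byte stand-in is used
(inferred from the 28-byte total above), and the 32-byte slot models the
assumed hypervisor expectation rather than a documented one. The packed
attribute is the GCC/Clang spelling of __packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 24-byte stand-in for struct hv_input_device_domain. */
struct demo_device_domain {
	uint64_t words[3];
} __attribute__((packed));

/* Layout as posted: packed, no tail padding. */
struct demo_create_unpadded {
	struct demo_device_domain device_domain;
	uint32_t flags;
} __attribute__((packed));

/* Same layout with an explicit reserved field as tail padding. */
struct demo_create_padded {
	struct demo_device_domain device_domain;
	uint32_t flags;
	uint32_t reserved;
} __attribute__((packed));

_Static_assert(sizeof(struct demo_create_unpadded) == 28, "unpadded size");
_Static_assert(sizeof(struct demo_create_padded) == 32, "padded size");

/*
 * Count the bytes of an assumed 32-byte input slot that still hold old
 * data after zeroing only input_size bytes, as
 * memset(input, 0, sizeof(*input)) would.
 */
static size_t stale_bytes_after_memset(size_t input_size)
{
	unsigned char slot[32];
	size_t i, stale = 0;

	assert(input_size <= sizeof(slot));
	memset(slot, 0xff, sizeof(slot));	/* leftovers from an earlier call */
	memset(slot, 0, input_size);		/* the per-call clearing */
	for (i = 0; i < sizeof(slot); i++)
		stale += slot[i] != 0;
	return stale;
}
```

With the unpadded layout, stale_bytes_after_memset() reports 4 leftover bytes;
with the explicit reserved field it reports 0. Whether those bytes matter
still depends on what the hypervisor actually reads for this hypercall.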

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=1


