Linux-HyperV List
* [PATCH v2 5/5] driver core: remove driver_set_override()
From: Danilo Krummrich @ 2026-05-05 13:37 UTC
  To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
	wei.liu, decui, longli, andersson, mathieu.poirier
  Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
	linux-remoteproc, Danilo Krummrich
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>

All buses have been converted from driver_set_override() to the generic
driver_override infrastructure introduced in commit cb3d1049f4ea
("driver core: generalize driver_override in struct device").

Buses now either opt into the generic sysfs callbacks via the
bus_type::driver_override flag, or use device_set_driver_override() /
__device_set_driver_override() directly.

Thus, remove the now-unused driver_set_override() helper.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
 drivers/base/driver.c         | 75 -----------------------------------
 include/linux/device/driver.h |  2 -
 2 files changed, 77 deletions(-)

diff --git a/drivers/base/driver.c b/drivers/base/driver.c
index 8ab010ddf709..7ed834f7199c 100644
--- a/drivers/base/driver.c
+++ b/drivers/base/driver.c
@@ -30,81 +30,6 @@ static struct device *next_device(struct klist_iter *i)
 	return dev;
 }
 
-/**
- * driver_set_override() - Helper to set or clear driver override.
- * @dev: Device to change
- * @override: Address of string to change (e.g. &device->driver_override);
- *            The contents will be freed and hold newly allocated override.
- * @s: NUL-terminated string, new driver name to force a match, pass empty
- *     string to clear it ("" or "\n", where the latter is only for sysfs
- *     interface).
- * @len: length of @s
- *
- * Helper to set or clear driver override in a device, intended for the cases
- * when the driver_override field is allocated by driver/bus code.
- *
- * Returns: 0 on success or a negative error code on failure.
- */
-int driver_set_override(struct device *dev, const char **override,
-			const char *s, size_t len)
-{
-	const char *new, *old;
-	char *cp;
-
-	if (!override || !s)
-		return -EINVAL;
-
-	/*
-	 * The stored value will be used in sysfs show callback (sysfs_emit()),
-	 * which has a length limit of PAGE_SIZE and adds a trailing newline.
-	 * Thus we can store one character less to avoid truncation during sysfs
-	 * show.
-	 */
-	if (len >= (PAGE_SIZE - 1))
-		return -EINVAL;
-
-	/*
-	 * Compute the real length of the string in case userspace sends us a
-	 * bunch of \0 characters like python likes to do.
-	 */
-	len = strlen(s);
-
-	if (!len) {
-		/* Empty string passed - clear override */
-		device_lock(dev);
-		old = *override;
-		*override = NULL;
-		device_unlock(dev);
-		kfree(old);
-
-		return 0;
-	}
-
-	cp = strnchr(s, len, '\n');
-	if (cp)
-		len = cp - s;
-
-	new = kstrndup(s, len, GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	device_lock(dev);
-	old = *override;
-	if (cp != s) {
-		*override = new;
-	} else {
-		/* "\n" passed - clear override */
-		kfree(new);
-		*override = NULL;
-	}
-	device_unlock(dev);
-
-	kfree(old);
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(driver_set_override);
-
 /**
  * driver_for_each_device - Iterator for devices bound to a driver.
  * @drv: Driver we're iterating.
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index bbc67ec513ed..aa3465a369f0 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -160,8 +160,6 @@ int __must_check driver_create_file(const struct device_driver *driver,
 void driver_remove_file(const struct device_driver *driver,
 			const struct driver_attribute *attr);
 
-int driver_set_override(struct device *dev, const char **override,
-			const char *s, size_t len);
 int __must_check driver_for_each_device(struct device_driver *drv, struct device *start,
 					void *data, device_iter_t fn);
 struct device *driver_find_device(const struct device_driver *drv,
-- 
2.54.0
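
For readers tracing what the removed helper did, here is a minimal
userspace sketch of its string handling. It is not the kernel code: the
names sketch_set_override and SKETCH_PAGE_SIZE are hypothetical,
malloc()/free() stand in for kstrndup()/kfree(), a 4 KiB page size is
assumed, and the device_lock() protection is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Userspace model of the removed helper's string handling: "" or "\n"
 * clears the override, a trailing newline is dropped, and overly long
 * input is rejected. Locking is omitted on purpose.
 */
static int sketch_set_override(char **override, const char *s, size_t len)
{
	const char *nl;
	char *new;

	if (!override || !s)
		return -EINVAL;
	if (len >= SKETCH_PAGE_SIZE - 1)
		return -EINVAL;

	/* Ignore anything after the first NUL, as the kernel helper did. */
	len = strlen(s);

	/* Keep only the part before the first newline, if there is one. */
	nl = memchr(s, '\n', len);
	if (nl)
		len = (size_t)(nl - s);

	if (len == 0) {
		/* Empty string or bare newline: clear the override. */
		free(*override);
		*override = NULL;
		return 0;
	}

	new = malloc(len + 1);
	if (!new)
		return -ENOMEM;
	memcpy(new, s, len);
	new[len] = '\0';

	free(*override);
	*override = new;
	return 0;
}
```

Writing "vfio-pci\n" through this model stores "vfio-pci"; writing "\n"
afterwards clears the stored value again.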



* Re: [PATCH net, v3] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
From: Paolo Abeni @ 2026-05-05 13:42 UTC
  To: Dipayaan Roy, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
	edumazet, kuba, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <afQUMClyjmBVfD+u@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On 5/1/26 4:47 AM, Dipayaan Roy wrote:
> @@ -73,10 +74,28 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
>  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
>  
>  	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> +	if (sriov_base_off >= gc->bar0_size ||
> +	    gc->bar0_size - sriov_base_off <
> +		GDMA_PF_REG_SHM_OFF + sizeof(u64) ||
> +	    !IS_ALIGNED(sriov_base_off, sizeof(u64))) {
> +		dev_err(gc->dev,
> +			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> +			sriov_base_off, (u64)gc->bar0_size);
> +		return -EPROTO;
> +	}

I think that the additional fix suggested by sashiko is really worthwhile,
but should go in a separate patch. @Dipayaan: please follow-up on that
one, thanks!

Paolo
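
The range and alignment test in the quoted hunk can be modelled in
userspace as the sketch below. The function name sketch_offset_ok and its
parameters are hypothetical, and the register offset is passed in rather
than taken from the driver's GDMA_PF_REG_SHM_OFF, whose value is not shown
in this thread. It assumes reg_off + reg_size does not itself overflow,
which holds for small constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a register of reg_size bytes at base_off + reg_off
 * lies entirely inside a BAR of bar_size bytes and base_off is aligned
 * to reg_size. Subtracting from bar_size, rather than adding to
 * base_off, keeps the comparison free of unsigned wrap-around.
 */
static bool sketch_offset_ok(uint64_t base_off, uint64_t bar_size,
			     uint64_t reg_off, uint64_t reg_size)
{
	assert(reg_size != 0);

	if (base_off >= bar_size)
		return false;	/* base itself is outside the BAR */
	if (bar_size - base_off < reg_off + reg_size)
		return false;	/* register would run past the end */
	if (base_off % reg_size != 0)
		return false;	/* unaligned base */
	return true;
}
```

A garbage value such as an all-ones read is rejected by the first
comparison, before any subtraction takes place.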



* Re: [PATCH net, v3] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
From: patchwork-bot+netdevbpf @ 2026-05-05 13:50 UTC
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <afQUMClyjmBVfD+u@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Thu, 30 Apr 2026 19:47:12 -0700 you wrote:
> During Function Level Reset recovery, the MANA driver reads
> hardware BAR0 registers that may temporarily contain garbage values.
> The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> to compute gc->shm_base, which is later dereferenced via readl() in
> mana_smc_poll_register(). If the hardware returns an unaligned or
> out-of-range value, the driver must not blindly use it, as this would
> propagate the hardware error into a kernel crash.
> 
> [...]

Here is the summary with links:
  - [net,v3] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
    https://git.kernel.org/netdev/net/c/95084f1883a7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v3] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-05-05 15:00 UTC
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260505-efficient-victorious-degu-d5ec2e@anirudhrb>

On Tue, May 05, 2026 at 06:13:01AM +0000, Anirudh Rayabharam wrote:
> On Thu, Apr 30, 2026 at 02:52:17PM +0000, Stanislav Kinsburskii wrote:
> > Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> > preceding bug-fix patches:
> > 
> > Move "done += completed" before the status checks so that pages mapped
> > by a partially-successful batch are included in the error cleanup unmap.
> > Previously these mappings were leaked on failure.
> > 
> > While here, improve type safety and readability:
> >  - Change "int done" to "u64 done" to match the u64 page_count it is
> >    compared against, avoiding signed/unsigned comparison hazards.
> >  - Use u64 for loop iteration and batch size variables consistently.
> >  - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> >  - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> >  - Simplify the error-path unmap to use "done << large_shift" directly
> >    instead of mutating done in place.
> > 
> > > v3: aligned changes to 80 columns
> > v2: replaced min with min_t
> 
> This part describing the changes in various version should be placed
> after the "---" line below. This way it won't appear in the final commit
> log.
> 
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#commentary

Thanks for the guidance, will do next time.

Thanks,
Stanislav

> 
> Thanks,
> Anirudh.
> 
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_hv_call.c |   56 +++++++++++++++-------------------------
> >  1 file changed, 21 insertions(+), 35 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index e5992c324904a..e1f9e28d5a19b 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	struct hv_input_map_gpa_pages *input_page;
> >  	u64 status, *pfnlist;
> >  	unsigned long irq_flags, large_shift = 0;
> > -	int ret = 0, done = 0;
> > -	u64 page_count = page_struct_count;
> > +	u64 done = 0, page_count = page_struct_count;
> > +	int ret = 0;
> >  
> >  	if (page_count == 0 || (pages && mmio_spa))
> >  		return -EINVAL;
> > @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	}
> >  
> >  	while (done < page_count) {
> > -		ulong i, completed, remain = page_count - done;
> > -		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> > +		u64 i, completed, remain = page_count - done;
> > +		u64 rep_count = min_t(u64, remain, HV_MAP_GPA_BATCH_SIZE);
> >  
> >  		local_irq_save(irq_flags);
> >  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > @@ -224,23 +224,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		input_page->map_flags = flags;
> >  		pfnlist = input_page->source_gpa_page_list;
> >  
> > -		for (i = 0; i < rep_count; i++)
> > -			if (flags & HV_MAP_GPA_NO_ACCESS) {
> > +		for (i = 0; i < rep_count; i++) {
> > +			if (flags & HV_MAP_GPA_NO_ACCESS)
> >  				pfnlist[i] = 0;
> > -			} else if (pages) {
> > -				u64 index = (done + i) << large_shift;
> > -
> > -				if (index >= page_struct_count) {
> > -					ret = -EINVAL;
> > -					break;
> > -				}
> > -				pfnlist[i] = page_to_pfn(pages[index]);
> > -			} else {
> > +			else if (pages)
> > +				pfnlist[i] = page_to_pfn(pages[(done + i) <<
> > +							 large_shift]);
> > +			else
> >  				pfnlist[i] = mmio_spa + done + i;
> > -			}
> > -		if (ret) {
> > -			local_irq_restore(irq_flags);
> > -			break;
> >  		}
> >  
> >  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> > @@ -248,29 +239,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		local_irq_restore(irq_flags);
> >  
> >  		completed = hv_repcomp(status);
> > +		done += completed;
> >  
> >  		if (hv_result_needs_memory(status)) {
> >  			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> >  						    HV_MAP_GPA_DEPOSIT_PAGES);
> >  			if (ret)
> >  				break;
> > -
> >  		} else if (!hv_result_success(status)) {
> >  			ret = hv_result_to_errno(status);
> >  			break;
> >  		}
> > -
> > -		done += completed;
> >  	}
> >  
> >  	if (ret && done) {
> >  		u32 unmap_flags = 0;
> >  
> > -		if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > +		if (flags & HV_MAP_GPA_LARGE_PAGE)
> >  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > -			done <<= large_shift;
> > -		}
> > -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > +		hv_call_unmap_gpa_pages(partition_id, gfn,
> > +					done << large_shift, unmap_flags);
> >  	}
> >  
> >  	return ret;
> > @@ -305,7 +293,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> >  	struct hv_input_unmap_gpa_pages *input_page;
> >  	u64 status, page_count = page_count_4k;
> >  	unsigned long irq_flags, large_shift = 0;
> > -	int ret = 0, done = 0;
> > +	u64 done = 0;
> >  
> >  	if (page_count == 0)
> >  		return -EINVAL;
> > @@ -319,8 +307,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> >  	}
> >  
> >  	while (done < page_count) {
> > -		ulong completed, remain = page_count - done;
> > -		int rep_count = min(remain, HV_UMAP_GPA_PAGES);
> > +		u64 completed, remain = page_count - done;
> > +		u64 rep_count = min_t(u64, remain, HV_UMAP_GPA_PAGES);
> >  
> >  		local_irq_save(irq_flags);
> >  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > @@ -333,15 +321,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> >  		local_irq_restore(irq_flags);
> >  
> >  		completed = hv_repcomp(status);
> > -		if (!hv_result_success(status)) {
> > -			ret = hv_result_to_errno(status);
> > -			break;
> > -		}
> > -
> >  		done += completed;
> > +
> > +		if (!hv_result_success(status))
> > +			return hv_result_to_errno(status);
> >  	}
> >  
> > -	return ret;
> > +	return 0;
> >  }
> >  
> >  int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> > 
> > 


* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-05-05 15:43 UTC
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <afmK531eRcPCecKm@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, May 04, 2026 at 11:15:03PM -0700, Shradha Gupta wrote:
> On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> > On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > > In mana driver, the number of IRQs allocated is capped by the
> > > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > > their NUMA/core bindings.
> > > > > 
> > > > > This is important, especially in the envs where number of vCPUs are so
> > > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > > 
> > > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > > same vCPU, while some vCPUs have none.
> > > > > 
> > > > > In such cases when many parallel TCP connections are tested, the
> > > > > throughput drops significantly.
> > > > > 
> > > > > Test envs:
> > > > > =======================================================
> > > > > Case 1: without this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > > 	TYPE		effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:	HWC		0
> > > > > IRQ1:	mana_q1		0
> > > > > IRQ2:	mana_q2		2
> > > > > IRQ3:	mana_q3		0
> > > > > IRQ4:	mana_q4		3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU		0	1	2	3
> > > > > =======================================================
> > > > > pass 1:		38.85	0.03	24.89	24.65
> > > > > pass 2:		39.15	0.03	24.57	25.28
> > > > > pass 3:		40.36	0.03	23.20	23.17
> > > > > 
> > > > > =======================================================
> > > > > Case 2: with this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > >         TYPE            effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:   HWC             0
> > > > > IRQ1:   mana_q1         0
> > > > > IRQ2:   mana_q2         1
> > > > > IRQ3:   mana_q3         2
> > > > > IRQ4:   mana_q4         3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU            0       1       2       3
> > > > > =======================================================
> > > > > pass 1:         15.42	15.85	14.99	14.51
> > > > > pass 2:         15.53	15.94	15.81	15.93
> > > > > pass 3:         16.41	16.35	16.40	16.36
> > > > > 
> > > > > =======================================================
> > > > > Throughput Impact(in Gbps, same env)
> > > > > =======================================================
> > > > > TCP conn	with patch	w/o patch
> > > > > 20480		15.65		7.73
> > > > > 10240		15.63		8.93
> > > > > 8192		15.64		9.69
> > > > > 6144		15.64		13.16
> > > > > 4096		15.69		15.75
> > > > > 2048		15.69		15.83
> > > > > 1024		15.71		15.28
> > > > > 
> > > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > > Cc: stable@vger.kernel.org
> > > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > > ---
> > > > > Changes in v2
> > > > >  * Removed the unused skip_first_cpu variable
> > > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > > >  * added appropriate comments to indicate expected behaviour when
> > > > >    IRQs are more than or equal to num_online_cpus()
> > > > > ---
> > > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > > > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > index 098fbda0d128..d740d1dc43da 100644
> > > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > > >  	} else {
> > > > >  		/* If dynamic allocation is enabled we have already allocated
> > > > >  		 * hwc msi
> > > > > +		 * Also, we make sure in this case the following is always true
> > > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > > >  		 */
> > > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > > >  	}
> > > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +/* should be called with cpus_read_lock() held */
> > > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	for_each_online_cpu(cpu) {
> > > > > +		if (len == 0)
> > > > > +			break;
> > > > > +
> > > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > > +		len--;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  {
> > > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > >  	struct gdma_irq_context *gic;
> > > > > -	bool skip_first_cpu = false;
> > > > >  	int *irqs, irq, err, i;
> > > > >  
> > > > >  	irqs = kmalloc_objs(int, nvec);
> > > > 
> > > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > > Hey Yury,
> > > 
> > > I am still a bit unsure about the WARN_ON() before kmalloc, because even
> > > after it, until we take cpus_read_lock() in the same function,
> > > num_online_cpus() can still change (or decrease). That's why I introduced
> > > the dev_dbg() to capture the hot-remove edge case.
> > 
> > OK.
> >  
> > > Do you still think it adds more value?
> > 
> > It's your driver, so you know better. I just wonder because you said
> > it's good to add WARN_ON(), and then didn't do that.
> > 
> > > > 
> > > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > > >  	 */
> > > > >  	cpus_read_lock();
> > > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > > -		skip_first_cpu = true;
> > > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > > +		if (err) {
> > > > > +			cpus_read_unlock();
> > > > > +			goto free_irq;
> > > > 
> > > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > > distribute. What do I miss?
> > > > 
> > > 
> > > Let me explain this case a bit better then,
> > > 
> > > - num_msix_usable = HWC IRQ + Queue IRQ
> > > - nvec in this function is only Queue IRQ (HWC already setup)
> > > 
> > > When num_online_cpus == num_msix_usable:
> > > - nvec = num_online_cpus - 1
> > > - first CPU is already assigned to HWC IRQ, so skip it
> > > - Queue IRQs fit in the remaining CPUs
> > > 
> > > please let me know if I did not get your question right
> > 
> > Can you put that in a comment?
> 
> Sure I will. thanks
> 
> > 
> > > > > +		}
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > > +		 * CPU core affinity does not matter.
> > > > 
> > > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > > In theory, the system would have most of flexibility to balance them.
> > > > 
> > > 
> > > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > > because in such a case we want to anyway exhaust and distribute the
> > > Queue IRQs to all vCPUs.
> > > We don't want to rely on the system's balancer in this case as it could
> > > be skewed by other devices' IRQ weights
> > 
> > I don't understand this. If I want to reserve some CPUs to solely
> > handle IRQs from my high-priority hardware, then I configure my system
> > accordingly. For example, assign all non-networking IRQs on CPU0, and
> > all networking IRQs to all CPUs.
> > 
> > In your case, you distribute IRQs evenly, which means you've no
> > preferred CPUs. So, assuming the system is only running your IRQ
> > driver, it's at most as good as all-CPU distribution. If some
> > particular CPU is heavily loaded, your scheme could cause the
> > corresponding IRQs to starve.
> > 
> > I recall, when we were working on irq_setup(), the original idea was to
> > distribute IRQs one-to-one, but then I suggested the
> > 
> >         irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > 
> > and after experiments, you agreed on that.
> > 
> > Can you please run your throughput test for my suggested distribution
> > too? Would be also nice to see how each distribution works when some
> > CPUs are under stress.
> > 
> > Thanks,
> > Yury
> 
> The design of irq_setup() works exactly how we want it for our IRQs for
> almost all of our usecases, so we want to keep that as is. The only
> scenarios where this is an issue in terms of significant throughput drop
> is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
> connection counts) and where there are additional NVMe devices attached
> to the VM.
> 
> The current patch about utilizing all the vCPUs helps in that case and
> doesn't cause any regression for other cases.
> 
> This linear path is only taken when num_msix_usable > num_online_cpus(),
> which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
> as before.
> 
> We can definitely get our throughput run results on other suggestions
> you have. And about that, I just needed a bit more clarity on what to
> test against. Are you suggesting, with irq_setup() intact and in use, we
> configure the non-mana IRQs to say CPU0 and capture the numbers?

Can you try this:

       while(len--)
               // Or cpu_online_mask or cpu_all_mask?
               irq_set_affinity_and_hint(*irqs++, NULL);

And compare it to the linear version under your vCPU scenario?

Can you run your throughput test alone and in parallel with some
IRQ torture test?

        stress-ng --timer 4 --timeout 60s

And maybe pin the stress test to the default CPU. Assuming it's 0:

        taskset -c 0 stress-ng --timer 4 --timeout 60s

Unless the 'linear' version is significantly faster, I'd stick to the
above.

Thanks,
Yury
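
For reference, the one-IRQ-per-CPU ("linear") placement discussed in this
thread, and the condition under which the patch takes that path, can be
modelled in userspace as below. This is a sketch only: the names
sketch_assign_linear and sketch_use_linear are hypothetical, plain arrays
replace cpumasks, and no real affinity is set. It models the patch's
linear path, not the all-CPU alternative proposed above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pin IRQ i to the i-th online CPU, stopping when either the IRQs or the
 * CPUs run out. irq_cpu[i] receives the CPU chosen for IRQ i.
 * Returns the number of IRQs that were assigned.
 */
static size_t sketch_assign_linear(const int *online_cpus, size_t ncpus,
				   int *irq_cpu, size_t nirqs)
{
	size_t i;

	assert(online_cpus != NULL && irq_cpu != NULL);

	for (i = 0; i < ncpus && i < nirqs; i++)
		irq_cpu[i] = online_cpus[i];

	return i;
}

/*
 * Path selection as described in the thread: the linear path is used only
 * when the usable MSI-X count (HWC + queues) exceeds the online CPU count.
 */
static int sketch_use_linear(unsigned int num_msix_usable,
			     unsigned int num_online)
{
	return num_msix_usable > num_online;
}
```

For the 4-vCPU case in the quoted commit message (5 usable vectors, so 4
queue IRQs), the linear path is selected and the queue IRQs land on CPUs
0, 1, 2 and 3, matching the "with this patch" table.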


* Re: [PATCH net, v3] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
From: Dipayaan Roy @ 2026-05-05 16:28 UTC
  To: Paolo Abeni
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <30f588ad-cf80-432b-bde3-13b3c0d5a124@redhat.com>

On Tue, May 05, 2026 at 03:42:46PM +0200, Paolo Abeni wrote:
> On 5/1/26 4:47 AM, Dipayaan Roy wrote:
> > @@ -73,10 +74,28 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >  
> >  	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> > +	if (sriov_base_off >= gc->bar0_size ||
> > +	    gc->bar0_size - sriov_base_off <
> > +		GDMA_PF_REG_SHM_OFF + sizeof(u64) ||
> > +	    !IS_ALIGNED(sriov_base_off, sizeof(u64))) {
> > +		dev_err(gc->dev,
> > +			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > +			sriov_base_off, (u64)gc->bar0_size);
> > +		return -EPROTO;
> > +	}
> 
> I think that the additional fix suggested by sashiko is really worthwhile,
> but should go in a separate patch. @Dipayaan: please follow-up on that
> one, thanks!
> 
> Paolo
>
Hi Paolo,

Thanks for reviewing. I will cross-check and send out a separate patch for
the issue pointed out by Sashiko (unrelated to the current issue).

Regards
Dipayaan Roy




* RE: [PATCH v2] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Dexuan Cui @ 2026-05-05 17:11 UTC
  To: Dexuan Cui, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, mhklinux@outlook.com,
	matthew.ruffell@canonical.com, johansen@templeofstupid.com,
	hargar@linux.microsoft.com
  Cc: stable@vger.kernel.org
In-Reply-To: <20260505004846.193441-1-decui@microsoft.com>

> From: Dexuan Cui <decui@microsoft.com>
> Sent: Monday, May 4, 2026 5:49 PM
>  ...
> If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
> the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
> screen.lfb_base being zero [1], there is an MMIO conflict between the
> drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
> hv_pci_allocate_bridge_windows() calls vmbus_allocate_mmio() to get a

The "hv_pci_allocate_bridge_windows()" above should be changed to
       "hv_allocate_config_window()"

> 32-bit MMIO range, it may get an MMIO range that overlaps with the
> framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
> error message "PCI Pass-through VSP failed D0 Entry with status" since
> the host thinks that PCI devices must not use MMIO space that the
> host has assigned to the framebuffer.



* [PATCH v2] scsi: storvsc: Replace symbolic permissions with octal
From: Md Shofiqul Islam @ 2026-05-06  0:49 UTC
  To: linux-scsi, linux-hyperv, linux-kernel
  Cc: Md Shofiqul Islam, longli, kys, haiyangz, wei.liu, decui,
	mhklinux

Symbolic permissions like S_IRUGO and S_IWUSR are not preferred by
checkpatch. Replace with their octal equivalents:

  - S_IRUGO|S_IWUSR -> 0644
  - S_IRUGO         -> 0444

Signed-off-by: Md Shofiqul Islam <shofiqtest@gmail.com>
---
 drivers/scsi/storvsc_drv.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6977ca8a0..571ea5491 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -156,7 +156,7 @@ static bool hv_dev_is_fc(struct hv_device *hv_dev);
 #define STORVSC_LOGGING_WARN	2
 
 static int logging_level = STORVSC_LOGGING_ERROR;
-module_param(logging_level, int, S_IRUGO|S_IWUSR);
+module_param(logging_level, int, 0644);
 MODULE_PARM_DESC(logging_level,
 	"Logging level, 0 - None, 1 - Error (default), 2 - Warning.");
 
@@ -345,17 +345,17 @@ static int storvsc_change_queue_depth(struct scsi_device *sdev, int queue_depth)
 static int storvsc_vcpus_per_sub_channel = 4;
 static unsigned int storvsc_max_hw_queues;
 
-module_param(storvsc_ringbuffer_size, int, S_IRUGO);
+module_param(storvsc_ringbuffer_size, int, 0444);
 MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
 
 module_param(storvsc_max_hw_queues, uint, 0644);
 MODULE_PARM_DESC(storvsc_max_hw_queues, "Maximum number of hardware queues");
 
-module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
+module_param(storvsc_vcpus_per_sub_channel, int, 0444);
 MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
 
 static int ring_avail_percent_lowater = 10;
-module_param(ring_avail_percent_lowater, int, S_IRUGO);
+module_param(ring_avail_percent_lowater, int, 0444);
 MODULE_PARM_DESC(ring_avail_percent_lowater,
 		"Select a channel if available ring size > this in percent");
 
-- 
2.54.0.windows.1
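
The octal equivalents stated in the commit message can be checked from
userspace with the short sketch below. S_IRUGO is not available to
userspace programs, so it is defined here the same way the kernel's
include/linux/stat.h defines it; the helper names sketch_perm_rw and
sketch_perm_ro are hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>

/* The kernel defines S_IRUGO in include/linux/stat.h; userspace does not. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* Compile-time check of the two replacements made by the patch. */
_Static_assert((S_IRUGO | S_IWUSR) == 0644, "S_IRUGO|S_IWUSR must be 0644");
_Static_assert(S_IRUGO == 0444, "S_IRUGO must be 0444");

/* Read for everyone, write for the owner. */
static unsigned int sketch_perm_rw(void)
{
	return S_IRUGO | S_IWUSR;
}

/* Read-only for everyone. */
static unsigned int sketch_perm_ro(void)
{
	return S_IRUGO;
}
```

If either equivalence were wrong, the file would fail to compile, so the
check costs nothing at run time.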



* Re: [PATCH net-next v3 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: patchwork-bot+netdevbpf @ 2026-05-06  2:30 UTC
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya
In-Reply-To: <20260502074552.23857-1-gargaditya@linux.microsoft.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sat,  2 May 2026 00:45:32 -0700 you wrote:
> The MANA driver can fail to load on systems with high memory
> utilization because several allocations in the queue setup paths
> require large physically contiguous blocks via kmalloc. Under memory
> fragmentation these high-order allocations may fail, preventing the
> driver from creating queues when opening the interface or when
> reconfiguring channels, ring parameters or MTU at runtime.
> 
> [...]

Here is the summary with links:
  - [net-next,v3,1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
    https://git.kernel.org/netdev/net-next/c/d07efe5a6e64
  - [net-next,v3,2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
    https://git.kernel.org/netdev/net-next/c/3af0820c878e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-05-06  5:11 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
In-Reply-To: <SN6PR02MB4157B99838CE7498933B2382D4312@SN6PR02MB4157.namprd02.prod.outlook.com>



On 5/4/2026 9:36 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:57 AM
>>
>> On 4/27/2026 11:08 AM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>>>
> 
> [snip]
> 
>>>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>>>> index 08278547b84c..b4d80c9a673a 100644
>>>> --- a/arch/x86/include/asm/mshyperv.h
>>>> +++ b/arch/x86/include/asm/mshyperv.h
>>>> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
>>>>    #ifdef CONFIG_HYPERV_VTL_MODE
>>>>    void __init hv_vtl_init_platform(void);
>>>>    int __init hv_vtl_early_init(void);
>>>> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>>>    void mshv_vtl_return_call_init(u64 vtl_return_offset);
>>>>    void mshv_vtl_return_hypercall(void);
>>>>    void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>>> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>>>>    #else
>>>>    static inline void __init hv_vtl_init_platform(void) {}
>>>>    static inline int __init hv_vtl_early_init(void) { return 0; }
>>>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>>>    static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>>>>    static inline void mshv_vtl_return_hypercall(void) {}
>>>>    static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>>> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
>>>> index a6eea52f7aa2..103f07371f3f 100644
>>>> --- a/drivers/hv/mshv_vtl.h
>>>> +++ b/drivers/hv/mshv_vtl.h
>>>> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
>>>>    	char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
>>>>    };
>>>>
>>>> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
>>>> +	      "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
>>>> +
>>>>    #endif /* _MSHV_VTL_H */
>>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>>> index db183c8cfb95..8cdf2a9fbdfb 100644
>>>> --- a/include/asm-generic/mshyperv.h
>>>> +++ b/include/asm-generic/mshyperv.h
>>>> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>>>
>>>>    #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>>>    u8 __init get_vtl(void);
>>>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>>>    #else
>>>>    static inline u8 get_vtl(void) { return 0; }
>>>> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>>
>>> Is this stub needed? Maybe I missed something, but it looks to me like none
>>> of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
>>> See further comments about stubs in Patch 8 of this series.
>>>
>>
>> Config dependencies would handle such cases, and this is not required. I
>> saw similar stubs added in the code, so I thought this is a norm that
>> should be followed, and not rely on config dependencies.
>> I can remove it.
>>
> 
> Others might disagree with me, but I don't think it's the norm to add
> stubs when they aren't truly needed. As you can see from some of my
> other comments, I look for ways to eliminate stubs. Stubs are indicative
> of a boundary between separately built components, and I generally
> try to minimize the surface area of such boundaries. A large surface area
> often means that the overall design could be improved by re-thinking
> which code goes with which component.
> 
> Michael

I agree. I'll remove these in next version.

Thanks,
Naman
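
To illustrate the point about stubs, here is a minimal sketch (not code
from this series; DEMO_VTL_MODE and every demo_* name are hypothetical
stand-ins). When a function and its only callers are built under the
same config option, no #else stub is needed, because nothing outside
that option references the function:

```c
#include <assert.h>

/* Hypothetical stand-in for a Kconfig option such as CONFIG_HYPERV_VTL_MODE. */
#ifndef DEMO_VTL_MODE
#define DEMO_VTL_MODE 1
#endif

struct demo_ctx {
	int returns;	/* counts how often the "return call" ran */
};

#if DEMO_VTL_MODE
/* The implementation and its only caller sit behind the same option. */
static void demo_vtl_return_call(struct demo_ctx *ctx)
{
	assert(ctx);
	ctx->returns++;
}

static int demo_vtl_run(struct demo_ctx *ctx)
{
	demo_vtl_return_call(ctx);
	return ctx->returns;
}
/* No #else branch: with the option off, no caller is compiled either. */
#endif

/* Code built unconditionally never names demo_vtl_return_call(). */
static int demo_common_init(void)
{
	return 0;
}
```

A stub would only become necessary if demo_common_init() (or other
unconditionally built code) had to call demo_vtl_return_call(), which is
the kind of cross-component boundary being discussed above.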


* Re: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Naman Jain @ 2026-05-06  5:50 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com, Michael Kelley, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
In-Reply-To: <SN6PR02MB4157AD364DE0755DE070D286D4312@SN6PR02MB4157.namprd02.prod.outlook.com>



On 5/4/2026 9:36 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:58 AM
>>
>> On 4/27/2026 11:10 AM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>>>
>>>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
>>>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
>>>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
>>>> in architecture-specific code.
>>>>
>>>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
>>>> include/asm-generic/mshyperv.h so they are visible to both arch and
>>>> driver code.
>>>>
>>>> Change the return type from void to bool so the caller can determine
>>>> whether the register page was successfully configured and set
>>>> mshv_has_reg_page accordingly.
>>>>
>>>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>>>> ---
>>>>    arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
>>>>    drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
>>>>    include/asm-generic/mshyperv.h | 17 ++++++++++++
>>>>    3 files changed, 53 insertions(+), 45 deletions(-)
>>>>
>>
>> <snip>
>>
>>>>    #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>>>
>>> This comment pre-dates your patch, but I don't understand the point
>>> it is trying to make. The comment is factually true, but I don't know
>>> why calling that out is relevant. The REG_PAGE MSR seems to be
>>> conceptually separate and distinct from the SIMP MSR, so the fact
>>> that the layouts are the same is just a coincidence. Or is there some
>>> relationship between the two MSRs that I'm not aware of, and the
>>> comment is trying (and failing?) to point out?
>>
>> This was added as per suggestion from Nuno in my initial series for
>> MSHV_VTL. If the reference in "identical to" is misleading, I should
>> remove it.
>>
>> https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/
>>
>> Quoting:
>> """
>> it is a generic structure that
>> appears to be used for several overlay page MSRs (SIMP, SIEF, etc).
>>
>> But, the type doesn't appear in the hv*dk headers explicitly; it's just
>> used internally by the hypervisor.
>>
>> I think it should be renamed with a hv_ prefix to indicate it's part of
>> the hypervisor ABI, and a brief comment with the provenance:
>>
>> /* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>> union hv_synic_overlay_page_msr {
>> 	/* <snip> */
>> };
> 
> OK, so this union is not associated *only* with the REG_PAGE MSR
> (though that MSR is the only current user). Instead, it is intended to
> be a more generic description of MSRs that set up overlay pages. I
> don't think I had previously noticed Nuno's comment on the topic.
> 
> Looking through hvgdk_mini.h and hvhdk.h, I see 6 definitions that
> are exactly the same:
> 
> * union hv_reference_tsc_msr
> * union hv_x64_msr_hypercall_contents
> * union hv_vp_assist_msr_contents
> * union hv_synic_simp
> * union hv_synic_siefp
> * union hv_synic_sirbp
> 
> There's an argument to be made for removing these 6 unique definitions
> and using union hv_synic_overlay_page_msr instead (though "synic"
> would need to be removed from the name).  I would not object to such
> an approach. It's a small extra layer of conceptual indirection, but saves
> some lines of code for duplicative definitions. The alternative is to drop
> the idea of a generic overlay page MSR layout, and replace union
> hv_synic_overlay_page_msr with a definition that is specific to the
> REG_PAGE MSR, like the other six above.
> 

Hi Michael,

While a generic definition would be nice to have here, I can see two
reasons not to go ahead with a generic overlay page definition:
1. All of the above definitions are present in the Hyper-V headers, and
generalizing them would deviate from the strategy of keeping the kernel
headers in line with the Hyper-V headers.
2. If a use case for any of these definitions requires some of the
reserved bits, a shared definition would become a problem. I can
actually see that happening for "hv_x64_msr_hypercall_contents" in the
corresponding variant in the Hyper-V header.

> I could go either way. If we want to use a generic overlay page definition,
> then that approach should be applied everywhere. With the current
> state of your patch set, we're halfway in between -- the generic definition
> is used one place, but duplicative specific MSR definitions are used other
> places. That's probably the least desirable approach.
> 
> Michael


Now, coming back to the hv_synic_overlay_page_msr definition. While
Nuno's comment hinted at it being "generic", that is not reflected in
the name of this structure or in its comments. So it should be safe to
assume that it is specific to synic_overlay_page_msr usage. But since it
is not part of the Hyper-V headers as such, we needed that comment:
"/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */"

Please let me know your thoughts on this.

Regards,
Naman
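
For reference, a minimal userspace sketch of the layout under
discussion (not code from this series; the demo_* names are
hypothetical). It assumes the enable bit / 11 reserved bits / 52-bit
page frame number shape that hv_synic_simp has, and it relies on
GCC/Clang bit-field allocation on a little-endian target, which the C
standard leaves implementation-defined:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generic overlay-page MSR layout (assumed shape, see above). */
union demo_overlay_page_msr {
	uint64_t as_uint64;
	struct {
		uint64_t enabled  : 1;	/* bit 0: overlay page enabled */
		uint64_t reserved : 11;	/* bits 1-11: reserved */
		uint64_t pfn      : 52;	/* bits 12-63: guest page frame number */
	};
};

static_assert(sizeof(union demo_overlay_page_msr) == sizeof(uint64_t),
	      "overlay page MSR must be exactly 64 bits");

/* Build the raw MSR value for a 4 KiB-aligned guest physical address. */
static uint64_t demo_overlay_msr_encode(uint64_t gpa, int enable)
{
	union demo_overlay_page_msr msr = { .as_uint64 = 0 };

	msr.enabled = enable ? 1 : 0;
	msr.pfn = gpa >> 12;
	return msr.as_uint64;
}
```

A register-specific definition would look the same until one register
gives a name to one of the reserved bits, which is the concern raised
in point 2 above.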


* Re: [PATCH net-next v7] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-06  5:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	yury.norov, shirazsaleem, kees, linux-hyperv, netdev,
	linux-kernel, linux-rdma
In-Reply-To: <20260502005409.3645108-1-kuba@kernel.org>

On Fri, May 01, 2026 at 05:54:09PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Expose hardware diagnostic info via debugfs
> 
> This patch adds debugfs entries under /sys/kernel/debug/mana/<BDF>/ to
> expose device-level and per-vPort hardware configuration and diagnostic
> information. Debugfs creation and removal are moved into mana_gd_setup()
> and a renamed mana_gd_cleanup_device() so probe, remove, suspend, resume
> and shutdown share a single path.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index a654b3699c4c..077d3a1ff6bf 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> 
> [ ... ]
> 
> > @@ -3659,6 +3684,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> >  
> >  	ac->bm_hostmode = bm_hostmode;
> >  
> > +	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
> > +			   &ac->num_ports);
> > +	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
> > +			  &ac->bm_hostmode);
> > +
> >  	if (!resuming) {
> >  		ac->num_ports = num_ports;
> >  	} else {
> 
> Is the num_vports debugfs file pointed at ac->num_ports before that
> field has been assigned from the queried num_ports value?
> 
> On a fresh probe (!resuming), the sequence in mana_probe() is:
> 
>     err = mana_query_device_cfg(ac, ..., &num_ports, &bm_hostmode);
>     ...
>     ac->bm_hostmode = bm_hostmode;
> 
>     debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
>                        &ac->num_ports);
>     ...
> 
>     if (!resuming) {
>             ac->num_ports = num_ports;
>     } else {
>             ...
>     }
>     ...
>     if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
>             ac->num_ports = MAX_PORTS_IN_MANA_DEV;
> 
> During the window between debugfs_create_u16() and the assignment (and
> subsequent clamp to MAX_PORTS_IN_MANA_DEV), would a reader of the
> num_vports debugfs file observe 0 rather than the hardware-reported
> value?
> 
> Would it be clearer to create this file after ac->num_ports has been
> assigned and clamped, so the exposed value matches what the commit
> message describes as "Device configuration"?
> -- 
> pw-bot: cr
Thank you for pointing this out, Jakub.
I will update the comment in the next version.

- Vennela
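
The ordering issue raised in the review can be modelled in plain C. This
is a minimal userspace sketch, not driver code: the demo_* names and the
DEMO_MAX_PORTS value are hypothetical, and struct demo_file only stands
in for a debugfs file that stores a pointer and dereferences it on every
read.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 16	/* made-up stand-in for MAX_PORTS_IN_MANA_DEV */

/* Models a file that keeps a pointer and reads through it on each access. */
struct demo_file {
	const uint16_t *val;
};

struct demo_ctx {
	uint16_t num_ports;
	struct demo_file num_vports;
};

static uint16_t demo_read(const struct demo_file *f)
{
	assert(f->val);
	return *f->val;
}

static void demo_assign_and_clamp(struct demo_ctx *ac, uint16_t hw_ports)
{
	ac->num_ports = hw_ports;
	if (ac->num_ports > DEMO_MAX_PORTS)
		ac->num_ports = DEMO_MAX_PORTS;
}

/* Publish first, assign later: returns what a reader in the window sees. */
static uint16_t demo_probe_publish_first(struct demo_ctx *ac, uint16_t hw_ports)
{
	uint16_t seen;

	ac->num_vports.val = &ac->num_ports;	/* file becomes visible */
	seen = demo_read(&ac->num_vports);	/* reader arrives in the window */
	demo_assign_and_clamp(ac, hw_ports);
	return seen;
}

/* Assign and clamp first, publish last: readers only see the final value. */
static uint16_t demo_probe_publish_last(struct demo_ctx *ac, uint16_t hw_ports)
{
	demo_assign_and_clamp(ac, hw_ports);
	ac->num_vports.val = &ac->num_ports;
	return demo_read(&ac->num_vports);
}
```

With a zero-initialized context, the publish-first variant lets a reader
observe the not-yet-assigned field, while the publish-last variant only
ever exposes the clamped value.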


* [PATCH net-next v8] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-06  5:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta, ernis,
	dipayanroy, shirazsaleem, yury.norov, kees, linux-hyperv, netdev,
	linux-kernel, linux-rdma

Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.

The debugfs directory for each PCI device is named using pci_name()
(the unique BDF address), and its creation and removal is integrated
into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
that all callers (probe, remove, suspend, resume, shutdown) share a
single code path.

Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
  - num_msix_usable, max_num_queues: Max resources from hardware
  - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
  - num_vports, bm_hostmode: Device configuration

Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
  - port_handle: Hardware vPort handle
  - max_sq, max_rq: Max queues from vPort config
  - indir_table_sz: Indirection table size
  - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
    Last applied steering configuration parameters

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v8:
* Move debugfs_create_u16("num_vports", ...) and
  debugfs_create_u8("bm_hostmode", ...) to after ac->num_ports has been
  assigned and clamped to MAX_PORTS_IN_MANA_DEV, so the value exposed
  via debugfs always reflects the final, hardware-reported count
  rather than a transient zero or unclamped value.
* Update the stale comment above mana_gd_resume() to reflect the new
  rollback-on-failure behavior.
Changes in v7:
* Rebase to latest main.
Changes in v6:
* Move out of patchset and create a separate patch.
Changes in v5:
* Update commit message.
* Fix conflicts to align with the new patches.
* Make it part of patchset.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is nonzero.
Changes in v2:
* Add debugfs_remove_recursive for gc->mana_pci_debugfs in
  mana_gd_suspend to handle creation of duplicates in the
  mana_gd_setup and mana_gd_resume paths.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 73 +++++++++++--------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 33 +++++++++
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 ++
 4 files changed, 83 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..9e9a97eef7f0 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -194,6 +194,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues > gc->num_msix_usable - 1)
 		gc->max_num_queues = gc->num_msix_usable - 1;
 
+	debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+			   &gc->num_msix_usable);
+	debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+			   &gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1264,6 +1269,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 	gc->pf_cap_flags1 = resp.pf_cap_flags1;
+	gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+	debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+			   &gc->gdma_protocol_ver);
+	debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+			   &gc->pf_cap_flags1);
+
 	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
 		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
 		if (err) {
@@ -1943,15 +1955,20 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	int err;
 
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
+
 	err = mana_gd_init_registers(pdev);
 	if (err)
-		return err;
+		goto remove_debugfs;
 
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
 	gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
-	if (!gc->service_wq)
-		return -ENOMEM;
+	if (!gc->service_wq) {
+		err = -ENOMEM;
+		goto remove_debugfs;
+	}
 
 	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
@@ -1992,11 +2009,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
 free_workqueue:
 	destroy_workqueue(gc->service_wq);
 	gc->service_wq = NULL;
+remove_debugfs:
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
 	dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
 	return err;
 }
 
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
@@ -2008,6 +2028,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
 		destroy_workqueue(gc->service_wq);
 		gc->service_wq = NULL;
 	}
+
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
+
 	dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
 }
 
@@ -2065,9 +2089,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
-						  mana_debugfs_root);
-
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2096,16 +2117,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 cleanup_mana:
 	mana_remove(&gc->mana, false);
 cleanup_gd:
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 unmap_bar:
-	/*
-	 * at this point we know that the other debugfs child dir/files
-	 * are either not yet created or are already cleaned up.
-	 * The pci debugfs folder clean-up now, will only be cleaning up
-	 * adapter-MTU file and apc->mana_pci_debugfs folder.
-	 */
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-	gc->mana_pci_debugfs = NULL;
 	xa_destroy(&gc->irq_contexts);
 	pci_iounmap(pdev, bar0_va);
 free_gc:
@@ -2155,11 +2168,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, false);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	xa_destroy(&gc->irq_contexts);
 
@@ -2181,14 +2190,13 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 
 	return 0;
 }
 
-/* In case the NIC hardware stops working, the suspend and resume callbacks will
- * fail -- if this happens, it's safer to just report an error than try to undo
- * what has been done.
+/* If resume fails partway through, roll back any setup that completed so
+ * the device is left in a clean state and resources are not leaked.
  */
 int mana_gd_resume(struct pci_dev *pdev)
 {
@@ -2201,13 +2209,18 @@ int mana_gd_resume(struct pci_dev *pdev)
 
 	err = mana_probe(&gc->mana, true);
 	if (err)
-		return err;
+		goto cleanup_gd;
 
 	err = mana_rdma_probe(&gc->mana_ib);
 	if (err)
-		return err;
+		goto cleanup_mana;
 
 	return 0;
+cleanup_mana:
+	mana_remove(&gc->mana, true);
+cleanup_gd:
+	mana_gd_cleanup_device(pdev);
+	return err;
 }
 
 /* Quiesce the device for kexec. This is also called upon reboot/shutdown. */
@@ -2220,11 +2233,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a654b3699c4c..26bd3d270b5e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1276,6 +1276,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
 
+	apc->vport_max_sq = *max_sq;
+	apc->vport_max_rq = *max_rq;
+
 	return 0;
 }
 
@@ -1430,6 +1433,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
+
+	apc->steer_rx = rx;
+	apc->steer_rss = apc->rss_state;
+	apc->steer_update_tab = update_tab;
+	apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
 out:
 	kfree(req);
 	return err;
@@ -3161,6 +3169,23 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+			   &apc->port_handle);
+	debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_sq);
+	debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_rq);
+	debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+			   &apc->indir_table_sz);
+	debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rx);
+	debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rss);
+	debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+			   &apc->steer_update_tab);
+	debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+			   &apc->steer_cqe_coalescing);
 	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
 			   &apc->speed);
 	return 0;
@@ -3678,6 +3703,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
 		ac->num_ports = MAX_PORTS_IN_MANA_DEV;
 
+	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+			   &ac->num_ports);
+	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+			  &ac->bm_hostmode);
+
 	ac->per_port_queue_reset_wq =
 		create_singlethread_workqueue("mana_per_port_queue_reset_wq");
 	if (!ac->per_port_queue_reset_wq) {
@@ -3800,6 +3830,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_gd_deregister_device(gd);
 
+	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+
 	if (suspending)
 		return;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6d836060976a..70d62bc32837 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
 	struct gdma_dev		mana_ib;
 
 	u64 pf_cap_flags1;
+	u64 gdma_protocol_ver;
 
 	struct workqueue_struct *service_wq;
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..18215388d2c7 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
 
 	/* Debugfs */
 	struct dentry *mana_port_debugfs;
+
+	/* Cached vport/steering config for debugfs */
+	u32 vport_max_sq;
+	u32 vport_max_rq;
+	u32 steer_rx;
+	u32 steer_rss;
+	u32 steer_update_tab;
+	u32 steer_cqe_coalescing;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
-- 
2.34.1



* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Mark Rutland @ 2026-05-06  7:52 UTC (permalink / raw)
  To: Naman Jain
  Cc: Marc Zyngier, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Arnd Bergmann, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Michael Kelley, Timothy Hayes,
	Lorenzo Pieralisi, Sascha Bischoff, mrigendrachaubey,
	linux-hyperv, linux-arm-kernel, linux-kernel, linux-arch,
	linux-riscv, vdso, ssengar
In-Reply-To: <f4059f5d-a82b-40c2-942e-3e24cefab94f@linux.microsoft.com>

On Wed, Apr 29, 2026 at 03:26:11PM +0530, Naman Jain wrote:
> On 4/23/2026 7:26 PM, Mark Rutland wrote:
> > On Thu, Apr 23, 2026 at 12:41:57PM +0000, Naman Jain wrote:

[ non-SMCCC hypercall code omitted for brevity ]

> > NAK to this.
> > 
> > * This is a non-SMCCC hypercall, which we have NAK'd in general in the
> >    past for various reasons that I am not going to rehash here.
> > 
> > * It's not clear how this is going to be extended with necessary
> >    architecture state in future (e.g. SVE, SME). This is not
> >    future-proof, and I don't believe this is maintainable.
> > 
> > * This breaks general requirements for reliable stacktracing by
> >    clobbering state (e.g. x29) that we depend upon being valid AT ALL
> >    TIMES outside of entry code.
> > 
> > * IMO, if this needs to be saved/restored, that should happen in
> >    whatever you are calling.
> > 
> > Mark.
> 
> Merging threads for addressing comments from Mark Rutland and Marc Zyngier
> on this patch.
> 
> Thanks for reviewing the changes. Please allow me to briefly explain the use
> case here and then address your comments.
> 
> Hyper-V's Virtual Trust Levels (VTLs) provide hardware-enforced isolation
> within a single VM, analogous to ARM TrustZone. The kernel runs in VTL2
> (higher privilege) as a "paravisor", a security monitor that handles
> intercepts for the primary OS in VTL0 (lower privilege). The VTL switch
> (mshv_vtl_return_call) is functionally equivalent to KVM's guest enter/exit.

It's worth noting that for KVM, the KVM hyp code is *tightly* coupled
with the host kernel (they are one single binary object), and the
calling convention between the two is an implementation detail that can
change at any time without any ABI concerns.

While I appreciate this might be trying to do the same thing from a
*functional* perspective, it's certainly different from a
maintainability perspective, and can't be treated in the same way.

> It saves VTL2 state, loads VTL0's GPRs and other registers from a shared context
> structure, issues hvc #3 to let VTL0 run, and on return saves VTL0's updated
> state back.
> 
> Coming to the problems with the code, I have identified a few ways to
> address them.
> 
> I can put the assembly code in a separate .S file with
> SYM_FUNC_START/SYM_FUNC_END and marked as noinstr, to prevent ftrace/kprobes
> from instrumenting between the GPR load and the hvc, which could have
> corrupted VTL0 register state. This should solve x29 clobbering, stack
> tracing problems.

My point was that you must not clobber those registers.

Looking at the TLFS document you linked below, it says:

| Note: X29 (FP/frame pointer), X30 (LR/link register), and SP are private
| per-VTL

... so clobbering those doesn't seem to be necessary anyway. Clearly
having an arbitrary calling convention is confusing for everyone.

> I should use kernel_neon_begin()/kernel_neon_end() to save/restore the full
> extended FP state of the current task in VTL2. VTL0's Q0-Q31 can be
> loaded/saved separately via fpsimd_load_state()/fpsimd_save_state(). This
> way, the assembly touches none of the SIMD registers. This is SVE/SME-safe
> for VTL2's task state. VTL0 still only carries Q0-Q31 in the context struct,
> and extending to SVE, SME is a future context struct change, which will need
> Hyper-V arm64 ABI support.
> This way, VTL2's callee-saved regs (x19-x28, x29, x30) are explicitly saved
> to the stack frame at the top and restored at the bottom of assembly code.
> The C caller (in hv_vtl.c) is a clean function call.

That doesn't really address my concerns here.

I do not think that Linux should have to save/restore anything here;
that should be the job of the real hypervisor. The arbitrary separation
of PE state into private and shared (with shared state being directly
exposed to Linux) is a problem for maintainability and forward
compatibility.

Looking at the TLFS document you linked below, I see:

| Note: SVE state (Z0-Z31, P0-P15, FFR) and SME state are VTL-private.
| The lower 128-bit portion (Q registers) is shared, but the upper bits
| of Z registers may be corrupted on VTL transitions. Software should
| not rely on Z register contents being preserved across VTL switches.

... which is certainly going to be a pain to manage.

Note in particular "SME state" is not an architectural term. I don't
know which state in particular that is intended to cover (e.g. ZA, ZT0,
SVCR, all streaming mode state)?

There's no mention of SVCR, so I don't know how this is going to
interact with management of ZA state (ZA and ZT0, which are dependent
upon SVCR.ZA) or streaming mode (dependent upon SVCR.SM). That state has
been *incredibly* painful for us to manage generally. Regardless of the
SMCCC concerns, that needs to be specified better.

> Regarding Non-SMCCC "hvc #3" call, I have a limitation here owing to the ABI
> that is defined by the Hyper-V hypervisor. Fixing this requires a
> hypervisor-side change to support SMCCC-style dispatch for VTL return. Until
> then, hvc #3 is the only working interface. Moreover there would be backward
> compatibility issues with this new ABI interface, if at all it is added.

To be clear, that's Microsoft's problem, not the Linux kernel
community's problem. My NAK still stands.

Multiple years ago now, we made it clear that we would not accept a
non-SMCCC calling convention. Ignoring the substance of that feedback,
and inventing a new calling convention after that point is a
self-inflicted problem.

[...]

> Link to TLFS: https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#on-arm64-platforms-3

For shared state, aside from GPRs and FPSIMD/SVE/SME state, that says:

| * System Information Registers (read-only or non-security-critical):
|   * System identification and feature registers
|   * Cache and TLB type information

It's *implied* that some of those registers might be writable, but as
the specific set of registers is not described I cannot tell. Are there
any writable system registers which are shared?

I don't see how we can know which registers we might need to
save/restore without that being explicitly documented.

I also see:

| Note: SPE (Statistical Profiling Extension) state is shared across VTLs,
| except for PMBSR_EL1 which is VTL-private.

If "SPE state" includes PMBPTR or PMBLIMITR (which is the obvious
reading), this would be a security problem, as a lower-privileged VTL
could clobber those and cause SPE to write to arbitrary memory
immediately upon return to the higher-privileged VTL. Having PMBSR be
private on its own isn't sufficient to prevent that (e.g. since the
higher-privileged VTL could have its own active SPE profiling session).

I'm not keen on requiring hyper-v specific hooks in the SPE driver to
achieve that, and I'm also not keen on having hyper-v support code poke
SPE registers behind the SPE driver's back.

This does not give me confidence that any future PE state (e.g. things
like TRBE) will be managed in a safe way either.

Mark.

* Re: [PATCH V2 09/11] x86/hyperv: Implement hyperv virtual IOMMU
From: Souradeep Chakrabarti @ 2026-05-06 10:07 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260501004157.3108202-10-mrathor@linux.microsoft.com>

On Thu, Apr 30, 2026 at 05:41:55PM -0700, Mukesh R wrote:
> Add a new file to implement management of device domains, mapping and
> unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
> framework for PCI passthru on Hyper-V running Linux as baremetal root
> or L1VH root. This also implements direct attach mechanism (see below),
> a special feature of Hyper-V for PCI passthru, and it is also made to
> work within the VFIO framework.
> 
> At a high level, during boot the hypervisor creates a default identity
> domain and attaches all devices to it. This nicely maps to Linux IOMMU
> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> during boot. As mentioned previously, Hyper-V supports two ways to do
> PCI passthru:
> 
>   1. Device Domain (aka Domain Attach): root must create a device domain
>      in the hypervisor, and do map/unmap hypercalls for mapping and
>      unmapping guest RAM for DMA. All hypervisor communications use
>      device ID of type PCI for identifying and referencing the device.
> 
>   2. Direct Attach: the hypervisor will simply use the guest's HW
>      page table for mappings, thus the root need not map/unmap guest
>      memory for DMA. As such, direct attach passthru setup during guest
>      boot is extremely fast. A direct attached device must always be
>      referenced via logical device ID and not via the PCI device ID.
> 
> At present, L1VH root only supports direct attaches. Also direct attach is
> default in non-L1VH cases because there are some significant performance
> issues with domain attach implementations currently for guests with higher
> RAM (say more than 8GB), and that unfortunately cannot be addressed in
> the short term.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                       |   1 +
>  arch/x86/kernel/pci-dma.c         |   2 +
>  drivers/iommu/Kconfig             |   5 +-
>  drivers/iommu/Makefile            |   1 +
>  drivers/iommu/hyperv-iommu-root.c | 908 ++++++++++++++++++++++++++++++
>  include/asm-generic/mshyperv.h    |  17 +
>  include/linux/hyperv.h            |   6 +
>  7 files changed, 937 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/iommu/hyperv-iommu-root.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f803a6a38fee..8ae040b89a56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11914,6 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>  F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/input/serio/hyperv-keyboard.c
> +F:	drivers/iommu/hyperv-iommu-root.c
>  F:	drivers/iommu/hyperv-irq.c
>  F:	drivers/net/ethernet/microsoft/
>  F:	drivers/net/hyperv/
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 6267363e0189..cfeee6505e17 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -8,6 +8,7 @@
>  #include <linux/gfp.h>
>  #include <linux/pci.h>
>  #include <linux/amd-iommu.h>
> +#include <linux/hyperv.h>
>  
>  #include <asm/proto.h>
>  #include <asm/dma.h>
> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
>  	gart_iommu_hole_init();
>  	amd_iommu_detect();
>  	detect_intel_iommu();
> +	hv_iommu_detect();
>  	swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
>  }
>  
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index f86262b11416..7909cf4373a6 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -352,13 +352,12 @@ config MTK_IOMMU_V1
>  	  if unsure, say N here.
>  
>  config HYPERV_IOMMU
> -	bool "Hyper-V IRQ Handling"
> +	bool "Hyper-V IOMMU Unit"
>  	depends on HYPERV && X86
>  	select IOMMU_API
>  	default HYPERV
>  	help
> -	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
> -	  guest and root partitions.
> +	  Hyper-V pseudo IOMMU unit.
>  
>  config VIRTIO_IOMMU
>  	tristate "Virtio IOMMU driver"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 335ea77cced6..296fbc6ca829 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_HYPERV) += hyperv-irq.o
> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu-root.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> new file mode 100644
> index 000000000000..739bbf39dea2
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu-root.c
> @@ -0,0 +1,908 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V root vIOMMU driver.
> + * Copyright (C) 2026, Microsoft, Inc.
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/interval_tree.h>
> +#include <linux/hyperv.h>
> +#include "dma-iommu.h"
> +#include <asm/iommu.h>
> +#include <asm/mshyperv.h>
> +
> +/* We will not claim these PCI devices, eg hypervisor needs it for debugger */
> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}
> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	 /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> +	hv_no_attdev = true;
> +	return 0;
> +}
> +__setup("hv_no_attdev", setup_hv_no_attdev);
> +
> +/* Iommu device that we export to the world. HyperV supports max of one */
> +static struct iommu_device hv_virt_iommu;
> +
> +struct hv_domain {
> +	struct iommu_domain iommu_dom;
> +	u32 domid_num;			      /* as opposed to domain_id.type */
> +	bool attached_dom;		      /* is this direct attached dom? */
> +	u64 partid;			      /* partition id */
> +	spinlock_t mappings_lock;	      /* protects mappings_tree */
> +	struct rb_root_cached mappings_tree;  /* iova to pa lookup tree */
> +};
> +
> +#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
> +
> +struct hv_iommu_mapping {
> +	phys_addr_t paddr;
> +	struct interval_tree_node iova;
> +	u32 flags;
> +};
> +
> +/*
> + * By default, during boot the hypervisor creates one Stage 2 (S2) default
> + * domain. Stage 2 means that the page table is controlled by the hypervisor.
> + *   S2 default: access to entire root partition memory. This for us easily
> + *		 maps to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
> + *		 is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
> + *
> + * Device Management:
> + *   There are two ways to manage device attaches to domains:
> + *     1. Domain Attach: A device domain is created in the hypervisor, the
> + *			 device is attached to this domain, and then memory
> + *			 ranges are mapped in the map callbacks.
> + *     2. Direct Attach: No need to create a domain in the hypervisor for direct
> + *			 attached devices. A hypercall is made to tell the
> + *			 hypervisor to attach the device to a guest. There is
> + *			 no need for explicit memory mappings because the
> + *			 hypervisor will just use the guest HW page table.
> + *
> + * Since a direct attach is much faster, it is the default. This can be
> + * changed via hv_no_attdev.
> + *
> + * L1VH: hypervisor only supports direct attach.
> + */
> +
> +/*
> + * Create dummy domains to correspond to hypervisor prebuilt default identity
> + * and null domains (dummy because we do not make hypercalls to create them).
> + */
> +static struct hv_domain hv_def_identity_dom;
> +static struct hv_domain hv_null_dom;
> +
> +static bool hv_special_domain(struct hv_domain *hvdom)
> +{
> +	return hvdom == &hv_def_identity_dom || hvdom == &hv_null_dom;
> +}
> +
> +struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
> +	.aperture_start = 0,
> +	.aperture_end = -1UL,
> +	.force_aperture = true,
> +};
> +
> +/*
> + * Since the relevant hypercalls can only fit less than 512 PFNs in the pfn
> + * array, report 1M max.
> + */
> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
> +
> +static u32 unique_id;	      /* unique numeric id of a new domain */
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> +				struct device *dev);
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather);
> +
> +/*
> + * If the current thread is a VMM thread, return the partition id of the VM it
> + * is managing, else return HV_PARTITION_ID_INVALID.
> + */
> +u64 hv_get_current_partid(void)
> +{
> +	u64 (*fn)(void);
> +	u64 ptid;
> +
> +	fn = symbol_get(mshv_current_partid);
> +	if (!fn)
> +		return HV_PARTITION_ID_INVALID;
> +
> +	ptid = fn();
> +	symbol_put(mshv_current_partid);
> +
> +	return ptid;
> +}
> +EXPORT_SYMBOL_GPL(hv_get_current_partid);
> +
> +/* If this is a VMM thread, then this domain is for a guest vm */
> +static bool hv_curr_thread_is_vmm(void)
> +{
> +	return hv_get_current_partid() != HV_PARTITION_ID_INVALID;
> +}
> +
> +/* As opposed to some host app like SPDK etc... */
> +static bool hv_dom_owner_is_vmm(struct hv_domain *hvdom)
> +{
> +	return hvdom && hvdom->partid != HV_PARTITION_ID_INVALID;
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +/*
> + * Check if given pci device is a direct attached device. Caller must have
> + * verified pdev is a valid pci device.
> + */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{
> +	struct iommu_domain *iommu_domain;
> +	struct hv_domain *hvdom;
> +	struct device *dev = &pdev->dev;
> +
> +	iommu_domain = iommu_get_domain_for_dev(dev);
> +	if (iommu_domain) {
> +		hvdom = to_hv_domain(iommu_domain);
> +		return hvdom->attached_dom;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
> +
> +bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> +
> +	if (hvdom && !hv_special_domain(hvdom))
> +		return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_pcidev_is_pthru_dev);
> +
> +/* Build device id for direct attached devices */
> +static u64 hv_build_devid_type_logical(struct pci_dev *pdev)
> +{
> +	hv_pci_segment segment;
> +	union hv_device_id hv_devid;
> +	union hv_pci_bdf bdf = {.as_uint16 = 0};
> +	u32 rid = PCI_DEVID(pdev->bus->number, pdev->devfn);
> +
> +	segment = pci_domain_nr(pdev->bus);
> +	bdf.bus = PCI_BUS_NUM(rid);
> +	bdf.device = PCI_SLOT(rid);
> +	bdf.function = PCI_FUNC(rid);
> +
> +	hv_devid.as_uint64 = 0;
> +	hv_devid.device_type = HV_DEVICE_TYPE_LOGICAL;
> +	hv_devid.logical.id = (u64)segment << 16 | bdf.as_uint16;
> +
> +	return hv_devid.as_uint64;
> +}
> +
> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
> +{
> +	if (type == HV_DEVICE_TYPE_LOGICAL) {
> +		if (hv_l1vh_partition())
> +			return hv_pci_vmbus_device_id(pdev);
> +		else
> +			return hv_build_devid_type_logical(pdev);
> +	} else if (type == HV_DEVICE_TYPE_PCI)
> +#ifdef CONFIG_X86
> +		return hv_build_devid_type_pci(pdev);
> +#else
> +		return 0;
> +#endif
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
> +
> +/* Create a new device domain in the hypervisor */
> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> +{
> +	u64 status;
> +	struct hv_input_device_domain *ddp;
> +	struct hv_input_create_device_domain *input;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	ddp = &input->device_domain;
> +	ddp->partition_id = HV_PARTITION_ID_SELF;
> +	ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	ddp->domain_id.id = hvdom->domid_num;
> +
> +	input->create_device_domain_flags.forward_progress_required = 1;
> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> +
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	struct hv_domain *hvdom;
> +	int rc;
> +
> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm()) {
> +		pr_err("Hyper-V: l1vh iommu does not support host devices\n");
> +		return NULL;
> +	}
> +
> +	hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> +	if (hvdom == NULL)
> +		return NULL;
> +
> +	spin_lock_init(&hvdom->mappings_lock);
> +	hvdom->mappings_tree = RB_ROOT_CACHED;
> +
> +	/* Called under iommu group mutex, so single threaded */
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT)   /* ie, 0 */
> +		goto out_err;
> +
> +	hvdom->domid_num = unique_id;
> +	hvdom->partid = hv_get_current_partid();
> +	hvdom->iommu_dom.geometry = default_geometry;
> +	hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> +
> +	/* For guests, by default we do direct attaches, so no domain in hyp */
> +	if (hv_dom_owner_is_vmm(hvdom) && !hv_no_attdev)
> +		hvdom->attached_dom = true;
> +	else {
> +		rc = hv_iommu_create_hyp_devdom(hvdom);
> +		if (rc)
> +			goto out_err;
> +	}
> +
> +	return &hvdom->iommu_dom;
> +
> +out_err:
> +	unique_id--;
> +	kfree(hvdom);
> +	return NULL;
> +}
> +
> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (hv_special_domain(hvdom))
> +		return;
> +
> +	if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {
> +		struct hv_input_device_domain *ddp;
> +
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		ddp = &input->device_domain;
> +		memset(input, 0, sizeof(*input));
> +
> +		ddp->partition_id = HV_PARTITION_ID_SELF;
> +		ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +		ddp->domain_id.id = hvdom->domid_num;
> +
> +		status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> +					 NULL);
> +		local_irq_restore(flags);
> +
> +		if (!hv_result_success(status))
> +			hv_status_err(status, "\n");
> +	}
> +
> +	kfree(hvdom);
> +}
> +
> +/* Attach a device to a domain previously created in the hypervisor */
> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
> +{
> +	unsigned long flags;
> +	u64 status;
> +	enum hv_device_type dev_type;
> +	struct hv_input_attach_device_domain *input;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +
> +	/* NB: Upon guest shutdown, device is re-attached to the default domain
> +	 *     without explicit detach.
> +	 */
> +	if (hv_l1vh_partition())
> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> +	else
> +		dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* Caller must have validated that dev is a valid pci dev */
> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev, u64 ptid)
> +{
> +	struct hv_input_attach_device *input;
> +	u64 status;
> +	int rc;
> +	unsigned long flags;
> +	union hv_device_id host_devid;
> +	enum hv_device_type dev_type;
> +
> +	if (ptid == HV_PARTITION_ID_INVALID) {
> +		pr_err("Hyper-V: Invalid partition id in direct attach\n");
> +		return -EINVAL;
> +	}
> +
> +	if (hv_l1vh_partition())
> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> +	else
> +		dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		memset(input, 0, sizeof(*input));
> +		input->partition_id = ptid;
> +		input->device_id = host_devid;
> +
> +		/* Hypervisor associates logical_id with this device, and in
> +		 * some hypercalls like retarget interrupts, logical_id must be
> +		 * used instead of the BDF. It is a required parameter.
> +		 */
> +		input->attdev_flags.logical_id = 1;
> +		input->logical_devid =
> +			   hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
> +		local_irq_restore(flags);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> +			if (rc)
> +				break;
> +		}
> +	} while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
This can become an infinite loop if the hypervisor keeps returning
HV_STATUS_INSUFFICIENT_MEMORY even though the page deposits succeed.
We could add a maximum retry count here.
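
A rough user-space sketch of what a bounded retry could look like. It is
only a model: the fake_hv type, the fake_* helpers, the status values and
ATTACH_MAX_RETRIES are all made up for illustration and are not taken
from the patch or from any Hyper-V header.

```c
#include <stdint.h>

/* Stand-in status codes; the values here are illustrative only. */
#define STATUS_SUCCESS             0u
#define STATUS_INSUFFICIENT_MEMORY 11u

/* Hypothetical cap on deposit-and-retry rounds; not from the patch. */
#define ATTACH_MAX_RETRIES 8

/* Fake hypervisor: reports insufficient memory until enough pages arrive. */
struct fake_hv {
	unsigned int pages_needed;
	unsigned int pages_deposited;
	unsigned int attach_calls;
};

static uint64_t fake_attach(struct fake_hv *hv)
{
	hv->attach_calls++;
	if (hv->pages_deposited < hv->pages_needed)
		return STATUS_INSUFFICIENT_MEMORY;
	return STATUS_SUCCESS;
}

static int fake_deposit(struct fake_hv *hv, unsigned int n)
{
	hv->pages_deposited += n;
	return 0;
}

/* Retry the attach after each deposit, but give up after a fixed number
 * of rounds instead of looping for as long as the status persists.
 */
static uint64_t attach_with_bounded_retry(struct fake_hv *hv)
{
	uint64_t status;
	int retries = 0;

	for (;;) {
		status = fake_attach(hv);
		if (status != STATUS_INSUFFICIENT_MEMORY)
			break;
		if (retries++ >= ATTACH_MAX_RETRIES)
			break;
		if (fake_deposit(hv, 1))
			break;
	}
	return status;
}
```

With a cap like this the caller still sees the last hypercall status, so
the existing error reporting after the loop would keep working.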
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* Attach a device for passthru to guest VMs, host apps like SPDK, etc */
> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
> +			       struct iommu_domain *old)
> +{
> +	struct pci_dev *pdev;
> +	int rc;
> +	struct hv_domain *hvdom_new = to_hv_domain(immdom);
> +	struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
> +	    !hvdom_new->attached_dom)
> +		return -EINVAL;
> +
> +	/* VFIO does not do explicit detach calls, hence check first if we need
> +	 * to detach first. Also, in case of guest shutdown, it's the VMM
> +	 * thread that attaches it back to the hv_def_identity_dom, and
> +	 * hvdom_prev will not be null then. It is null during boot.
> +	 */
> +	if (hvdom_prev)
> +		if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
> +			hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
> +
> +	if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets "private" field */
> +		return 0;
> +	}
> +
> +	if (hvdom_new->attached_dom)
> +		rc = hv_iommu_direct_attach_device(pdev, hvdom_new->partid);
> +	else
> +		rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
This does a destructive detach before a failable attach, with no
rollback:
1. If hvdom_prev exists, issue HVCALL_DETACH_DEVICE_DOMAIN /
   HVCALL_DETACH_DEVICE against the old hypervisor domain.
2. Then issue the attach hypercall against the new domain.
3. Only on success, update dev_iommu_priv.

If step 2 fails, the device is left detached in the hypervisor while
dev_iommu_priv still points at the old domain, and the IOMMU core's
recovery does not help.
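
One possible shape for a rollback, as a toy user-space model. The
toy_domain / toy_device types and the toy_* helpers are hypothetical
and only mirror the two pieces of state discussed above (the
hypervisor's view and dev_iommu_priv); this is a sketch of the idea,
not a proposed implementation.

```c
#include <stddef.h>

/* Toy model of the state involved; none of these types are from the patch. */
struct toy_domain {
	int id;
	int attach_fails;	/* simulate a failing attach hypercall */
};

struct toy_device {
	const struct toy_domain *hyp_dom;	/* what the hypervisor thinks */
	const struct toy_domain *priv;		/* what dev_iommu_priv would hold */
};

static void toy_detach(struct toy_device *dev)
{
	dev->hyp_dom = NULL;
}

static int toy_attach(struct toy_device *dev, const struct toy_domain *dom)
{
	if (dom->attach_fails)
		return -1;
	dev->hyp_dom = dom;
	return 0;
}

/*
 * Switch domains; if attaching to the new domain fails, try to put the
 * device back on the previous one so both views stay in agreement.
 */
static int toy_switch_domain(struct toy_device *dev,
			     const struct toy_domain *new_dom)
{
	const struct toy_domain *prev = dev->priv;
	int rc;

	if (prev)
		toy_detach(dev);

	rc = toy_attach(dev, new_dom);
	if (rc) {
		/* Roll back; if even that fails, record that nothing is attached. */
		if (prev && toy_attach(dev, prev))
			dev->priv = NULL;
		return rc;
	}

	dev->priv = new_dom;
	return 0;
}
```

Either way, the point is that the driver's private pointer and the
hypervisor's attachment state should never be allowed to disagree after
an error return.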
> +
> +	if (rc == 0)
> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets "private" field */
> +
> +	return rc;
> +}
> +
> +static void hv_iommu_det_dev_from_guest(struct pci_dev *pdev, u64 ptid)
> +{
> +	struct hv_input_detach_device *input;
> +	u64 status, log_devid;
> +	unsigned long flags;
> +
> +	log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = ptid;
> +	input->logical_devid = log_devid;
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_det_dev_from_dom(struct pci_dev *pdev)
> +{
> +	u64 status, devid;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +
> +	devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	input->device_id.as_uint64 = devid;
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> +{
> +	struct pci_dev *pdev;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hvdom->attached_dom)
> +		hv_iommu_det_dev_from_guest(pdev, hvdom->partid);
> +
> +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
> +		 * next.
> +		 */
> +	else
> +		hv_iommu_det_dev_from_dom(pdev);
> +}
> +
> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t paddr,
> +				     size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> +	if (!mapping)
> +		return -ENOMEM;
> +
> +	mapping->paddr = paddr;
> +	mapping->iova.start = iova;
> +	mapping->iova.last = iova + size - 1;
> +	mapping->flags = flags;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> +	interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> +
> +	return 0;
> +}
> +
> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> +		if (mapping->iova.start < iova)
> +			break;
> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &hvdom->mappings_tree);
> +		kfree(mapping);
> +	}
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> +	return unmapped;
> +}
> +
> +/* Return: must return exact status from the hypercall without changes */
> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> +			    unsigned long iova, phys_addr_t paddr,
> +			    unsigned long npages, u32 map_flags)
> +{
> +	u64 status;
> +	int i;
> +	struct hv_input_map_device_gpa_pages *input;
> +	unsigned long flags, pfn;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->map_flags = map_flags;
> +	input->target_device_va_base = iova;
> +
> +	pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +	for (i = 0; i < npages; i++, pfn++)
> +		input->gpa_page_list[i] = pfn;
There is no bounds check on npages before gpa_page_list is filled here.
The pfn array for the rep hypercall is limited to 512 PFNs (fewer, per
the comment above HV_IOMMU_PGSIZES).
> +
> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> +				     input, NULL);
npages can get truncated here, as rep_count is u16.
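
Both issues could be handled by clamping each call to a fixed batch
size. A minimal user-space sketch follows; MAX_PFNS_PER_CALL, fake_hyp
and fake_map_batch are placeholders invented for illustration, and the
real limit would have to be derived from the size of the hypercall
input page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-call limit; the real bound depends on the input layout. */
#define MAX_PFNS_PER_CALL 256u

_Static_assert(MAX_PFNS_PER_CALL <= UINT16_MAX,
	       "batch size must fit in a u16 rep_count");

/* Stand-in for one rep hypercall: records what it was asked to do. */
struct fake_hyp {
	unsigned long total_mapped;
	unsigned int calls;
	uint16_t largest_rep_count;
};

static uint16_t fake_map_batch(struct fake_hyp *hyp, uint16_t rep_count)
{
	hyp->calls++;
	hyp->total_mapped += rep_count;
	if (rep_count > hyp->largest_rep_count)
		hyp->largest_rep_count = rep_count;
	return rep_count;	/* pretend every rep completed */
}

/* Map npages by issuing batches that fit both the pfn array and a u16. */
static unsigned long map_in_batches(struct fake_hyp *hyp, unsigned long npages)
{
	unsigned long done = 0;

	while (done < npages) {
		unsigned long remain = npages - done;
		uint16_t batch = remain > MAX_PFNS_PER_CALL ?
				 MAX_PFNS_PER_CALL : (uint16_t)remain;

		done += fake_map_batch(hyp, batch);
	}
	return done;
}
```

The caller's existing loop over "done" would then only need the per-call
count clamped before it is handed to the hypercall wrapper.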
> +
> +	local_irq_restore(flags);
> +	return status;
> +}
> +
> +/*
> + * The core VFIO code loops over memory ranges calling this function with
> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
> + */
> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> +	u32 map_flags;
> +	int ret;
> +	u64 status;
> +	unsigned long npages, done = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t size = pgsize * pgcount;
> +
> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (hvdom->attached_dom) {
> +		*mapped = size;
> +		return 0;
> +	}
> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +	while (done < npages) {
> +		ulong completed, remain = npages - done;
> +
> +		status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> +					  map_flags);
> +
> +		completed = hv_repcomp(status);
> +		done = done + completed;
> +		iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> +		paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    hv_current_partition_id,
> +						    256);
> +			if (ret)
> +				break;
> +			continue;
> +		}
> +		if (!hv_result_success(status))
> +			break;
> +	}
> +
> +	if (!hv_result_success(status)) {
> +		size_t done_size = done << HV_HYP_PAGE_SHIFT;
> +
> +		hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> +			      done, npages, iova);
> +		/*
> +		 * lookup tree has all mappings [0 - size-1]. Below unmap will
> +		 * only remove from [0 - done], we need to remove second chunk
> +		 * [done+1 - size-1].
> +		 */
> +		hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> +		hv_iommu_unmap_pages(immdom, iova - done_size, HV_HYP_PAGE_SIZE,
> +				     done, NULL);
> +		if (mapped)
> +			*mapped = 0;
> +	} else
> +		if (mapped)
> +			*mapped = size;
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t unmapped, size = pgsize * pgcount;
> +
> +	unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> +	if (unmapped < size)
> +		pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> +		       __func__, iova, unmapped, size);
> +
> +	if (hvdom->attached_dom)
> +		return size;
> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +				     0, input, NULL);
npages can get truncated here as well, since rep_count is u16.
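
To put numbers on it, here is a small user-space illustration of the
arithmetic. It assumes the core may call unmap with pgsize = 1M and a
pgcount of 256 or more; the helper names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12

/* Number of 4K pages covered by pgcount pages of pgsize bytes each. */
static unsigned long npages_for(size_t pgsize, size_t pgcount)
{
	return (unsigned long)((pgsize * pgcount) >> PAGE_SHIFT_4K);
}

/* What a u16 rep_count parameter would actually receive. */
static uint16_t as_rep_count(unsigned long npages)
{
	return (uint16_t)npages;
}
```

For 256 pages of 1M, npages_for() gives 65536 and as_rep_count() turns
that into 0, so the hypercall would be asked to unmap nothing while the
function still reports the full size as unmapped.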
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return unmapped;
> +}
> +
> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> +					 dma_addr_t iova)
> +{
> +	unsigned long flags;
> +	struct hv_iommu_mapping *mapping;
> +	struct interval_tree_node *node;
> +	u64 paddr = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> +	if (node) {
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		paddr = mapping->paddr + (iova - mapping->iova.start);
> +	}
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> +	return paddr;
> +}
> +
> +/*
> + * Currently, hypervisor does not provide list of devices it is using
> + * dynamically. So use this to allow users to manually specify devices that
> + * should be skipped. (eg. hypervisor debugger using some network device).
> + */
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int rc, pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func, &parsed);
> +			if (rc)
> +				break;
> +			if (parsed <= 0)
> +				break;
> +
> +			if (pci_domain_nr(pdev->bus) == segment &&
> +			    pdev->bus->number == bus &&
> +			    PCI_SLOT(pdev->devfn) == slot &&
> +			    PCI_FUNC(pdev->devfn) == func) {
> +
> +				dev_info(dev, "skipped by Hyper-V IOMMU\n");
> +				return ERR_PTR(-ENODEV);
> +			}
> +			pos += parsed;
> +
> +		} while (pci_devs_to_skip[pos]);
> +	}
> +
> +	/* Device will be explicitly attached to the default domain, so no need
> +	 * to do dev_iommu_priv_set() here.
> +	 */
> +
> +	return &hv_virt_iommu;
> +}
> +
> +static void hv_iommu_probe_finalize(struct device *dev)
> +{
> +	struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> +
> +	if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> +		iommu_setup_dma_ops(dev, immdom);
> +	else
> +		set_dma_ops(dev, NULL);
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> +
> +	/* Need to detach device from device domain if necessary. */
> +	if (hvdom)
> +		hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
> +}
> +
> +static int hv_iommu_def_domain_type(struct device *dev)
> +{
> +	/* The hypervisor always creates this by default during boot */
> +	return IOMMU_DOMAIN_IDENTITY;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +	.capable	    = hv_iommu_capable,
> +	.domain_alloc_paging	= hv_iommu_domain_alloc_paging,
> +	.probe_device	    = hv_iommu_probe_device,
> +	.probe_finalize     = hv_iommu_probe_finalize,
> +	.release_device     = hv_iommu_release_device,
> +	.def_domain_type    = hv_iommu_def_domain_type,
> +	.device_group	    = hv_iommu_device_group,
> +	.default_domain_ops = &(const struct iommu_domain_ops) {
> +		.attach_dev   = hv_iommu_attach_dev,
> +		.map_pages    = hv_iommu_map_pages,
> +		.unmap_pages  = hv_iommu_unmap_pages,
> +		.iova_to_phys = hv_iommu_iova_to_phys,
> +		.free	      = hv_iommu_domain_free,
> +	},
> +	.owner		    = THIS_MODULE,
> +	.identity_domain = &hv_def_identity_dom.iommu_dom,
> +	.blocked_domain  = &hv_null_dom.iommu_dom,
> +};
> +
> +static const struct iommu_domain_ops hv_special_domain_ops = {
> +	.attach_dev = hv_iommu_attach_dev,
> +};
> +
> +static void __init hv_initialize_special_domains(void)
> +{
> +	hv_def_identity_dom.iommu_dom.type = IOMMU_DOMAIN_IDENTITY;
> +	hv_def_identity_dom.iommu_dom.ops = &hv_special_domain_ops;
> +	hv_def_identity_dom.iommu_dom.owner = &hv_iommu_ops;
> +	hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> +	hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> +
> +	hv_null_dom.iommu_dom.type = IOMMU_DOMAIN_BLOCKED;
> +	hv_null_dom.iommu_dom.ops = &hv_special_domain_ops;
> +	hv_null_dom.iommu_dom.owner = &hv_iommu_ops;
> +	hv_null_dom.iommu_dom.geometry = default_geometry;
> +	hv_null_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_NULL;  /* INTMAX */
> +}
> +
> +static int __init hv_iommu_init(void)
> +{
> +	int ret;
> +	struct iommu_device *iommup = &hv_virt_iommu;
> +
> +	if (!hv_is_hyperv_initialized())
> +		return -ENODEV;
> +
> +	ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> +	if (ret) {
> +		pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> +		return ret;
> +	}
> +
> +	/* This must come before iommu_device_register because the latter calls
> +	 * into the hooks.
> +	 */
> +	hv_initialize_special_domains();
> +
> +	ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> +	if (ret) {
> +		pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	pr_info("Hyper-V IOMMU initialized\n");
> +
> +	return 0;
> +
> +err_sysfs_remove:
> +	iommu_device_sysfs_remove(iommup);
> +	return ret;
> +}
> +
> +void __init hv_iommu_detect(void)
> +{
> +	if (no_iommu || iommu_detected)
> +		return;
> +
> +	/* For l1vh, always expose an iommu unit */
> +	if (!hv_l1vh_partition())
> +		if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
> +			return;
> +
> +	iommu_detected = 1;
> +	x86_init.iommu.iommu_init = hv_iommu_init;
> +
> +	pci_request_acs();
> +}
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index a6878ab685e7..fca5ed68b5c2 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -337,6 +337,23 @@ static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
>  { return 0; }
>  #endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
>  
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_get_current_partid(void);
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
> +bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev);
> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
> +#else
> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{ return false; }
> +static inline bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev)
> +{ return false; }
> +static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
> +					enum hv_device_type type)
> +{ return 0; }
> +static inline u64 hv_get_current_partid(void)
> +{ return HV_PARTITION_ID_INVALID; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> +
>  #if IS_ENABLED(CONFIG_MSHV_ROOT)
>  static inline bool hv_root_partition(void)
>  {
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 5459e776ec17..6eee1cbf6f23 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1769,4 +1769,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
>  #define HVPFN_DOWN(x)	((x) >> HV_HYP_PAGE_SHIFT)
>  #define page_to_hvpfn(page)	(page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
>  
> +#ifdef CONFIG_HYPERV_IOMMU
> +void __init hv_iommu_detect(void);
> +#else
> +static inline void hv_iommu_detect(void) { }
> +#endif /* CONFIG_HYPERV_IOMMU */
> +
>  #endif /* _HYPERV_H */
> -- 
> 2.51.2.vfs.0.1
> 
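
For reference, the skip-list handling in hv_iommu_probe_device() quoted
above walks a string of "(segment:bus:slot.func)" entries using sscanf()
with %n. Below is a minimal user-space sketch of that technique, assuming
the intent is to stop at the first entry that does not parse.
bdf_in_skip_list() and its parameters are hypothetical names and are not
part of the patch. sscanf() returns the number of converted fields, and %n
is not counted, so a fully parsed entry yields 4.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: return true if seg:bus:slot.func appears in a
 * list such as "(0000:01:00.0) (0001:02:1f.7)". All fields are hex.
 */
static bool bdf_in_skip_list(const char *list, unsigned int seg,
			     unsigned int bus, unsigned int slot,
			     unsigned int func)
{
	int pos = 0;

	while (list && list[pos]) {
		unsigned int s, b, d, f;
		int parsed = 0;

		/* Four conversions expected; %n stores the consumed length. */
		if (sscanf(list + pos, " (%x:%x:%x.%x) %n",
			   &s, &b, &d, &f, &parsed) != 4)
			break;
		/* parsed stays 0 if the closing ')' was missing. */
		if (parsed <= 0)
			break;

		if (s == seg && b == bus && d == slot && f == func)
			return true;

		pos += parsed;
	}

	return false;
}
```

The %x conversions take unsigned int pointers, so the sketch declares the
parsed fields as unsigned int rather than int.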


* [PATCH v3] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Anirudh Rayabharam (Microsoft) @ 2026-05-06 13:44 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv, linux-kernel, Anirudh Rayabharam (Microsoft)

The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.

Update mshv_chunk_stride() to:

  - Accept 2M-aligned tail pages of a larger folio. The previous
    PageHead() check rejected every page after the head of a 1G
    hugepage and fell back to 4K mappings for the remaining 1022 MB.
    Replace it with a PFN alignment check so any 2M-aligned page of a
    sufficiently large folio is acceptable.

  - Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
    hypercall has no 1G stride, so 1G folios are processed as a
    sequence of 2M chunks. Folios whose order is neither PMD_ORDER nor
    PUD_ORDER (e.g. mTHP) fall back to single-page stride; mapping
    them as 2M would fail in the hypervisor anyway.
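
The stride decision described above can be sketched in user space as
follows. This is only an illustration, assuming 4K base pages with
PMD_ORDER = 9 and PUD_ORDER = 18 (as on x86-64); the SKETCH_* constants
and sketch_chunk_stride() are hypothetical stand-ins for the kernel
macros and for mshv_chunk_stride(), and the folio order and PFN are
passed in directly instead of being derived from a struct page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for 4K base pages; the kernel provides the real macros. */
#define SKETCH_PMD_ORDER	9u		/* 2M = 512 * 4K  */
#define SKETCH_PUD_ORDER	18u		/* 1G = 512 * 2M  */
#define SKETCH_PTRS_PER_PMD	512u

static bool sketch_aligned_2m(uint64_t v)
{
	return (v & (SKETCH_PTRS_PER_PMD - 1)) == 0;
}

/* Return the stride in 4K pages: 512 for a 2M chunk, otherwise 1. */
static unsigned int sketch_chunk_stride(unsigned int folio_order,
					uint64_t pfn, uint64_t gfn,
					uint64_t page_count)
{
	/* Only 2M and 1G folios qualify; e.g. mTHP orders fall back to 4K. */
	if (folio_order != SKETCH_PMD_ORDER && folio_order != SKETCH_PUD_ORDER)
		return 1;

	/* A 2M-aligned tail page of a 1G folio passes the PFN check. */
	if (!sketch_aligned_2m(pfn) || !sketch_aligned_2m(gfn) ||
	    !sketch_aligned_2m(page_count))
		return 1;

	/* Always 2M: a 1G folio is fed in as 512 consecutive 2M chunks. */
	return 1u << SKETCH_PMD_ORDER;
}
```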

Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
Changes in v3:
- Fixed various corner cases reported by Sashiko.
- Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com

Changes in v2:
- Handled the case where we can have 2M aligned pages in the middle of a
  1G page
- Brought back the page order check but expanded it to include 1G
- Clamp stride to requested page count in mshv_region_process_chunk
- Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
---
 drivers/hv/mshv_regions.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..1756b733968c 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -29,29 +29,28 @@
  * Uses huge page stride if the backing page is huge and the guest mapping
  * is properly aligned; otherwise falls back to single page stride.
  *
- * Return: Stride in pages, or -EINVAL if page order is unsupported.
+ * Return: Stride in pages.
  */
-static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
+				      u64 page_count)
 {
-	unsigned int page_order;
+	unsigned int page_order = folio_order(page_folio(page));
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
-	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * page must be compound, the page's PFN must itself be 2M-aligned
+	 * (so that a 2M-aligned tail page of a larger folio is acceptable),
+	 * and both gfn and page_count must be huge-page aligned.
 	 */
-	if (!PageCompound(page) || !PageHead(page) ||
+	if (!PageCompound(page) ||
+	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
+	    !IS_ALIGNED(page_count, PTRS_PER_PMD) ||
+	    (page_order != PMD_ORDER && page_order != PUD_ORDER))
 		return 1;
 
-	page_order = folio_order(page_folio(page));
-	/* The hypervisor only supports 2M huge page */
-	if (page_order != PMD_ORDER)
-		return -EINVAL;
-
-	return 1 << page_order;
+	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
+	return 1 << PMD_ORDER;
 }
 
 /**
@@ -86,15 +85,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 	u64 gfn = region->start_gfn + page_offset;
 	u64 count;
 	struct page *page;
-	int stride, ret;
+	unsigned int stride;
+	int ret;
 
 	page = region->mreg_pages[page_offset];
 	if (!page)
 		return -EINVAL;
 
 	stride = mshv_chunk_stride(page, gfn, page_count);
-	if (stride < 0)
-		return stride;
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < page_count; count += stride) {

---
base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
change-id: 20260416-huge_1g-e44461393c8f

Best regards,
-- 
Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>



* RE: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Michael Kelley @ 2026-05-06 14:36 UTC (permalink / raw)
  To: Naman Jain, Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com, Michael Kelley, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
In-Reply-To: <024aed8c-cd97-45f0-a653-489fc334a2b9@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, May 5, 2026 10:50 PM
> 
> On 5/4/2026 9:36 PM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:58 AM
> >>
> >> On 4/27/2026 11:10 AM, Michael Kelley wrote:
> >>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> >>>>
> >>>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
> >>>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
> >>>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
> >>>> in architecture-specific code.
> >>>>
> >>>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
> >>>> include/asm-generic/mshyperv.h so they are visible to both arch and
> >>>> driver code.
> >>>>
> >>>> Change the return type from void to bool so the caller can determine
> >>>> whether the register page was successfully configured and set
> >>>> mshv_has_reg_page accordingly.
> >>>>
> >>>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >>>> ---
> >>>>    arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
> >>>>    drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
> >>>>    include/asm-generic/mshyperv.h | 17 ++++++++++++
> >>>>    3 files changed, 53 insertions(+), 45 deletions(-)
> >>>>
> >>
> >> <snip>
> >>
> >>>>    #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> >>>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> >>>
> >>> This comment pre-dates your patch, but I don't understand the point
> >>> it is trying to make. The comment is factually true, but I don't know
> >>> why calling that out is relevant. The REG_PAGE MSR seems to be
> >>> conceptually separate and distinct from the SIMP MSR, so the fact
> >>> that the layouts are the same is just a coincidence. Or is there some
> >>> relationship between the two MSRs that I'm not aware of, and the
> >>> comment is trying (and failing?) to point out?
> >>
> >> This was added as per suggestion from Nuno in my initial series for
> >> MSHV_VTL. If the reference in "identical to" is misleading, I should
> >> remove it.
> >>
> >> https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/
> >>
> >> Quoting:
> >> """
> >> it is a generic structure that
> >> appears to be used for several overlay page MSRs (SIMP, SIEF, etc).
> >>
> >> But, the type doesn't appear in the hv*dk headers explicitly; it's just
> >> used internally by the hypervisor.
> >>
> >> I think it should be renamed with a hv_ prefix to indicate it's part of
> >> the hypervisor ABI, and a brief comment with the provenance:
> >>
> >> /* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> >> union hv_synic_overlay_page_msr {
> >> 	/* <snip> */
> >> };
> >
> > OK, so this union is not associated *only* with the REG_PAGE MSR
> > (though that MSR is the only current user). Instead, it is intended to
> > be a more generic description of MSRs that set up overlay pages. I
> > don't think I had previously noticed Nuno's comment on the topic.
> >
> > Looking through hvgdk_mini.h and hvhdk.h, I see 6 definitions that
> > are exactly the same:
> >
> > * union hv_reference_tsc_msr
> > * union hv_x64_msr_hypercall_contents
> > * union hv_vp_assist_msr_contents
> > * union hv_synic_simp
> > * union hv_synic_siefp
> > * union hv_synic_sirbp
> >
> > There's an argument to be made for removing these 6 unique definitions
> > and using union hv_synic_overlay_page_msr instead (though "synic"
> > would need to be removed from the name).  I would not object to such
> > an approach. It's a small extra layer of conceptual indirection, but saves
> > some lines of code for duplicative definitions. The alternative is to drop
> > the idea of a generic overlay page MSR layout, and replace union
> > hv_synic_overlay_page_msr with a definition that is specific to the
> > REG_PAGE MSR, like the other six above.
> >
> 
> Hi Michael,
> 
> While a generic definition would be nice to have here, I can see two
> reasons for not going ahead with a generic overlay page definition:
> 1. All of the above definitions are present in Hyper-V headers and
> generalizing them would deviate from the strategy of keeping the kernel
> headers in line with Hyper-V headers.
> 2. For any of these definitions, if the use-case requires using some of
> these reserved bits, then it would be a problem. I can actually see that
> happening in "hv_x64_msr_hypercall_contents" in the corresponding
> variant in the Hyper-V header.

Your points are certainly valid, and I'm good with not going the
generic route.

> 
> > I could go either way. If we want to use a generic overlay page definition,
> > then that approach should be applied everywhere. With the current
> > state of your patch set, we're halfway in between -- the generic definition
> > is used one place, but duplicative specific MSR definitions are used other
> > places. That's probably the least desirable approach.
> >
> > Michael
> 
> 
> Now, coming back to the hv_synic_overlay_page_msr definition. While
> Nuno's comment hinted at it being "generic", the same is not documented
> in the name of this structure or its comments. So it should be safe to
> assume that it is specific to synic_overlay_page_msr usage. But since it
> is not part of Hyper-V header as such, we needed that comment:
> "/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */"
> 

An "overlay page" is a generic concept in the Hyper-V world, and it is used
in multiple places in the guest<->hypervisor interface. The old PDF version of
the Hyper-V TLFS describes overlay pages in section 5.2.1, entitled "GPA
Overlay Pages". See [1]. Unfortunately, this material about overlay pages
doesn't seem to have been carried over to the web page version of the TLFS.

So in my thinking, the name "hv_synic_overlay_page_msr" is inherently
a generic definition that could be applied to multiple MSRs that are used to
specify overlay pages. Your patch is about a specific MSR,
HV_X64_REGISTER_REG_PAGE, which happens to be used to define an
overlay page. But if the decision is to *not* go the generic route, I
would expect to see something like "union hv_x64_reg_page_msr"
that is specific to the REG_PAGE MSR, and to have that type used in
hv_vtl_configure_reg_page(). The definition of hv_x64_reg_page_msr
would not have a comment referencing the SIMP or any other MSR
because it would be a standalone definition that is specific to
HV_X64_REGISTER_REG_PAGE. Then the pattern would be the same as
the other six cases that I listed above.

When not using the generic approach, hv_synic_overlay_page_msr
really has no purpose, and could go away.
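
To make the non-generic option concrete, here is a rough user-space sketch
of what a REG_PAGE-specific definition could look like. It assumes the
same shape as the other overlay page MSRs (an enable bit, 11 reserved
bits, and a 52-bit guest page number) and assumes a little-endian GCC/Clang
bitfield layout. The sketch_ prefix, the field names, and the helper are
hypothetical and not taken from the Hyper-V headers.

```c
#include <stdint.h>

/*
 * Hypothetical standalone layout for HV_X64_REGISTER_REG_PAGE.
 * Names are illustrative only; no reference to SIMP or other MSRs.
 */
union sketch_hv_x64_reg_page_msr {
	uint64_t as_uint64;
	struct {
		uint64_t enabled  : 1;	/* overlay page is active      */
		uint64_t reserved : 11;
		uint64_t gpn      : 52;	/* guest page number (GPA>>12) */
	};
};

/* Build the MSR value that enables the overlay at a 4K-aligned GPA. */
static uint64_t sketch_reg_page_msr_value(uint64_t gpa)
{
	union sketch_hv_x64_reg_page_msr msr = { .as_uint64 = 0 };

	msr.enabled = 1;
	msr.gpn = gpa >> 12;

	return msr.as_uint64;
}
```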

Michael

[1] https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf


* RE: [PATCH v2] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-05-06 15:13 UTC (permalink / raw)
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Michael Kelley, matthew.ruffell@canonical.com,
	johansen@templeofstupid.com, hargar@linux.microsoft.com
  Cc: stable@vger.kernel.org
In-Reply-To: <20260505004846.193441-1-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, May 4, 2026 5:49 PM
> 
> If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
> the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
> screen.lfb_base being zero [1], there is an MMIO conflict between the
> drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
> hv_pci_allocate_bridge_windows() calls vmbus_allocate_mmio() to get a
> 32-bit MMIO range, it may get an MMIO range that overlaps with the
> framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
> error message "PCI Pass-through VSP failed D0 Entry with status" since
> the host thinks that PCI devices must not use MMIO space that the
> host has assigned to the framebuffer.
> 
> This is especially an issue if pci-hyperv is built-in and hyperv-drm is
> built as a module. Consequently, the kdump/kexec kernel fails to detect
> PCI devices via pci-hyperv, and may fail to mount the root file system,
> which may reside on an NVMe disk. The issue described here has existed
> for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
> worked around on x64 when possible. With the recent introduction of
> ARM64 VMs that boot from NVMe, there is no workaround, so we need a
> formal fix.
> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
> fall back to the low MMIO base, which should be equal to the framebuffer
> MMIO base [2] (the statement is true according to my testing on x64
> Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
> Azure. I checked with the Hyper-V team and they said the statement should
> continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
> is not 0; if the user specifies a very high resolution, it's not enough
> to only reserve 8MB: in this case, reserve half of the space below 4GB,
> but cap the reservation to 128MB, which is the required framebuffer size
> of the highest resolution 7680*4320 supported by Hyper-V.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF for the low MMIO range).
> 
> Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
> of the low MMIO range on CVMs, which have no framebuffers (the
> 'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
> host might treat the beginning of the low MMIO range specially [4]. BTW,
> the OpenHCL kernel is not affected by the change, because that kernel
> boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
> there), and there is no framebuffer device for that kernel.
> 
> Note: normally Gen1 VMs don't have the MMIO conflict issue because the
> framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
> size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
> graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
> MMIO range; however, if the VM is configured to use a very high resolution
> and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
> isn't a typical configuration by users), the hyperv-drm driver may need to
> allocate an MMIO range above 4GB and change the framebuffer MMIO location
> to the allocated MMIO range -- in this case, there can still be issues [3]
> which can't be easily fixed: any possible affected Gen1 users would have
> to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
> VMs.
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> [3] https://lore.kernel.org/all/SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com/
> [4] https://lore.kernel.org/all/SN6PR02MB415726B17D5A6027CD1717E8D4342@SN6PR02MB4157.namprd02.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> 
> Changes since v1 (https://lore.kernel.org/all/20260416183529.838321-1-decui@microsoft.com/):
>   Fixed a typo in the subject: s/logc/logic/.
> 
>   In the commit message, better explained fb_mmio_base is equal to
>   low_mmio_base for Gen2 VMs.
> 
>   Addressed Michael Kelley's comments:
> 
>     In the commit message:
>          Changed the "kdump" to "kdump/kexec" since the described
>          issue is applicable to both kdump and kexec.
> 
>          Provided more detail about the MMIO conflict.
> 
>          Described a scenario where Gen1 VMs can also be affected.
> 
>     Added a pr_warn() in vmbus_reserve_fb() in case the 'start' is 0.
> 
>     Dropped the CVM check in vmbus_reserve_fb(), meaning vmbus_reserve_fb()
>     also reserves MMIO for CVMs.
> 
>   Changed "low_mmio_base >= SZ_4G" to "upper_32_bits(low_mmio_base)"
>   to avoid a compilation warning for the i386 build.
> 
>   Changed "0x%pa" to "%pa", because %pa already adds a "0x" prefix.
> 
> 
> Hi Krister, Matthew, sorry -- I'm not adding your Tested-by's since
> the code changed, though the change is small. If the v2 looks good
> to Michael, please test the patch again.
> 
> Hi Hardik, I'm not adding your Reviewed-by since the patch changed.
> Please review the v2.
> 
>  drivers/hv/vmbus_drv.c | 29 ++++++++++++++++++++++++++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f0d0803d1e16..d73ac5c8dd04 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2327,8 +2327,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
>  		return AE_NO_MEMORY;
> 
>  	/* If this range overlaps the virtual TPM, truncate it. */
> -	if (end > VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> -		end = VTPM_BASE_ADDRESS;
> +	if (end >= VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> +		end = VTPM_BASE_ADDRESS - 1;
> 
>  	new_res->name = "hyperv mmio";
>  	new_res->flags = IORESOURCE_MEM;
> @@ -2395,6 +2395,7 @@ static void vmbus_mmio_remove(void)
>  static void __maybe_unused vmbus_reserve_fb(void)
>  {
>  	resource_size_t start = 0, size;
> +	resource_size_t low_mmio_base;
>  	struct pci_dev *pdev;
> 
>  	if (efi_enabled(EFI_BOOT)) {
> @@ -2402,6 +2403,24 @@ static void __maybe_unused vmbus_reserve_fb(void)
>  		if (IS_ENABLED(CONFIG_SYSFB)) {
>  			start = sysfb_primary_display.screen.lfb_base;
>  			size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
> +
> +			low_mmio_base = hyperv_mmio->start;
> +			if (!low_mmio_base || upper_32_bits(low_mmio_base) ||
> +			    (start && start < low_mmio_base)) {
> +				pr_warn("Unexpected low mmio base %pa\n", &low_mmio_base);
> +			} else {
> +				/*
> +				 * If the kdump kernel's lfb_base is 0,

Nit: The case where lfb_base is 0 applies to kexec and kdump kernels, and
also to CVMs.

> +				 * fall back to the low mmio base.
> +				 */
> +				if (!start)
> +					start = low_mmio_base;
> +				/*
> +				 * Reserve half of the space below 4GB for high
> +				 * resolutions, but cap the reservation to 128MB.
> +				 */
> +				size = min((SZ_4G - start) / 2, SZ_128M);
> +			}
>  		}
>  	} else {
>  		/* Gen1 VM: get FB base from PCI */
> @@ -2422,8 +2441,10 @@ static void __maybe_unused vmbus_reserve_fb(void)
>  		pci_dev_put(pdev);
>  	}
> 
> -	if (!start)
> +	if (!start) {
> +		pr_warn("Unexpected framebuffer mmio base of zero\n");
>  		return;
> +	}
> 
>  	/*
>  	 * Make a claim for the frame buffer in the resource tree under the
> @@ -2433,6 +2454,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
>  	 */
>  	for (; !fb_mmio && (size >= 0x100000); size >>= 1)
>  		fb_mmio = __request_region(hyperv_mmio, start, size, fb_mmio_name, 0);
> +
> +	pr_info("hv_mmio=%pR,%pR fb=%pR\n", hyperv_mmio, hyperv_mmio->sibling, fb_mmio);
>  }

Modulo my nit about the comment,

Reviewed-by: Michael Kelley <mhklinux@outlook.com>
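
As a worked example of the sizing rule in the patch, the sketch below
reproduces "half of the space below 4GB, capped at 128MB" in user space.
sketch_fb_reserve_size() and sketch_fb_bytes() are hypothetical helpers,
and the 4 bytes per pixel used to size the 7680*4320 mode is an
assumption, not something stated in the patch.

```c
#include <stdint.h>

#define SKETCH_SZ_4G	(1ULL << 32)
#define SKETCH_SZ_128M	(128ULL << 20)

/* min((4G - start) / 2, 128M), for a framebuffer base below 4GB. */
static uint64_t sketch_fb_reserve_size(uint64_t start)
{
	uint64_t half = (SKETCH_SZ_4G - start) / 2;

	return half < SKETCH_SZ_128M ? half : SKETCH_SZ_128M;
}

/* Bytes needed for one frame at the given mode and pixel size. */
static uint64_t sketch_fb_bytes(uint64_t width, uint64_t height,
				uint64_t bytes_per_pixel)
{
	return width * height * bytes_per_pixel;
}
```

With a base of 4GB-512MB the cap applies and 128MB is reserved; with a
base of 4GB-128MB only 64MB is reserved. Under the 4-bytes-per-pixel
assumption, a 7680*4320 frame needs 132,710,400 bytes, which fits in
128MB.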


* Re: [PATCH v3] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Stanislav Kinsburskii @ 2026-05-06 16:26 UTC (permalink / raw)
  To: Anirudh Rayabharam (Microsoft)
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv, linux-kernel
In-Reply-To: <20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com>

On Wed, May 06, 2026 at 01:44:53PM +0000, Anirudh Rayabharam (Microsoft) wrote:
> The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
> chunks into 1G mappings when alignment permits, so the driver can
> support 1G hugepages by feeding them in as 2M chunks. Note that this
> is the only way to make 1G mappings; there is no way to directly map
> a 1G hugepage using the hypercall.
> 
> Update mshv_chunk_stride() to:
> 
>   - Accept 2M-aligned tail pages of a larger folio. The previous
>     PageHead() check rejected every page after the head of a 1G
>     hugepage and fell back to 4K mappings for the remaining 1022 MB.
>     Replace it with a PFN alignment check so any 2M-aligned page of a
>     sufficiently large folio is acceptable.
> 
>   - Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
>     hypercall has no 1G stride, so 1G folios are processed as a
>     sequence of 2M chunks. Folios whose order is neither PMD_ORDER nor
>     PUD_ORDER (e.g. mTHP) fall back to single-page stride; mapping
>     them as 2M would fail in the hypervisor anyway.
> 
> Assisted-by: Copilot-CLI:claude-opus-4.7
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
> Changes in v3:
> - Fixed various corner cases reported by Sashiko.
> - Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com
> 
> Changes in v2:

LGTM, thanks.

Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


> - Handled the case where we can have 2M aligned pages in the middle of a
>   1G page
> - Brought back the page order check but expanded it to include 1G
> - Clamp stride to requested page count in mshv_region_process_chunk
> - Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
> ---
>  drivers/hv/mshv_regions.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..1756b733968c 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -29,29 +29,28 @@
>   * Uses huge page stride if the backing page is huge and the guest mapping
>   * is properly aligned; otherwise falls back to single page stride.
>   *
> - * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + * Return: Stride in pages.
>   */
> -static int mshv_chunk_stride(struct page *page,
> -			     u64 gfn, u64 page_count)
> +static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
> +				      u64 page_count)
>  {
> -	unsigned int page_order;
> +	unsigned int page_order = folio_order(page_folio(page));
>  
>  	/*
>  	 * Use single page stride by default. For huge page stride, the
> -	 * page must be compound and point to the head of the compound
> -	 * page, and both gfn and page_count must be huge-page aligned.
> +	 * page must be compound, the page's PFN must itself be 2M-aligned
> +	 * (so that a 2M-aligned tail page of a larger folio is acceptable),
> +	 * and both gfn and page_count must be huge-page aligned.
>  	 */
> -	if (!PageCompound(page) || !PageHead(page) ||
> +	if (!PageCompound(page) ||
> +	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
>  	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> -	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
> +	    !IS_ALIGNED(page_count, PTRS_PER_PMD) ||
> +	    (page_order != PMD_ORDER && page_order != PUD_ORDER))
>  		return 1;
>  
> -	page_order = folio_order(page_folio(page));
> -	/* The hypervisor only supports 2M huge page */
> -	if (page_order != PMD_ORDER)
> -		return -EINVAL;
> -
> -	return 1 << page_order;
> +	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
> +	return 1 << PMD_ORDER;
>  }
>  
>  /**
> @@ -86,15 +85,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  	u64 gfn = region->start_gfn + page_offset;
>  	u64 count;
>  	struct page *page;
> -	int stride, ret;
> +	unsigned int stride;
> +	int ret;
>  
>  	page = region->mreg_pages[page_offset];
>  	if (!page)
>  		return -EINVAL;
>  
>  	stride = mshv_chunk_stride(page, gfn, page_count);
> -	if (stride < 0)
> -		return stride;
>  
>  	/* Start at stride since the first stride is validated */
>  	for (count = stride; count < page_count; count += stride) {
> 
> ---
> base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
> change-id: 20260416-huge_1g-e44461393c8f
> 
> Best regards,
> -- 
> Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 


* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-05-06 16:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Wei, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
	edumazet, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260427131745.2eac52ef@kernel.org>

On Mon, Apr 27, 2026 at 01:17:45PM -0700, Jakub Kicinski wrote:
> On Sat, 25 Apr 2026 01:05:43 -0700 Dipayaan Roy wrote:
> > Hi Jakub,
> > with this new data from David, is it convincing enough for a mana driver
> > specific private flag, which can be set from user space by a udev rule
> > by detecting the underlying platform? If not then I will send the next
> > version with the other rxbuflen approach. 
> 
> I think so, thank you both for the testing.
> Please look out for the net-next opening up and repost the patches.
> (The reopening is delayed, it was supposed to happen already but I
> can't get a clean run out of our CI, sigh)

Hi Jakub,
I have reposted the patches now.

Regards
Dipayaan Roy


* [PATCH net-next v7 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.
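
For readers following the series, the opt-in decision described above can
be sketched in plain userspace C. This is a minimal illustration under
stated assumptions, not the driver code: sketch_port,
sketch_single_rxbuf_per_page() and SKETCH_RXBUF_PAD (a made-up 512-byte
stand-in for MANA_RXBUF_PAD) are hypothetical names, and a 4K page size
is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4K pages, as in the cover letter */
#define SKETCH_RXBUF_PAD 512u   /* hypothetical stand-in for MANA_RXBUF_PAD */

struct sketch_port {
	bool full_page_rx;  /* models the "full-page-rx" private flag */
	bool xdp_attached;  /* models an attached XDP program */
};

/* Returns true when each RX buffer should occupy a whole page. */
bool sketch_single_rxbuf_per_page(const struct sketch_port *p, uint32_t mtu)
{
	assert(p);

	/* User opt-in wins regardless of MTU. */
	if (p->full_page_rx)
		return true;

	/* Otherwise fall back to the automatic rule: XDP or a buffer
	 * that no longer fits into half a page.
	 */
	return mtu + SKETCH_RXBUF_PAD > SKETCH_PAGE_SIZE / 2 || p->xdp_attached;
}
```

With these assumed constants, a 1500-byte MTU stays on the fragment path
unless the flag is set, while a 9000-byte MTU or an attached XDP program
already selects one buffer per page.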

Changes in v7:
  - Rebased onto net-next.
  - Retained private flag approach after David Wei's testing on
    Grace (ARM64) confirmed that fragment mode outperforms
    full-page mode on other platforms, validating this is a
    single-platform workaround rather than a generic issue.
Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropped the SMBIOS string parsing and added an ethtool private
    flag to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
 include/net/mana/mana.h                       |   8 +
 3 files changed, 163 insertions(+), 31 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH net-next v7 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260506170034.327907-1-dipayanroy@linux.microsoft.com>

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)
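
Note on the helper signature: the split-out helper takes a pointer to the
cursor (u8 **) so that strings written inside it advance the caller's
position. A minimal userspace sketch of that pattern follows. All
sketch_* names and the SKETCH_SS_STATS id are hypothetical;
SKETCH_GSTRING_LEN mirrors the 32-byte ETH_GSTRING_LEN slot size, and
sketch_puts() only approximates what ethtool_puts() does.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_GSTRING_LEN 32  /* mirrors ETH_GSTRING_LEN */

enum sketch_stringset {
	SKETCH_SS_STATS = 1,  /* hypothetical id for the stats string set */
};

/* Copy one name into the current fixed-width slot and advance the cursor. */
void sketch_puts(char **data, const char *name)
{
	size_t n;

	assert(data && *data && name);
	n = strlen(name);
	if (n > SKETCH_GSTRING_LEN - 1)
		n = SKETCH_GSTRING_LEN - 1;
	memset(*data, 0, SKETCH_GSTRING_LEN);
	memcpy(*data, name, n);
	*data += SKETCH_GSTRING_LEN;
}

/* Helper takes char ** so the caller's cursor moves with each string. */
void sketch_emit_stats(char **data)
{
	sketch_puts(data, "rx_packets");
	sketch_puts(data, "rx_bytes");
}

/* Dispatcher: returns the number of slots written for the string set. */
int sketch_get_strings(int stringset, char *data)
{
	char *start = data;

	switch (stringset) {
	case SKETCH_SS_STATS:
		sketch_emit_stats(&data);
		break;
	default:
		/* Unknown sets write nothing, as in the refactored driver. */
		break;
	}
	return (int)((data - start) / SKETCH_GSTRING_LEN);
}
```

Passing the cursor by address keeps the dispatcher free to call further
per-set helpers later without each one needing to know where the previous
one stopped.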

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v7 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260506170034.327907-1-dipayanroy@linux.microsoft.com>

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 89 +++++++++++++++++++
 include/net/mana/mana.h                       |  8 ++
 3 files changed, 117 insertions(+), 2 deletions(-)
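
Note on the flag handling: the set path boils down to "compute the changed
bits, reject bits outside the known range, then decide whether a
reconfiguration is needed". The following userspace sketch shows only that
bit arithmetic, in the same order of checks as the patch. It is an
illustration under assumptions: the sketch_* names are hypothetical,
SKETCH_VALID_MASK plays the role of GENMASK(MANA_PRIV_FLAG_MAX - 1, 0),
and the detach/attach sequence is reduced to a needs_reconfig output.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum sketch_flag_bits {
	SKETCH_FLAG_FULL_PAGE_RX = 0,
	SKETCH_FLAG_MAX,
};

#define SKETCH_BIT(n)     (UINT32_C(1) << (n))
/* Valid for SKETCH_FLAG_MAX < 32; covers bits 0..SKETCH_FLAG_MAX-1. */
#define SKETCH_VALID_MASK (SKETCH_BIT(SKETCH_FLAG_MAX) - 1)

/* Returns 0 on success or -EINVAL if an unknown bit is requested.
 * *needs_reconfig is set when the full-page flag actually toggled.
 */
int sketch_set_flags(uint32_t *cur, uint32_t requested, int *needs_reconfig)
{
	uint32_t changed;

	assert(cur && needs_reconfig);
	*needs_reconfig = 0;

	changed = *cur ^ requested;
	if (!changed)
		return 0;  /* nothing to do */

	if (requested & ~SKETCH_VALID_MASK)
		return -EINVAL;  /* unknown bits leave *cur untouched */

	*cur = requested;
	*needs_reconfig = !!(changed & SKETCH_BIT(SKETCH_FLAG_FULL_PAGE_RX));
	return 0;
}
```

In the driver the equivalent of needs_reconfig additionally depends on
whether the port is up; the sketch leaves that out.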

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0


^ permalink raw reply related

* RE: [EXTERNAL] [PATCH v2] scsi: storvsc: Replace symbolic permissions with octal
From: Long Li @ 2026-05-06 19:23 UTC (permalink / raw)
  To: Md Shofiqul Islam, linux-scsi@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	mhklinux@outlook.com
In-Reply-To: <20260506004948.2172-1-shofiqtest@gmail.com>

> 
> Symbolic permissions like S_IRUGO and S_IWUSR are not preferred by
> checkpatch. Replace with their octal equivalents:
> 
>   - S_IRUGO|S_IWUSR -> 0644
>   - S_IRUGO         -> 0444
> 
> Signed-off-by: Md Shofiqul Islam <shofiqtest@gmail.com>

Reviewed-by: Long Li <longli@microsoft.com>
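
For anyone who wants to double-check the two mappings, here is a small
userspace sketch. S_IRUGO exists only in kernel headers, so
SKETCH_S_IRUGO below is a hypothetical stand-in built from the POSIX
bits in <sys/stat.h>; sketch_param_mode() is likewise just an
illustrative helper, not anything in storvsc.

```c
#include <assert.h>
#include <sys/stat.h>

/* Userspace stand-in for the kernel-only S_IRUGO macro:
 * read permission for user, group and other.
 */
#define SKETCH_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Compile-time checks of the two replacements made by the patch. */
static_assert((SKETCH_S_IRUGO | S_IWUSR) == 0644, "S_IRUGO|S_IWUSR is 0644");
static_assert(SKETCH_S_IRUGO == 0444, "S_IRUGO is 0444");

/* Returns the mode for a read-only or owner-writable parameter. */
unsigned int sketch_param_mode(int owner_writable)
{
	return owner_writable ? (SKETCH_S_IRUGO | S_IWUSR) : SKETCH_S_IRUGO;
}
```

Because the checks are static assertions, a wrong octal value would fail
at build time rather than at run time.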


> ---
>  drivers/scsi/storvsc_drv.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 6977ca8a0..571ea5491 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -156,7 +156,7 @@ static bool hv_dev_is_fc(struct hv_device *hv_dev);
>  #define STORVSC_LOGGING_WARN   2
> 
>  static int logging_level = STORVSC_LOGGING_ERROR;
> -module_param(logging_level, int, S_IRUGO|S_IWUSR);
> +module_param(logging_level, int, 0644);
>  MODULE_PARM_DESC(logging_level,
>         "Logging level, 0 - None, 1 - Error (default), 2 - Warning.");
> 
> @@ -345,17 +345,17 @@ static int storvsc_change_queue_depth(struct scsi_device *sdev, int queue_depth)
>  static int storvsc_vcpus_per_sub_channel = 4;
>  static unsigned int storvsc_max_hw_queues;
> 
> -module_param(storvsc_ringbuffer_size, int, S_IRUGO);
> +module_param(storvsc_ringbuffer_size, int, 0444);
>  MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
> 
>  module_param(storvsc_max_hw_queues, uint, 0644);
>  MODULE_PARM_DESC(storvsc_max_hw_queues, "Maximum number of hardware queues");
> 
> -module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
> +module_param(storvsc_vcpus_per_sub_channel, int, 0444);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
> 
>  static int ring_avail_percent_lowater = 10;
> -module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +module_param(ring_avail_percent_lowater, int, 0444);
>  MODULE_PARM_DESC(ring_avail_percent_lowater,
>                 "Select a channel if available ring size > this in percent");
> --
> 2.54.0.windows.1


^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Krister Johansen @ 2026-05-06 23:12 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, linux-hyperv, linux-kernel,
	mhklinux, matthew.ruffell, hargar, stable
In-Reply-To: <20260505004846.193441-1-decui@microsoft.com>

On Mon, May 04, 2026 at 05:48:46PM -0700, Dexuan Cui wrote:
> If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
> the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
> screen.lfb_base being zero [1], there is an MMIO conflict between the
> drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
> hv_pci_allocate_bridge_windows() calls vmbus_allocate_mmio() to get a
> 32-bit MMIO range, it may get an MMIO range that overlaps with the
> framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
> error message "PCI Pass-through VSP failed D0 Entry with status" since
> the host thinks that PCI devices must not use MMIO space that the
> host has assigned to the framebuffer.
> 
> This is especially an issue if pci-hyperv is built-in and hyperv-drm is
> built as a module. Consequently, the kdump/kexec kernel fails to detect
> PCI devices via pci-hyperv, and may fail to mount the root file system,
> which may reside in a NVMe disk. The issue described here has existed
> for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
> worked around on x64 when possible. With the recent introduction of
> ARM64 VMs that boot from NVMe, there is no workaround, so we need a
> formal fix.
> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
> fall back to the low MMIO base, which should be equal to the framebuffer
> MMIO base [2] (the statement is true according to my testing on x64
> Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
> Azure. I checked with the Hyper-V team and they said the statement should
> continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
> is not 0; if the user specifies a very high resolution, it's not enough
> to only reserve 8MB: in this case, reserve half of the space below 4GB,
> but cap the reservation to 128MB, which is the required framebuffer size
> of the highest resolution 7680*4320 supported by Hyper-V.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF for the low MMIO range).
> 
> Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
> of the low MMIO range on CVMs, which have no framebuffers (the
> 'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
> host might treat the beginning of the low MMIO range specially [4]. BTW,
> the OpenHCL kernel is not affected by the change, because that kernel
> boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
> there), and there is no framebuffer device for that kernel.
> 
> Note: normally Gen1 VMs don't have the MMIO conflict issue because the
> framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
> size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
> graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
> MMIO range; however, if the VM is configured to use a very high resolution
> and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
> isn't a typical configuration by users), the hyperv-drm driver may need to
> allocate an MMIO range above 4GB and change the framebuffer MMIO location
> to the allocated MMIO range -- in this case, there can still be issues [3]
> which can't be easily fixed: any possible affected Gen1 users would have
> to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
> VMs.
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> [3] https://lore.kernel.org/all/SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com/
> [4] https://lore.kernel.org/all/SN6PR02MB415726B17D5A6027CD1717E8D4342@SN6PR02MB4157.namprd02.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
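
As a sanity check of the numbers in the description above, the sizing
rule and the inclusive-end comparison can be sketched in userspace C.
This is only my reading of the text, not the patch's code: the sketch_*
names are hypothetical, 4 bytes per pixel is an assumption, and "half of
the space below 4GB" is modelled as half of a caller-supplied low MMIO
size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIB(n) ((uint64_t)(n) << 20)

/* Bytes needed for a width x height framebuffer. */
uint64_t sketch_fb_bytes(uint32_t width, uint32_t height,
			 uint32_t bytes_per_pixel)
{
	return (uint64_t)width * height * bytes_per_pixel;
}

/* Half of the low MMIO space, capped at 128 MiB. */
uint64_t sketch_fb_reserve_size(uint64_t low_mmio_size)
{
	uint64_t half = low_mmio_size / 2;
	uint64_t cap = SKETCH_MIB(128);

	return half < cap ? half : cap;
}

/* True if addr lies in [start, end]. Because end is inclusive, the
 * upper comparison must be >=, not >.
 */
bool sketch_range_contains(uint64_t start, uint64_t end, uint64_t addr)
{
	assert(start <= end);
	return start <= addr && end >= addr;
}
```

At 4 bytes per pixel, 7680*4320 needs 132,710,400 bytes, which fits under
the 128 MiB cap (134,217,728 bytes). The last helper shows why a strict
"end > addr" test would miss an address equal to the inclusive end.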
 
Thanks for the updated patch.  I re-tested this on a D2pdsv6 and was
able to confirm that without the patch the NIC drivers in the dump
environment didn't attach because of a PCI conflict. With the patch the
drivers attached and it was possible to successfully collect a kdump.

Tested-by: Krister Johansen <kjlx@templeofstupid.com>

-K

^ permalink raw reply

