Linux-HyperV List
* RE: [PATCH v2] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Michael Kelley @ 2026-04-28  0:20 UTC
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177730104962.21733.4130809041576931551.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, April 27, 2026 7:44 AM
> 
> Restore interrupt state before breaking out of the loop on error.
> 
> The irq_flags are saved before entering the loop, but the early exit
> path on error fails to restore them. This leaves interrupts in an
> inconsistent state and can lead to lockdep warnings or other
> interrupt-related issues.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index ab210a7fcb8c3..61291ec6f3468 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -229,8 +229,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  			} else {
>  				pfnlist[i] = mmio_spa + done + i;
>  			}
> -		if (ret)
> +		if (ret) {
> +			local_irq_restore(irq_flags);
>  			break;
> +		}
> 

This looks good for fixing the immediate bug.

But I'd note that this error path occurs solely based on the
if (index >= page_struct_count) test in the preceding 'for' loop. That test is a
"can't happen" sanity test that never triggers if hv_do_map_gpa_hcall()
is coded correctly. At the beginning of the function there are validations of
the input arguments, which is reasonable. But this sanity test isn't based
on the input arguments, and it adds non-trivial complexity to the code
because of the nested loops and the need to figure out where the two
"break" statements go. I'd argue for dropping the sanity test entirely,
along with this test of 'ret' and the need to restore the interrupt state.

Michael
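
For readers following the control flow, here is a minimal userspace sketch
of the pattern under discussion. It assumes hypothetical stand-ins
(fake_irq_save(), fake_irq_restore(), map_batches()) rather than the real
local_irq_save()/local_irq_restore() and hv_do_map_gpa_hcall(), and only
models why a "break" taken between save and restore leaves the state
unbalanced:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU interrupt-enable flag. */
static bool irqs_enabled = true;

static unsigned long fake_irq_save(void)
{
	unsigned long flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

static void fake_irq_restore(unsigned long flags)
{
	assert(!irqs_enabled);	/* restore must pair with a prior save */
	irqs_enabled = flags != 0;
}

/*
 * Model of the nested loops: the outer loop handles one batch per
 * iteration with IRQs off, the inner loop walks entries and runs the
 * sanity test (index >= limit). restore_on_error selects the patched
 * behaviour.
 */
static int map_batches(int batches, int per_batch, int limit,
		       bool restore_on_error)
{
	int index = 0, ret = 0;

	for (int b = 0; b < batches; b++) {
		unsigned long flags = fake_irq_save();

		for (int i = 0; i < per_batch; i++, index++) {
			if (index >= limit) {	/* the "can't happen" test */
				ret = -1;
				break;		/* leaves only the inner loop */
			}
		}
		if (ret) {
			if (restore_on_error)
				fake_irq_restore(flags);
			break;			/* leaves the outer loop */
		}

		fake_irq_restore(flags);
	}
	return ret;
}
```

With restore_on_error false, a tripped check returns with irqs_enabled
still false, which is the situation the patch fixes. Dropping the sanity
test would remove both "break" statements from this model.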


* Re: [PATCH 1/2] hv_sock: fix ARM64 support
From: Jakub Kicinski @ 2026-04-28  0:06 UTC
  To: Hamza Mahfooz
  Cc: linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Himadri Pandya,
	Michael Kelley, linux-hyperv, virtualization, netdev,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel, stable
In-Reply-To: <20260425181719.1538483-1-hamzamahfooz@linux.microsoft.com>

On Sat, 25 Apr 2026 11:17:18 -0700 Hamza Mahfooz wrote:
> VMBUS ring buffers must be page aligned. Therefore, the current value of
> 24K presents a challenge on ARM64 kernels (with 64K pages). So, use
> VMBUS_RING_SIZE() to ensure they are always aligned and large enough to
> hold all of the relevant data.

Please split the fixes into two independent postings.
They have to go via different trees, AFAICT.


* Re: [PATCH net-next] net: mana: Force single RX buffer per page for CVM/encrypted guest memory
From: Jakub Kicinski @ 2026-04-27 23:21 UTC
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, 27 Apr 2026 06:51:02 -0700 Dipayaan Roy wrote:
> When page_pool allocates sub-page RX buffer fragments, the bounce buffer
> granularity may not align with these smaller fragment sizes, leading to
> failure in mana driver rx path.
> 
> Refactor the RX buffer decision into mana_use_single_rxbuf_per_page().
> When cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) is true, the driver is
> forced to use a single RX buffer per page. This ensures:
> - Each RX buffer is exactly one PAGE_SIZE.
> - The DMA offset is always 0.
> - SWIOTLB maps full, page-aligned blocks.

As commented on your RFC - I'm not sure why you need this.


* Re: [PATCH net-next v2 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Jakub Kicinski @ 2026-04-27 23:19 UTC
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya
In-Reply-To: <20260427132807.1642290-1-gargaditya@linux.microsoft.com>

On Mon, 27 Apr 2026 06:23:33 -0700 Aditya Garg wrote:
> The MANA driver can fail to load on systems with high memory
> utilization because several allocations in the queue setup paths
> require large physically contiguous blocks via kmalloc. Under memory
> fragmentation these high-order allocations may fail, preventing the
> driver from creating queues at probe time or when reconfiguring
> channels, ring parameters or MTU at runtime.

net-next wasn't open yet when you posted.
Please resubmit in a couple of days.
-- 
pw-bot: defer


* Re: [RFC PATCH net-next] net: mana: Force single RX buffer per page under SWIOTLB bounce modes
From: Jakub Kicinski @ 2026-04-27 22:44 UTC
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, 27 Apr 2026 07:41:11 -0700 Dipayaan Roy wrote:
> In both cases, sub-page RX buffer fragments allocated via page_pool may
> not be compatible with bounce buffering in this mode, leading to failures
> in the RX path.

What does it mean to not be compatible with swiotlb?
Normally that indicates that DMA API is mis-used.


* [PATCH v4 3/3] mshv: unmap debugfs stats pages on kexec
From: Jork Loeser @ 2026-04-27 21:38 UTC
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	Anirudh Rayabharam, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260427213855.1675044-1-jloeser@linux.microsoft.com>

On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.

Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs.c | 7 ++++++-
 drivers/hv/mshv_synic.c   | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index 418b6dc8f3c2..3c3e02237ae9 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
 
 	mshv_debugfs = debugfs_create_dir("mshv", NULL);
 	if (IS_ERR(mshv_debugfs)) {
+		err = PTR_ERR(mshv_debugfs);
+		mshv_debugfs = NULL;
 		pr_err("%s: failed to create debugfs directory\n", __func__);
-		return PTR_ERR(mshv_debugfs);
+		return err;
 	}
 
 	if (hv_root_partition()) {
@@ -710,6 +712,9 @@ int __init mshv_debugfs_init(void)
 
 void mshv_debugfs_exit(void)
 {
+	if (!mshv_debugfs)
+		return;
+
 	mshv_debugfs_parent_partition_remove();
 
 	if (hv_root_partition()) {
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 978a1cace341..88170ce6b83f 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -723,6 +723,7 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 static int mshv_synic_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
+	mshv_debugfs_exit();
 	cpuhp_remove_state(synic_cpuhp_online);
 	return 0;
 }
-- 
2.43.0



* [PATCH v4 2/3] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-27 21:38 UTC
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	Anirudh Rayabharam, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260427213855.1675044-1-jloeser@linux.microsoft.com>

The reboot notifier that tears down the SynIC cpuhp state guards the
cleanup with hv_root_partition(), so on L1VH (where
hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
cleaned up before kexec. The kexec'd kernel then inherits stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.

Remove the hv_root_partition() guard so the cleanup runs for all
parent partitions.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 2db3b0192eac..978a1cace341 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -723,9 +723,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 static int mshv_synic_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
-	if (!hv_root_partition())
-		return 0;
-
 	cpuhp_remove_state(synic_cpuhp_online);
 	return 0;
 }
-- 
2.43.0



* [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-27 21:38 UTC
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	Anirudh Rayabharam, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260427213855.1675044-1-jloeser@linux.microsoft.com>

The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).

Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
from under VMBus causes its later cleanup to write SynIC MSRs while
SynIC is disabled, which the hypervisor does not tolerate.

Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
  and nested root partition); on a non-nested root partition VMBus
  does not run, so MSHV must enable/disable them
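
The ownership split listed above can be summarized in a small userspace
sketch. The enum and helper names (synic_res, mshv_managed(),
mshv_check_touch()) are hypothetical and exist only for illustration; they
are not part of the driver:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bitmask of the SynIC resources named above. */
enum synic_res {
	RES_SIMP     = 1u << 0,
	RES_SIEFP    = 1u << 1,
	RES_SCONTROL = 1u << 2,
	RES_SINT0    = 1u << 3,
	RES_SINT5    = 1u << 4,
	RES_SIRBP    = 1u << 5,
};

/* Resources MSHV may enable/disable, given whether VMBus is active. */
static unsigned int mshv_managed(bool vmbus_active)
{
	unsigned int mask = RES_SINT0 | RES_SINT5 | RES_SIRBP;

	/* Without VMBus (non-nested root), MSHV also owns the shared ones. */
	if (!vmbus_active)
		mask |= RES_SIMP | RES_SIEFP | RES_SCONTROL;
	return mask;
}

/* Trap if a code path is about to touch a resource MSHV does not own. */
static void mshv_check_touch(unsigned int res, bool vmbus_active)
{
	assert((mshv_managed(vmbus_active) & res) == res);
}
```

In this model, calling mshv_check_touch(RES_SCONTROL, true) would trip the
assertion, which mirrors the rule that MSHV must leave SCONTROL alone while
VMBus is active.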

While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
register GPAs regardless of the kernel page size, so using PAGE_SHIFT
produces wrong addresses on ARM64 with 64K pages.
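
As a worked example of the shift mismatch, the userspace sketch below
converts a page frame number into a byte address with both shifts. The
frame number 0x12345 and the helper gpa_from_pfn() are made up for
illustration, and a 64K-page kernel is assumed to have PAGE_SHIFT == 16:

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT	12	/* hypervisor pages are always 4 KiB */
#define PAGE_SHIFT_64K		16	/* assumed PAGE_SHIFT on a 64K kernel */

/* Convert a page frame number from a SynIC register into a byte address. */
static uint64_t gpa_from_pfn(uint64_t pfn, unsigned int shift)
{
	assert(shift < 64);
	return pfn << shift;
}
```

For frame number 0x12345, gpa_from_pfn(0x12345, HV_HYP_PAGE_SHIFT) gives
0x12345000, while the PAGE_SHIFT_64K variant gives 0x123450000, an address
16 times too large. The same factor applies in the other direction when
virt_to_phys() is shifted right to program the register.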

Note that initialization order matters: VMBus first, MSHV second,
and the reverse on de-init. Ideally, we would want a dedicated SynIC
driver that replaces the cross-dependencies with a clear API and
dynamic tracking. Such a refactor should go into its own dedicated
series, outside of this kexec fix series.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/hv.c         |   3 +
 drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
 2 files changed, 103 insertions(+), 50 deletions(-)

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index ae60fd542292..ef4b1b03395d 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -272,6 +272,9 @@ void hv_synic_free(void)
 /*
  * hv_hyp_synic_enable_regs - Initialize the Synthetic Interrupt Controller
  * with the hypervisor.
+ *
+ * Note: When MSHV is present, mshv_synic_cpu_init() initializes further
+ * registers later.
  */
 void hv_hyp_synic_enable_regs(unsigned int cpu)
 {
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index e2288a726fec..2db3b0192eac 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -13,6 +13,7 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/cpuhotplug.h>
+#include <linux/hyperv.h>
 #include <linux/reboot.h>
 #include <asm/mshyperv.h>
 #include <linux/acpi.h>
@@ -456,46 +457,75 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
 	union hv_synic_sint sint;
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 			&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 			&spages->synic_event_ring_page;
+	/*
+	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
+	 * See hv_hyp_synic_enable_regs() for that initialization.
+	 */
+	bool vmbus_active = hv_vmbus_exists();
 
-	/* Setup the Synic's message page */
+	/*
+	 * Map the SYNIC message page. When VMBus is not active the
+	 * hypervisor pre-provisions the SIMP GPA but may not set
+	 * simp_enabled — enable it here.
+	 */
 	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = true;
+	if (!vmbus_active) {
+		simp.simp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
 			     HV_HYP_PAGE_SIZE,
 			     MEMREMAP_WB);
 
 	if (!(*msg_page))
-		return -EFAULT;
-
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+		goto cleanup_simp;
 
-	/* Setup the Synic's event flags page */
+	/*
+	 * Map the event flags page. Same as SIMP: enable when
+	 * VMBus is not active, already enabled by VMBus otherwise.
+	 */
 	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = true;
-	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
-				     PAGE_SIZE, MEMREMAP_WB);
+	if (!vmbus_active) {
+		siefp.siefp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	}
+	*event_flags_page = memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
+				     HV_HYP_PAGE_SIZE, MEMREMAP_WB);
 
 	if (!(*event_flags_page))
-		goto cleanup;
-
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+		goto cleanup_siefp;
 
 	/* Setup the Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
-	sirbp.sirbp_enabled = true;
-	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
-				    PAGE_SIZE, MEMREMAP_WB);
 
-	if (!(*event_ring_page))
-		goto cleanup;
+	if (hv_root_partition()) {
+		*event_ring_page = memremap(sirbp.base_sirbp_gpa << HV_HYP_PAGE_SHIFT,
+					    HV_HYP_PAGE_SIZE, MEMREMAP_WB);
 
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+	} else {
+		/*
+		 * On L1VH the hypervisor does not provide a SIRBP page.
+		 * Allocate one and program its GPA into the MSR.
+		 */
+		*event_ring_page = (struct hv_synic_event_ring_page *)
+			get_zeroed_page(GFP_KERNEL);
+
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+
+		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+				>> HV_HYP_PAGE_SHIFT;
+	}
+
+	sirbp.sirbp_enabled = true;
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
 	if (mshv_sint_irq != -1)
@@ -518,28 +548,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
-	/* Enable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 1;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it already enabled SCONTROL. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 1;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 
-cleanup:
-	if (*event_ring_page) {
-		sirbp.sirbp_enabled = false;
-		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-		memunmap(*event_ring_page);
-	}
-	if (*event_flags_page) {
+cleanup_siefp:
+	if (*event_flags_page)
+		memunmap(*event_flags_page);
+	if (!vmbus_active) {
 		siefp.siefp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
-		memunmap(*event_flags_page);
 	}
-	if (*msg_page) {
+cleanup_simp:
+	if (*msg_page)
+		memunmap(*msg_page);
+	if (!vmbus_active) {
 		simp.simp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-		memunmap(*msg_page);
 	}
 
 	return -EFAULT;
@@ -548,16 +580,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 static int mshv_synic_cpu_exit(unsigned int cpu)
 {
 	union hv_synic_sint sint;
-	union hv_synic_simp simp;
-	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 		&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 		&spages->synic_event_ring_page;
+	/* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
+	bool vmbus_active = hv_vmbus_exists();
 
 	/* Disable the interrupt */
 	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
@@ -574,28 +605,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 	if (mshv_sint_irq != -1)
 		disable_percpu_irq(mshv_sint_irq);
 
-	/* Disable Synic's event ring page */
+	/* Disable SYNIC event ring page owned by MSHV */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-	memunmap(*event_ring_page);
 
-	/* Disable Synic's event flags page */
-	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	if (hv_root_partition()) {
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		memunmap(*event_ring_page);
+	} else {
+		sirbp.base_sirbp_gpa = 0;
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		free_page((unsigned long)*event_ring_page);
+	}
+
+	/*
+	 * Release our mappings of the message and event flags pages.
+	 * When VMBus is not active, we enabled SIMP/SIEFP — disable
+	 * them. Otherwise VMBus owns the MSRs — leave them.
+	 */
 	memunmap(*event_flags_page);
+	if (!vmbus_active) {
+		union hv_synic_simp simp;
+		union hv_synic_siefp siefp;
 
-	/* Disable Synic's message page */
-	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+		siefp.siefp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+		simp.simp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	memunmap(*msg_page);
 
-	/* Disable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 0;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it owns SCONTROL — leave it. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 0;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 }
-- 
2.43.0



* [PATCH v4 0/3] Hyper-V: kexec fixes for L1VH (mshv)
From: Jork Loeser @ 2026-04-27 21:38 UTC
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	Anirudh Rayabharam, linux-kernel, linux-arch, Jork Loeser

This series fixes kexec support when Linux runs as an L1 Virtual Host
(L1VH) under Hyper-V, using the MSHV driver to manage child VMs.

1-2. SynIC cleanup: the MSHV driver manages its own SynIC resources
     separately from vmbus. Add proper teardown of MSHV-owned SINTs
     and SIRBP on kexec, scoped to only the resources MSHV owns.
     Use hv_vmbus_exists() to decide at runtime whether VMBus owns
     SIMP/SIEFP/SCONTROL (so MSHV must not touch them) or whether
     MSHV must manage them itself (bare root partition without VMBus).
     Also fix SIEFP and SIRBP address calculations to use
     HV_HYP_PAGE_SHIFT instead of PAGE_SHIFT, which produces wrong
     addresses on ARM64 with 64K pages.

3.   Debugfs stats pages: unmap the VP statistics overlay pages before
     kexec to avoid machine check exceptions when the new kernel
     reuses those physical pages.

Changes since v3:
- Dropped patches 1-3 (vmbus variable shadowing, stimer cleanup,
  LP/VP skip), now merged via hyperv-next.
- Patch 1: fix SIEFP and SIRBP memremap()/virt_to_phys() to use
  HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of PAGE_SHIFT/PAGE_SIZE.

Changes since v2:
- Rebased onto linux-next/master to adapt to the upstream SynIC
  refactor (commit 5a674ef871fe, "mshv: refactor synic init and
  cleanup").

Changes since v1:
- Patch 1: account for nested root partitions where VMBus is also
  active (not just L1VH); use a vmbus_active local variable; allocate
  the SIRBP page in the L1VH path, where the hypervisor doesn't
  pre-provision it.

Jork Loeser (3):
  mshv: limit SynIC management to MSHV-owned resources
  mshv: clean up SynIC state on kexec for L1VH
  mshv: unmap debugfs stats pages on kexec

 drivers/hv/hv.c           |   3 +
 drivers/hv/mshv_debugfs.c |   7 +-
 drivers/hv/mshv_synic.c   | 154 +++++++++++++++++++++++++-------------
 3 files changed, 110 insertions(+), 54 deletions(-)

-- 
2.43.0



* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-27 20:17 UTC
  To: Dipayaan Roy
  Cc: David Wei, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
	edumazet, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <aex119OtL8CEGXkb@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Sat, 25 Apr 2026 01:05:43 -0700 Dipayaan Roy wrote:
> Hi Jakub,
> with this new data from David, is it convincing enough for a mana driver
> specific private flag, which can be set from user space by a udev rule
> by detecting the underlying platform? If not then I will send the next
> version with the other rxbuflen approach. 

I think so, thank you both for the testing.
Please look out for the net-next opening up and repost the patches.
(The reopening is delayed, it was supposed to happen already but I
can't get a clean run out of our CI, sigh)


* RE: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Michael Kelley @ 2026-04-27 19:05 UTC
  To: Dexuan Cui, Hamza Mahfooz, Saurabh Singh Sengar
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Himadri Pandya, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel@lists.freedesktop.org, stable@kernel.vger.org
In-Reply-To: <SA1PR21MB6921B5A2441DB3E9A1312AC0BF362@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Monday, April 27, 2026 11:42 AM
> 
> > From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > Sent: Monday, April 27, 2026 4:52 AM
> > To: Saurabh Singh Sengar <ssengar@microsoft.com>
> > ...
> > On Sun, Apr 26, 2026 at 05:00:24AM +0000, Saurabh Singh Sengar wrote:
> > > > Subject: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
> > > >
> > > > VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
> > > > ensure they are always aligned and large enough to hold all of the relevant
> > > > data.
> > > >
> > > > Cc: stable@kernel.vger.org
> > > > Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic
> > > >  video device")
> 
> IMO the Fixes tag is unnecessary because the existing VMBUS_RING_BUFSIZE
> is 256KB, which is already aligned to 4KB, 16KB and 64KB.
> 
> VMBUS_RING_SIZE(256 * 1024) is still 256KB.

Not always. If PAGE_SIZE is 64KiB, VMBUS_RING_SIZE(256 * 1024) is
320KiB. If PAGE_SIZE is 16KiB or 4KiB, then VMBUS_RING_SIZE(256 * 1024)
is indeed 256 KiB. See the explanation in the comment for VMBUS_RING_SIZE.

Michael
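
The arithmetic above can be reproduced with a small userspace model. This
is a sketch under two assumptions that should be checked against
include/linux/hyperv.h: that the ring header (struct hv_ring_buffer)
occupies exactly one page, and that VMBUS_RING_SIZE() adds the header to
the payload only when the payload is smaller than ten headers. The helper
names page_align() and vmbus_ring_size() are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of page_size, which must be a power of two. */
static size_t page_align(size_t x, size_t page_size)
{
	assert(page_size && (page_size & (page_size - 1)) == 0);
	return (x + page_size - 1) & ~(page_size - 1);
}

/*
 * Model of VMBUS_RING_SIZE() for a given page size.
 * Assumption: a payload of at least 10 headers is taken to include the
 * header already; a smaller payload gets one header added on top.
 */
static size_t vmbus_ring_size(size_t payload, size_t page_size)
{
	size_t header = page_size;	/* assumed sizeof(struct hv_ring_buffer) */
	size_t adj = payload >= 10 * header ? 0 : header;

	return page_align(adj + payload, page_size);
}
```

Under these assumptions, a 256 KiB payload stays at 256 KiB for 4 KiB and
16 KiB pages (the payload is at least ten headers), but becomes 320 KiB for
64 KiB pages, because 256 KiB is less than 640 KiB and one 64 KiB header is
added.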

> 
> > > > Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > > > ---
> > > >  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > > b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > > index 051ecc526832..753d97bff76f 100644
> > > > --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > > +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > > @@ -10,7 +10,7 @@
> > > >
> > > >  #include "hyperv_drm.h"
> > > >
> > > > -#define VMBUS_RING_BUFSIZE (256 * 1024)
> > > > +#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
> > > >  #define VMBUS_VSP_TIMEOUT (10 * HZ)
> > > >
> > > >  #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
> > > > --
> > > > 2.54.0
> > >
> > > Although this lgtm, but this may change the behaviour on ARM64 systems
> > > with page size > 4K ?
> 
> Actually the behavior won't change, because
> VMBUS_RING_SIZE(256 * 1024) is still 256KB.
> 
> > > Have we tested it ?
> >
> > Yup, I tested it on an ARM64 windows machine with a 64K page size guest
> > kernel.
> >
> > >
> > > Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> >
> > Pushed to drm-misc.



* RE: [PATCH 1/2] hv_sock: fix ARM64 support
From: Dexuan Cui @ 2026-04-27 18:53 UTC
  To: Hamza Mahfooz, linux-kernel@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Himadri Pandya, Michael Kelley,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Deepak Rawat, dri-devel@lists.freedesktop.org,
	stable@kernel.vger.org
In-Reply-To: <20260425181719.1538483-1-hamzamahfooz@linux.microsoft.com>

> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> Sent: Saturday, April 25, 2026 11:17 AM
> ...
> Cc: stable@kernel.vger.org
> Fixes: 77ffe33363c0 ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication")

It looks like 77ffe33363c0 was not tested with
CONFIG_ARM64_64K_PAGES=y...

Thanks for the fix!

> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>

Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>


* RE: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Dexuan Cui @ 2026-04-27 18:42 UTC
  To: Hamza Mahfooz, Saurabh Singh Sengar
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Himadri Pandya, Michael Kelley, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel@lists.freedesktop.org, stable@kernel.vger.org
In-Reply-To: <ae9NxmDBTkzPP3H6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> Sent: Monday, April 27, 2026 4:52 AM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>
> ...
> On Sun, Apr 26, 2026 at 05:00:24AM +0000, Saurabh Singh Sengar wrote:
> > > Subject: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
> > >
> > > VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
> > > ensure they are always aligned and large enough to hold all of the relevant
> > > data.
> > >
> > > Cc: stable@kernel.vger.org
> > > Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic
> > >  video device")

IMO the Fixes tag is unnecessary because the existing VMBUS_RING_BUFSIZE
is 256KB, which is already aligned to 4KB, 16KB and 64KB.

VMBUS_RING_SIZE(256 * 1024) is still 256KB.

> > > Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > > ---
> > >  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > index 051ecc526832..753d97bff76f 100644
> > > --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > > @@ -10,7 +10,7 @@
> > >
> > >  #include "hyperv_drm.h"
> > >
> > > -#define VMBUS_RING_BUFSIZE (256 * 1024)
> > > +#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
> > >  #define VMBUS_VSP_TIMEOUT (10 * HZ)
> > >
> > >  #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
> > > --
> > > 2.54.0
> >
> > Although this lgtm, but this may change the behaviour on ARM64 systems
> > with page size > 4K ?

Actually the behavior won't change, because
VMBUS_RING_SIZE(256 * 1024) is still 256KB.

> > Have we tested it ?
> 
> Yup, I tested it on an ARM64 windows machine with a 64K page size guest
> kernel.
> 
> >
> > Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> 
> Pushed to drm-misc.



* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-27 18:40 UTC
  To: Jason Gunthorpe
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260426131555.GA3501894@ziepe.ca>

On Sun, Apr 26, 2026 at 10:15:55AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2026 at 09:16:28AM -0700, Dipayaan Roy wrote:
> > During Function Level Reset recovery, the MANA driver reads
> > hardware BAR0 registers that may temporarily contain garbage values.
> > The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> > to compute gc->shm_base, which is later dereferenced via readl() in
> > mana_smc_poll_register(). If the hardware returns an unaligned or
> > out-of-range value, the driver must not blindly use it, as this would
> > propagate the hardware error into a kernel crash.
> 
> It is not what we are calling "hardening" if you are hitting actual
> crashes in actual real systems.
> 
> "hardening" is the driver defending against actively malicious
> hardware, operating in ways that will never be seen in real systems,
> attempting to compromise the kernel.
> 
> Drivers working around real world broken/buggy/malfunctioning HW is
> just entirely normal stuff.
>
Hi Jason,

Sure, I will drop the hardening label in v2.
> > @@ -73,10 +74,25 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >  
> >  	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> > +	if (sriov_base_off >= gc->bar0_size ||
> > +	    !IS_ALIGNED(sriov_base_off, sizeof(u32))) {
> > +		dev_err(gc->dev,
> > +			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > +			sriov_base_off, (u64)gc->bar0_size);
> > +		return -EPROTO;
> > +	}
> 
> .. and if it is entirely normal and something that happens is EPROTO
> really the right way to deal with this race, or should the driver be
> looping somehow until the device stabilizes??
This is the current flow of the driver:
mana_serv_reset()
	mana_gd_suspend()
	msleep(MANA_SERVICE_PERIOD * 1000); -> 10s
	mana_gd_resume()
		mana_gd_init_registers();  -> read the garbage followed by fault
	mana_serv_rescan(); -> On mana_gd_resume err(EPROTO) full PCI remove + rescan

The 10-second sleep in mana_serv_reset() already happens before
mana_gd_resume() is called, so by the time we read the registers the
hardware should have stabilized. If we still see garbage after 10
seconds, it suggests a deeper hardware issue, for which a PCI rescan
is recommended by HW. This patch returns -EPROTO on detecting an
unaligned or out-of-range offset, and that error code triggers
mana_serv_rescan().
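
The range and alignment test described above can be modeled in userspace
roughly as follows. This is only a sketch: validate_bar_offset() is a
hypothetical name, not a MANA driver function, and the alignment is
passed in as a parameter rather than taken from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Reject a register-supplied offset that lies outside the BAR or is not
 * aligned to the access width (align must be a power of two).
 */
static int validate_bar_offset(uint64_t off, uint64_t bar_size, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	if (off >= bar_size)
		return -EPROTO;	/* out of range */
	if (off & (align - 1))
		return -EPROTO;	/* unaligned for the access width */
	return 0;
}
```

In this model the caller treats any nonzero return as "do not touch the
region", which corresponds to the resume path failing and the rescan
being triggered.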

> 
> Jason

Thanks Jason for the review comments, will post a v2 to drop the
hardening label.

Regards
Dipayaan Roy



* Re: [PATCH V1 10/13] PCI: hv: Build device id for a VMBus device, export PCI devid function
From: Bjorn Helgaas @ 2026-04-27 16:35 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-11-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:36PM -0700, Mukesh R wrote:
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device id as a parameter. This device id refers
> to that specific device during the lifetime of passthru.
> 
> An L1VH VM only contains VMBus based devices. A device id for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device id is needed when attaching devices in
> an L1VH based guest VM. Before building it, a check is done to make sure
> the device is a valid VMBus device.
> 
> In remaining cases, PCI device id is used. So, also make pci device
> id build function public.

s/id/ID/ throughout, including subject line, since "id" is not really
a word by itself.  Well, it *is* a word, but not the one you want here :)

s/pci/PCI/, although I would prefer if you just mentioned the name of
the function instead.  I guess this refers to
hv_build_devid_type_pci()?  Or maybe hv_pci_vmbus_device_id()?
Or hv_build_devid_oftype()?

> +++ b/include/asm-generic/mshyperv.h
> +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +#else   /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{ return 0; }
> +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */

IMO the "IS_ENABLED()" comments here are just clutter since it's only
six lines and it's obvious what the #else and #endif belong to.


* Re: [PATCH V1 08/13] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
From: Bjorn Helgaas @ 2026-04-27 16:31 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-9-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:34PM -0700, Mukesh R wrote:
> Main change here is to rename hv_compose_msi_msg to
> hv_vmbus_compose_msi_msg as we introduce hv_compose_msi_msg in upcoming
> patches that builds MSI messages for both VMBus and non-VMBus cases. VMBus
> is not used on baremetal root partition for example. While at it, replace
> spaces with tabs and fix some formatting involving excessive line wraps.

Would be better to do the whitespace changes in their own patch,
although several of them should just be dropped (see below).

Capitalize subject ("PCI: hv: Rename ...").

Add "()" after function names in subject and commit log.

> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -30,7 +30,7 @@
>   * function's configuration space is zero.
>   *
>   * The rest of this driver mostly maps PCI concepts onto underlying Hyper-V
> - * facilities.  For instance, the configuration space of a function exposed
> + * facilities.	For instance, the configuration space of a function exposed

Oops, this hunk made it worse.  Definitely don't want a tab there.

> @@ -1954,7 +1955,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  			return;
>  		}
>  		/*
> -		 * The vector we select here is a dummy value.  The correct
> +		 * The vector we select here is a dummy value.	The correct

Another tab that should be a space.  Actually, you should just drop
this hunk; the rest of the comment has two spaces after periods, so
this should too.

> @@ -2046,7 +2047,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  
>  		/*
>  		 * Make sure that the ring buffer data structure doesn't get
> -		 * freed while we dereference the ring buffer pointer.  Test
> +		 * freed while we dereference the ring buffer pointer.	Test

Same here.  This makes it worse.

> @@ -2226,7 +2227,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
>  /**
>   * get_bar_size() - Get the address space consumed by a BAR
>   * @bar_val:	Value that a BAR returned after -1 was written
> - *              to it.
> + *		to it.

Just put "to it" on the preceding line.  There's plenty of space
there.

> @@ -2580,7 +2581,7 @@ static void q_resource_requirements(void *context, struct pci_response *resp,
>   * new_pcichild_device() - Create a new child device
>   * @hbus:	The internal struct tracking this root PCI bus.
>   * @desc:	The information supplied so far from the host
> - *              about the device.
> + *		about the device.

Ditto.  If you want to change this, put "about the device" on the
preceding line.

> @@ -3422,7 +3423,7 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
>  	 * vmbus_allocate_mmio() gets used for allocating both device endpoint
>  	 * resource claims (those which cannot be overlapped) and the ranges
>  	 * which are valid for the children of this bus, which are intended
> -	 * to be overlapped by those children.  Set the flag on this claim
> +	 * to be overlapped by those children.	Set the flag on this claim

Another hunk that should be dropped.


* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: David Wei @ 2026-04-27 16:01 UTC (permalink / raw)
  To: Dipayaan Roy, kuba
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <aex119OtL8CEGXkb@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On 2026-04-25 01:05, Dipayaan Roy wrote:
> On Fri, Apr 24, 2026 at 01:05:24PM -0700, David Wei wrote:
>> On 2026-04-23 05:48, Dipayaan Roy wrote:
>>> On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
>>>> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
>>>>> I still see roughly a 5% overhead from the atomic refcount operation
>>>>> itself, but on that platform there is no throughput drop when using
>>>>> page fragments versus full-page mode.
>>>>
>>>> That seems to contradict your claim that it's a problem with a specific
>>>> platform.. Since we're in the merge window I asked David Wei to try to
>>>> experiment with disabling page fragmentation on the ARM64 platforms we
>>>> have at Meta. If it repros we should use the generic rx-buf-len
>>>> ringparam because more NICs may want to implement this strategy.
>>>
>>> Hi Jakub,
>>>
>>> Thanks. I think I was not precise enough in my previous reply.
>>>
>>> What I meant is that the atomic refcount cost itself does not appear to
>>> be unique to the affected platform. I see a similar ~5% overhead on
>>> another ARM64 platform (different vendor) as well. However, on that platform
>>> there is no throughput delta between fragment mode and full-page mode; both reach
>>> line rate.
>>>
>>> On the affected platform, fragment mode shows an additional ~15%
>>> throughput drop versus full-page mode. So the current data suggests that
>>> the atomic overhead is common, but the throughput regression is not
>>> explained by that overhead alone and likely depends on an additional
>>> platform-specific factor.
>>>
>>> Separately, the hardware team collected PCIe traces on the affected
>>> platform and reported stalls in the fragment-mode case that are not seen
>>> in full-page mode. They are still investigating the root cause, but
>>> their current hypothesis is that this is related to that platform’s
>>> PCIe/root-port microarchitecture rather than to page_pool refcounting
>>> alone.
>>>
>>> That said, I agree the right direction depends on whether this
>>> reproduces on other ARM64 platforms. If David is able to reproduce the
>>> same behavior, then using the generic rx-buf-len ringparam sounds like
>>> the better direction.
>>>
>>> Please let me know what David finds, and I can rework the patch
>>> accordingly.
>>
>> I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.
>>
>> Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
>> give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.
>>
>> Use 1 combined queue only for the server. Affinitized its net rx softirq
>> to run on core 4.
>>
>> Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
>> running on a host w/ same hw in the same region. Using 32 queues, no
>> softirq affinities. The idea is to hammer page->pp_ref_count from
>> different cores.
>>
>> * 1 frag/page  -> 32.3 Gbps
>> * 2 frags/page -> 36.0 Gbps
>>
>> Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
>> pp_ref_count goes up, as expected. Is this what you see? When you say
>> there's a +5% overhead, what function?
>>
>> Overall tput is higher with multiple frags. That's to be expected w/
>> page pool.
> 
> Hi David,
> 
> Thanks for running this. Your results are consistent with mine.
> 
> I have tested this on 2 ARM64 platforms from different vendors,
> running ntttcp and iperf3 using 4k as base page size.
> In my observation, both platforms show a 5% overhead in
> napi_pp_put_page (~3.9%) and page_pool_alloc_frag_netmem (~1.9%)
> when running in fragment mode, both stalling on the LSE ldaddal
> atomic that maintains pp_ref_count.
> This seems to match your observation as well. However, in my
> testing one of the platforms shows a 15% drop in throughput in
> fragment mode vs. page mode. The other platform I tested
> in fact performs slightly better in fragment mode than in full-page
> mode (similar to your observation).

That's not what I observe. I don't see napi_pp_put_page at all, and
page_pool_alloc_frag_netmem is actually lower with 2 frags/page (4.06%)
than 1 frag/page (5.73%).

The main difference is in skb_release_data which goes from
0.85% (1 frag/page) to 3.32% (2 frags/page).

> 
> So the atomic refcount overhead appears to be common across ARM64
> platforms, but it does not cause a throughput regression.
> The throughput regression seems specific to one platform only, for which
> we want to have the full-page workaround. The HW team has also
> identified PCIe stalls in fragment mode that are absent in full-page mode.
> Their investigation points to a suspected microarchitectural
> issue in the PCIe root port. IMO, there seems to be no issue with
> page_pool itself.
> 
> Given that:
>   - Grace shows fragments are faster (your data)
>   - A second ARM64 platform shows no regression (my data)
>   - Only the affected platform shows a throughput drop
>   - The HW team suspects this is a platform-specific PCIe issue;
>     our experimental data also suggest that the drop in throughput
>     is platform specific.
> 
> I believe this remains a platform-specific workaround rather than
> a generic issue. Would a private flag still be acceptable for this
> case?
> 
> 
>>
>> There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
>> driver hack. Are you going to re-implement this change with rx-buf-len
>> instead of a private flag? If so, I won't spend more time running this
>> test.
>>
> I can go either way depending on what Jakub prefers.
> Hi Jakub,
> with this new data from David, is it convincing enough for a mana driver
> specific private flag, which can be set from user space by a udev rule
> by detecting the underlying platform? If not then I will send the next
> version with the other rxbuflen approach.
>>>
>>>
>>> Regards
>>> Dipayaan Roy
> 
> 
> Thanks and Regards
> Dipayaan Roy


* [PATCH v2] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Stanislav Kinsburskii @ 2026-04-27 14:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Restore interrupt state before breaking out of the loop on error.

The irq_flags are saved before entering the loop, but the early exit
path on error fails to restore them. This leaves interrupts in an
inconsistent state and can lead to lockdep warnings or other
interrupt-related issues.
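
A toy userspace model of the save/restore pairing, as a sketch only.
model_irq_save() and model_irq_restore() are hypothetical stand-ins for
local_irq_save() and local_irq_restore(), and run_batches() simplifies
the real loop structure; the point is that every exit from the loop body
restores the saved state first.

```c
#include <assert.h>

/* 1 while "interrupts" are disabled in this model, 0 otherwise. */
static int irqs_disabled_model;

static void model_irq_save(int *flags)
{
	*flags = irqs_disabled_model;
	irqs_disabled_model = 1;
}

static void model_irq_restore(int flags)
{
	irqs_disabled_model = flags;
}

/* Run `batches` iterations; iteration fail_at (0-based) hits the error. */
static int run_batches(int batches, int fail_at)
{
	int ret = 0;

	assert(batches >= 0);
	for (int i = 0; i < batches; i++) {
		int flags;

		model_irq_save(&flags);
		if (i == fail_at)
			ret = -1;
		if (ret) {
			model_irq_restore(flags);	/* the step the fix adds */
			break;
		}
		/* ... the hypercall would be issued here ... */
		model_irq_restore(flags);
	}
	return ret;
}
```

Without the restore in the error branch, irqs_disabled_model would stay
at 1 after a failing run, which is the imbalance the patch removes.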

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index ab210a7fcb8c3..61291ec6f3468 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -229,8 +229,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 			} else {
 				pfnlist[i] = mmio_spa + done + i;
 			}
-		if (ret)
+		if (ret) {
+			local_irq_restore(irq_flags);
 			break;
+		}
 
 		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
 					     input_page, NULL);




* [PATCH v2] mshv: Fix large page unmap count in error path
From: Stanislav Kinsburskii @ 2026-04-27 14:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

When hv_do_map_pfns() fails after partially mapping large pages, the
unmap count passed to hv_call_unmap_pfns() is incorrect. The 'done'
variable tracks the number of large pages mapped, but the unmap
function expects the count in 4KB page units.

This causes incomplete cleanup on error, potentially leaving stale
mappings in the partition. Shift the count by large_shift to convert
from large page count to 4KB page count before calling the unmap
function.
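
The unit conversion can be illustrated with a small userspace sketch.
large_to_4k_count() is a hypothetical helper, and the shift value of 9
used in the note below assumes 2MB large pages on a 4KB base page size;
that value is an assumption for illustration and is not stated in this
patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a count of large pages into a count of 4KB pages.
 * large_shift is log2(large page size / 4KB).
 */
static uint64_t large_to_4k_count(uint64_t large_pages, unsigned int large_shift)
{
	assert(large_shift < 64);
	return large_pages << large_shift;
}
```

Under that assumption, 3 mapped large pages correspond to 3 << 9 = 1536
pages of 4KB, so passing 3 to an interface that counts 4KB pages would
cover only a small part of what was actually mapped.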

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 7f91096f95a8e..ab210a7fcb8c3 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -255,8 +255,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
-		if (flags & HV_MAP_GPA_LARGE_PAGE)
+		if (flags & HV_MAP_GPA_LARGE_PAGE) {
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+			done <<= large_shift;
+		}
 		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
 	}
 




* [RFC PATCH net-next] net: mana: Force single RX buffer per page under SWIOTLB bounce modes
From: Dipayaan Roy @ 2026-04-27 14:41 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

The MANA driver has two distinct DMA paths for RX buffers:

1. Without PP_FLAG_DMA_MAP: The driver maps full pages manually,
   creating page-aligned mappings where the DMA offset is always zero.

2. With PP_FLAG_DMA_MAP: page_pool uses sub-page fragments, where
   multiple RX buffers share a single page. The pool maps the whole
   page once, and subsequent allocations use offsets into that region.

Path 2 is problematic in two scenarios where DMA must go through
SWIOTLB bounce buffers:

- Confidential VMs (AMD SEV-SNP, Intel TDX): guest memory is encrypted
  and the NIC cannot access it directly due to lack of TDISP support.
  All DMA must use SWIOTLB bounce buffers.

- Force-bounce mode (swiotlb=force): all DMA is routed through bounce
  buffers regardless of whether the system is a CVM.

In both cases, sub-page RX buffer fragments allocated via page_pool may
not be compatible with bounce buffering in this mode, leading to failures
in the RX path.

Add a check using is_swiotlb_force_bounce() in
mana_use_single_rxbuf_per_page() to detect when force-bounce is active
for the device and force single RX buffer per page allocation.

Note: This issue likely affects any NIC driver using page_pool with
sub-page fragment allocation under SWIOTLB. A more general fix at
the page_pool level may be desirable. Seeking feedback on the
preferred approach.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 2d44eaf932a8..841421baf0de 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -12,6 +12,7 @@
 #include <linux/pci.h>
 #include <linux/export.h>
 #include <linux/skbuff.h>
+#include <linux/swiotlb.h>
 #include <linux/cc_platform.h>
 
 #include <net/checksum.h>
@@ -748,10 +749,15 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 static bool
 mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
 {
+	struct gdma_context *gc = apc->ac->gdma_dev->gdma_context;
+
 	/* On confidential VMs with guest memory encryption, all DMA goes
 	 * through SWIOTLB bounce buffers. Sub-page RX fragments may not
 	 * be properly bounce-buffered, so use fullpage buffers.
 	 */
+	if (is_swiotlb_force_bounce(gc->dev))
+		return true;
+
 	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
 		return true;
 
-- 
2.43.0



* [PATCH net-next] net: mana: Force single RX buffer per page for CVM/encrypted guest memory
From: Dipayaan Roy @ 2026-04-27 13:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

On Confidential VMs (CVMs) such as AMD SEV-SNP or Intel TDX, the guest
operating system's memory is encrypted. Current hardware lacks
support for TDISP (TEE Device Interface Security Protocol), meaning the
NIC cannot directly access this encrypted VM memory. Consequently, all
DMA operations must utilize SWIOTLB bounce buffers.

In the MANA driver currently, there are two distinct paths for DMA
mapping:

1. Without PP_FLAG_DMA_MAP: The driver manually maps full pages for each
packet. This creates standalone, page-aligned mappings where the offset
is always zero.

2. With PP_FLAG_DMA_MAP: Optimizes memory by using page_pool with
sub-page fragments (e.g., multiple RX buffers sharing a single page).
When PP_FLAG_DMA_MAP is enabled, page_pool maps the entire page once.
Subsequent RX buffer allocations use offsets into this pre-mapped area.

When page_pool allocates sub-page RX buffer fragments, the bounce buffer
granularity may not align with these smaller fragment sizes, leading to
failures in the MANA driver RX path.

Refactor the RX buffer decision into mana_use_single_rxbuf_per_page().
When cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) is true, the driver is
forced to use a single RX buffer per page. This ensures:
- Each RX buffer is exactly one PAGE_SIZE.
- The DMA offset is always 0.
- SWIOTLB maps full, page-aligned blocks.

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 21 +++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a654b3699c4c..2d44eaf932a8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -12,6 +12,7 @@
 #include <linux/pci.h>
 #include <linux/export.h>
 #include <linux/skbuff.h>
+#include <linux/cc_platform.h>
 
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
@@ -744,6 +745,23 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On confidential VMs with guest memory encryption, all DMA goes
+	 * through SWIOTLB bounce buffers. Sub-page RX fragments may not
+	 * be properly bounce-buffered, so use fullpage buffers.
+	 */
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +772,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
-- 
2.43.0



* [PATCH net-next v2 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260427132807.1642290-1-gargaditya@linux.microsoft.com>

The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.

Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.
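
To give a feel for how these arrays scale, here is a userspace sketch of
the size calculation for a pointer array indexed by queue and depth.
rxbufs_array_bytes() is a hypothetical helper, and the queue count and
depth used in the note below are illustrative assumptions, not values
taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes needed for an array of num_queues * queue_depth pointers.
 * Returns 0 if the multiplication would overflow size_t.
 */
static size_t rxbufs_array_bytes(size_t num_queues, size_t queue_depth)
{
	size_t n;

	assert(num_queues > 0 && queue_depth > 0);
	if (queue_depth > SIZE_MAX / num_queues)
		return 0;
	n = num_queues * queue_depth;
	if (n > SIZE_MAX / sizeof(void *))
		return 0;
	return n * sizeof(void *);
}
```

With 8-byte pointers, an assumed 64 queues at an assumed depth of 8192
gives 64 * 8192 * 8 bytes = 4 MiB in one array, a size for which a
physically contiguous allocation becomes hard to guarantee on a
fragmented system.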

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 8adf72b96145..e1d8ac3417e8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
 		put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
 	}
 
-	kfree(mpc->das_pre);
+	kvfree(mpc->das_pre);
 	mpc->das_pre = NULL;
 
 out2:
-	kfree(mpc->rxbufs_pre);
+	kvfree(mpc->rxbufs_pre);
 	mpc->rxbufs_pre = NULL;
 
 out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
 	num_rxb = num_queues * mpc->rx_queue_size;
 
 	WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
-	mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+	mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
 	if (!mpc->rxbufs_pre)
 		goto error;
 
-	mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+	mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
 	if (!mpc->das_pre)
 		goto error;
 
@@ -2564,7 +2564,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
-	kfree(rxq);
+	kvfree(rxq);
 }
 
 static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2704,7 +2704,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	gc = gd->gdma_context;
 
-	rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+	rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
 	if (!rxq)
 		return NULL;
 
-- 
2.43.0



* [PATCH net-next v2 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260427132807.1642290-1-gargaditya@linux.microsoft.com>

Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.

Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.
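
The array-of-pointers layout can be sketched in userspace as follows.
This is not the driver code: struct demo_qp is a hypothetical stand-in
sized at roughly 35KB, and calloc()/free() stand in for the kernel
allocators. It shows one small allocation for the pointer array plus one
allocation per queue, with cleanup that tolerates partially filled
arrays.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_qp {		/* hypothetical stand-in for a ~35KB queue struct */
	char payload[35 * 1024];
};

/* Free every allocated element, then the pointer array itself. */
static void demo_destroy_qps(struct demo_qp **qps, size_t n)
{
	if (!qps)
		return;
	for (size_t i = 0; i < n; i++)
		free(qps[i]);	/* free(NULL) is a no-op */
	free(qps);
}

/* Allocate n queue structures individually behind an array of pointers. */
static struct demo_qp **demo_create_qps(size_t n)
{
	struct demo_qp **qps;

	assert(n > 0);
	qps = calloc(n, sizeof(*qps));	/* n pointers only, all NULL */
	if (!qps)
		return NULL;

	for (size_t i = 0; i < n; i++) {
		qps[i] = calloc(1, sizeof(*qps[i]));
		if (!qps[i]) {
			demo_destroy_qps(qps, n);	/* unwind partial work */
			return NULL;
		}
	}
	return qps;
}
```

Because the pointer array starts out zeroed, the destroy path can walk
all n slots after a mid-loop failure, which mirrors the NULL check added
to the teardown loop in the diff below.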

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
 		count++;
 	}
 
-	tx_stats = &apc->tx_qp[q_idx].txq.stats;
+	tx_stats = &apc->tx_qp[q_idx]->txq.stats;
 
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a654b3699c4c..8adf72b96145 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (skb_cow_head(skb, MANA_HEADROOM))
 		goto tx_drop_count;
 
-	txq = &apc->tx_qp[txq_idx].txq;
+	txq = &apc->tx_qp[txq_idx]->txq;
 	gdma_sq = txq->gdma_sq;
-	cq = &apc->tx_qp[txq_idx].tx_cq;
+	cq = &apc->tx_qp[txq_idx]->tx_cq;
 	tx_stats = &txq->stats;
 
 	BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2321,21 +2321,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		return;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
-		apc->tx_qp[i].mana_tx_debugfs = NULL;
+		if (!apc->tx_qp[i])
+			continue;
+
+		debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+		apc->tx_qp[i]->mana_tx_debugfs = NULL;
 
-		napi = &apc->tx_qp[i].tx_cq.napi;
-		if (apc->tx_qp[i].txq.napi_initialized) {
+		napi = &apc->tx_qp[i]->tx_cq.napi;
+		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
 			netif_napi_del_locked(napi);
-			apc->tx_qp[i].txq.napi_initialized = false;
+			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
-		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
 
-		mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
 
-		mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+		mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+		kvfree(apc->tx_qp[i]);
 	}
 
 	kfree(apc->tx_qp);
@@ -2344,7 +2349,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 
 static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
 {
-	struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+	struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
 	char qnum[32];
 
 	sprintf(qnum, "TX-%d", idx);
@@ -2383,7 +2388,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 	int err;
 	int i;
 
-	apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+	apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
 	if (!apc->tx_qp)
 		return -ENOMEM;
 
@@ -2403,10 +2408,16 @@ static int mana_create_txq(struct mana_port_context *apc,
 	gc = gd->gdma_context;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+		apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+		if (!apc->tx_qp[i]) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
 
 		/* Create SQ */
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 
 		u64_stats_init(&txq->stats.syncp);
 		txq->ndev = net;
@@ -2424,7 +2435,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 			goto out;
 
 		/* Create SQ's CQ */
-		cq = &apc->tx_qp[i].tx_cq;
+		cq = &apc->tx_qp[i]->tx_cq;
 		cq->type = MANA_CQ_TYPE_TX;
 
 		cq->txq = txq;
@@ -2453,7 +2464,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
-					 &apc->tx_qp[i].tx_object);
+					 &apc->tx_qp[i]->tx_object);
 
 		if (err)
 			goto out;
@@ -3288,7 +3299,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 */
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		tsleep = 1000;
 		while (atomic_read(&txq->pending_sends) > 0 &&
 		       time_before(jiffies, timeout)) {
@@ -3307,7 +3318,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	}
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		while ((skb = skb_dequeue(&txq->pending_skbs))) {
 			mana_unmap_skb(skb, apc);
 			dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..04350973e19e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -260,7 +260,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..aa90a858c8e3 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -507,7 +507,7 @@ struct mana_port_context {
 	bool tx_shortform_allowed;
 	u16 tx_vp_offset;
 
-	struct mana_tx_qp *tx_qp;
+	struct mana_tx_qp **tx_qp;
 
 	/* Indirection Table for RX & TX. The values are queue indexes */
 	u32 *indir_table;
-- 
2.43.0



* [PATCH net-next v2 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-27 13:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya

The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.

Allocation sizes that are problematic:

  mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
    16 queues (default): 35528 * 16 =  ~555 KB contiguous
    64 queues (max):     35528 * 64 = ~2220 KB contiguous

  mana_create_rxq -> rxq struct with flex array
  (sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
    depth 1024 (default): 35712 + 296 * 1024 =  ~331 KB per queue
    depth 8192 (max):     35712 + 296 * 8192 = ~2403 KB per queue

  mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
    16 queues, depth 1024 (default): 16 * 1024 * 8 =  128 KB each
    64 queues, depth 8192 (max):     64 * 8192 * 8 = 4096 KB each
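
The figures above can be reproduced with a small userspace sketch. This is
only an illustration: the constants are copied from the numbers quoted in
this cover letter rather than taken from the kernel headers, and the helper
names (tx_qp_flat_bytes, rxq_bytes, prealloc_bytes) are hypothetical.

```c
#include <stdint.h>

/* Sizes as quoted above; assumed, not derived from the kernel headers. */
#define TX_QP_SIZE   35528u  /* sizeof(struct mana_tx_qp) */
#define RXQ_SIZE     35712u  /* sizeof(struct mana_rxq), excluding the flex array */
#define RX_OOB_SIZE  296u    /* one rx_oobs entry */
#define SLOT_SIZE    8u      /* one rxbufs_pre or das_pre entry */

/* Flat tx_qp array: a single contiguous block covering every queue. */
static inline uint64_t tx_qp_flat_bytes(unsigned int num_queues)
{
	return (uint64_t)TX_QP_SIZE * num_queues;
}

/* One rxq: fixed struct plus one rx_oobs entry per ring slot. */
static inline uint64_t rxq_bytes(unsigned int depth)
{
	return (uint64_t)RXQ_SIZE + (uint64_t)RX_OOB_SIZE * depth;
}

/* rxbufs_pre or das_pre: one 8-byte slot per buffer across all queues. */
static inline uint64_t prealloc_bytes(unsigned int num_queues,
				      unsigned int depth)
{
	return (uint64_t)num_queues * depth * SLOT_SIZE;
}
```

For example, tx_qp_flat_bytes(64) evaluates to 2273792 bytes for the flat
array, compared with 35528 bytes for a single per-queue allocation.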

This series addresses the issue by:
  1. Converting the tx_qp flat array into an array of pointers with
     per-queue kvzalloc (~35 KB each), replacing a single contiguous
     allocation that can reach ~2.2 MB at 64 queues.
  2. Switching rxbufs_pre, das_pre, and rxq allocations to
     kvmalloc/kvzalloc so the allocator can fall back to vmalloc
     when contiguous memory is unavailable.
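
The shape of item 1 can be sketched in userspace as follows. This is an
analogue under stated assumptions, not the driver code: calloc() and free()
stand in for the kernel allocators, and struct demo_qp, struct demo_port,
demo_create_qps() and demo_destroy_qps() are hypothetical names. The point
it shows is that each queue gets its own small allocation and that the
teardown path tolerates a partially populated pointer array.

```c
#include <stdlib.h>

/* Hypothetical stand-in, sized like the mana_tx_qp figure quoted above. */
struct demo_qp {
	char payload[35528];
};

struct demo_port {
	unsigned int num_queues;
	struct demo_qp **qp;	/* array of pointers, one per queue */
};

/* Safe on a partially built array: free(NULL) is a no-op. */
static inline void demo_destroy_qps(struct demo_port *p)
{
	unsigned int i;

	if (!p->qp)
		return;

	for (i = 0; i < p->num_queues; i++)
		free(p->qp[i]);

	free(p->qp);
	p->qp = NULL;
}

/* Returns 0 on success, -1 if any allocation fails. */
static inline int demo_create_qps(struct demo_port *p)
{
	unsigned int i;

	/* Small allocation: only num_queues pointers, zero-initialised. */
	p->qp = calloc(p->num_queues, sizeof(*p->qp));
	if (!p->qp)
		return -1;

	for (i = 0; i < p->num_queues; i++) {
		/* One modest allocation per queue instead of one large block. */
		p->qp[i] = calloc(1, sizeof(*p->qp[i]));
		if (!p->qp[i]) {
			demo_destroy_qps(p);
			return -1;
		}
	}

	return 0;
}
```

Zero-initialising the pointer array is what lets the teardown loop run over
every slot without tracking how far creation got.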

Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, GBits/sec):

                 Physically contiguous         vmalloc region
  Connections      TX          RX              TX          RX
  --------------------------------------------------------------
  1                47.2        46.9            46.8        46.6
  16               181         181             181         181
  32               181         181             181         181
  64               181         181             181         181

---
Changes in v2:
  - Rebased onto v7.1-rc1 (was v7.0-rc7)

Aditya Garg (2):
  net: mana: Use per-queue allocation for tx_qp to reduce allocation
    size
  net: mana: Use kvmalloc for large RX queue and buffer allocations

 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 39 insertions(+), 28 deletions(-)

-- 
2.43.0



* Re: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Hamza Mahfooz @ 2026-04-27 11:51 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Himadri Pandya, Michael Kelley, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel@lists.freedesktop.org, stable@kernel.vger.org
In-Reply-To: <KUZP153MB14445757C6A5DA5DEDA9A09CBE292@KUZP153MB1444.APCP153.PROD.OUTLOOK.COM>

On Sun, Apr 26, 2026 at 05:00:24AM +0000, Saurabh Singh Sengar wrote:
> > Subject: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
> > 
> > VMBUS ring buffers must be page aligned. So, use VMBUS_RING_SIZE() to
> > ensure they are always aligned and large enough to hold all of the relevant
> > data.
> > 
> > Cc: stable@kernel.vger.org
> > Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video
> > device")
> > Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > ---
> >  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > index 051ecc526832..753d97bff76f 100644
> > --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> > @@ -10,7 +10,7 @@
> > 
> >  #include "hyperv_drm.h"
> > 
> > -#define VMBUS_RING_BUFSIZE (256 * 1024)
> > +#define VMBUS_RING_BUFSIZE VMBUS_RING_SIZE(256 * 1024)
> >  #define VMBUS_VSP_TIMEOUT (10 * HZ)
> > 
> >  #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
> > --
> > 2.54.0
> 
> Although this LGTM, it may change the behaviour on ARM64 systems with page size > 4K.
> Have we tested it?

Yup, I tested it on an ARM64 Windows machine with a 64K page size guest kernel.

> 
> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>

Pushed to drm-misc.

> 


