Linux-HyperV List
* Re: [PATCH V1 03/13] x86/hyperv: add insufficient memory support in irqdomain.c
From: Anirudh Rayabharam @ 2026-04-24 14:55 UTC
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-4-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:29PM -0700, Mukesh R wrote:
> Intermittent insufficient memory hypercall failure have been observed in
> the current map device interrupt hypercall. In case of such a failure,
> we must deposit more memory and redo the hypercall. Add support for
> that. Deposit memory needs partition id, make that a parameter to the
> map interrupt function.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c | 38 +++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dc..229f986e08ea 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -13,8 +13,9 @@
>  #include <linux/irqchip/irq-msi-lib.h>
>  #include <asm/mshyperv.h>
>  
> -static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
> -		int cpu, int vector, struct hv_interrupt_entry *ret_entry)
> +static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
> +				  bool level, int cpu, int vector,
> +				  struct hv_interrupt_entry *ret_entry)
>  {
>  	struct hv_input_map_device_interrupt *input;
>  	struct hv_output_map_device_interrupt *output;
> @@ -30,8 +31,10 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
>  
>  	intr_desc = &input->interrupt_descriptor;
>  	memset(input, 0, sizeof(*input));
> -	input->partition_id = hv_current_partition_id;
> +
> +	input->partition_id = ptid;
>  	input->device_id = hv_devid.as_uint64;
> +
>  	intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
>  	intr_desc->vector_count = 1;
>  	intr_desc->target.vector = vector;
> @@ -64,6 +67,28 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
>  
>  	local_irq_restore(flags);
>  
> +	return status;
> +}
> +
> +static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
> +			    int cpu, int vector,
> +			    struct hv_interrupt_entry *ret_entry)
> +{
> +	u64 status;
> +	int rc, deposit_pgs = 16;		/* don't loop forever */
> +
> +	while (deposit_pgs--) {
> +		status = hv_map_interrupt_hcall(ptid, device_id, level, cpu,
> +						vector, ret_entry);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
> +			break;
> +
> +		rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);

This code should use the hv_result_needs_memory() and hv_deposit_memory()
helpers instead.
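
To illustrate the shape I have in mind, here is a minimal userspace
sketch of a bounded deposit-and-retry loop. The signatures of
hv_result_needs_memory() and hv_deposit_memory() are not shown in this
thread, so every sketch_* name, the status values, and the simulated
hypervisor below are hypothetical stand-ins, not the real helpers.

```c
#include <stdint.h>

/* Hypothetical status codes for this sketch; not the real Hyper-V values. */
#define SKETCH_STATUS_SUCCESS		0u
#define SKETCH_STATUS_NEEDS_MEMORY	1u

#define SKETCH_MAX_DEPOSITS		16	/* don't loop forever */

/* Simulated hypervisor state: pages still missing before the call works. */
static unsigned int sketch_pages_short = 3;

/* Stand-in for the map-interrupt hypercall. */
static uint64_t sketch_map_hcall(void)
{
	return sketch_pages_short ? SKETCH_STATUS_NEEDS_MEMORY
				  : SKETCH_STATUS_SUCCESS;
}

/* Stand-in for a "does this status ask for more memory?" predicate. */
static int sketch_needs_memory(uint64_t status)
{
	return status == SKETCH_STATUS_NEEDS_MEMORY;
}

/* Stand-in for a deposit helper; deposits one page and always succeeds. */
static int sketch_deposit(void)
{
	if (sketch_pages_short)
		sketch_pages_short--;
	return 0;
}

/*
 * Issue the call, then deposit memory and retry while the status asks
 * for it. Returns 0 on success and -1 if the call fails for another
 * reason, a deposit fails, or the retry budget is exhausted.
 */
int sketch_map_with_retry(void)
{
	uint64_t status;
	int deposits = 0;

	for (;;) {
		status = sketch_map_hcall();
		if (!sketch_needs_memory(status))
			return status == SKETCH_STATUS_SUCCESS ? 0 : -1;

		if (deposits++ >= SKETCH_MAX_DEPOSITS)
			return -1;	/* retry budget exhausted */

		if (sketch_deposit())
			return -1;	/* the deposit itself failed */
	}
}
```

The retry budget mirrors the "don't loop forever" bound of 16 in the
patch. The point is only that the "needs memory" test and the deposit
sit behind helpers rather than being open-coded in the loop.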

Thanks,
Anirudh



* Re: [PATCH] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Anirudh Rayabharam @ 2026-04-24 14:35 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177681692062.637858.4160821495321404639.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Wed, Apr 22, 2026 at 12:15:28AM +0000, Stanislav Kinsburskii wrote:
> Restore interrupt state before breaking out of the loop on error.
> 
> The irq_flags are saved before entering the loop, but the early exit
> path on error fails to restore them. This leaves interrupts in an
> inconsistent state and can lead to lockdep warnings or other
> interrupt-related issues.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index 7ed623668c8ec..6381f949d9d91 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -237,8 +237,10 @@ static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,

Umm... I don't see this function in the hyperv-next tree at all.

Anirudh.

>  			} else {
>  				pfnlist[i] = mmio_spa + done + i;
>  			}
> -		if (ret)
> +		if (ret) {
> +			local_irq_restore(irq_flags);
>  			break;
> +		}
>  
>  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
>  					     input_page, NULL);
> 
> 


* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Dipayaan Roy @ 2026-04-24 12:21 UTC
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260424061702.1442618-1-shradhagupta@linux.microsoft.com>

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated are capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vcpus are so
> few that the softIRQ handling overhead on two IRQs on the same vcpu is
> much more than their overheads if they were spread across sibling vcpus
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
We do not need to call rcu_read_lock() here, as the caller of this
function already holds cpus_read_lock().
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
len is unsigned here, so the <= 0 comparison does not make sense. Please
change len to int or, better, use if (!len).
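
As a rough userspace sketch of the loop shape only (sketch_assign_linear()
and its parameters are hypothetical, and the counter merely stands in
for the affinity call): with an unsigned len, "len <= 0" can only ever
mean "len == 0", so the exit test reads more honestly as !len.

```c
/*
 * Hand out up to len IRQs, one per CPU, and return how many were
 * assigned. len is unsigned, so it can never be negative; the only
 * "nothing left" state is len == 0.
 */
unsigned int sketch_assign_linear(unsigned int len, unsigned int ncpus)
{
	unsigned int cpu, assigned = 0;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!len)
			break;

		assigned++;	/* stands in for setting the IRQ affinity */
		len--;
	}

	return assigned;
}
```

Because the decrement happens only after the !len check, len never wraps
around below zero in this shape.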
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
nit: typo: should be num_online_cpus
> +		 */
> +		WARN_ON(nvec > num_online_cpus());
> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1
> 
Regards
Dipayaan Roy


* [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-24  6:17 UTC
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In the mana driver, the number of IRQs allocated is capped at
min(num_cpu + 1, queue count). In cases where the IRQ count is greater
than the vcpu count, we want to utilize all the vcpus, irrespective of
their NUMA/core bindings.

This is important, especially in environments where the number of vcpus
is so small that the softIRQ handling overhead of two IRQs on the same
vcpu is much higher than it would be if they were spread across sibling
vcpus.

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU.

In such cases, when many parallel TCP connections are tested, the
throughput drops significantly.

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch
20480		15.65		7.73
10240		15.63		8.93
8192		15.64		9.69
6144		15.64		13.16
4096		15.69		15.75
2048		15.69		15.83
1024		15.71		15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..433c044d53c6 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+static int irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		if (len <= 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * first CPU sibling group since they are already affinitized to HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
+	if (gc->num_msix_usable <= num_online_cpus()) {
 		skip_first_cpu = true;
+		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
+	} else {
+		/*
+		 * In case our IRQs are more than num_online_cpus, we try to
+		 * make sure we are using all vcpus. In such a case NUMA or
+		 * CPU core affinity does not matter.
+		 * Note that in this case the total mana IRQ should always be
+		 * num_online_cpu + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * So, the nvec value in this path should always be equal to
+		 * num_online_cpu
+		 */
+		WARN_ON(nvec > num_online_cpus());
+		err = irq_setup_linear(irqs, nvec);
+	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1



* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-24  3:28 UTC
  To: Andrew Lunn
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <7c4dbe89-9b51-45d6-ae89-39d4183e66b1@lunn.ch>

On Thu, Apr 23, 2026 at 09:44:04PM +0200, Andrew Lunn wrote:
> On Thu, Apr 23, 2026 at 12:14:16PM -0700, Dipayaan Roy wrote:
> > On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > > > The root cause is in mana_gd_init_vf_regs(), which computes:
> > > > 
> > > >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > > > 
> > > > without validating the offset read from hardware. If the register
> > > > returns a garbage value that is neither within bar 0 bounds nor aligned
> > > > to the 4-byte granularity, thus causing the alignment fault.
> > > 
> > > Is GDMA_REG_SHM_OFFSET special?
> > Hi Andrew,
> > GDMA_REG_SHM_OFFSET is not special. It was simply the only register
> > read that had no validation at all. The other two registers
> > (GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
> > in place.
> 
> I must be missing something:
> 
> grep page_size *
> 
> gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
> gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
> gdma_main.c:	void __iomem *addr = gc->db_page_base + gc->db_page_size * db_index;
> 

Hi Andrew,
There are two upstream commits that cover these checks. I think you may
have missed them; please take a look:

commit fb4b4a05aeeb8b0f253c5ddce21f4635dadc9550
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Date:   Wed Mar 25 11:04:17 2026 -0700
 
    net: mana: Use at least SZ_4K in doorbell ID range check

commit 89fe91c65992a37863241e35aec151210efc53ce
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Date:   Fri Mar 6 13:12:06 2026 -0800
 
    net: mana: hardening: Validate doorbell ID from GDMA_REGISTER_DEVICE response

> So if GDMA_REG_DB_PAGE_SIZE returns garbage, it is at least masked,
> but it is still a random number.
> 
> mana_gd_ring_doorbell() takes this random number, multiples by
> db_index, adds, gc->db_page_base and then does:
> 
> writeq(e.as_uint64, addr);
> 
> So you write to a random address. 
> 
> I don't see any sanity checks here. Cannot you check that db_page_size
> is at least one of the expected page sizes?
As mentioned above, the checks are already present in commit
89fe91c65992a37863241e35aec151210efc53ce.
> 
>    Andrew

Regards
Dipayaan Roy


* RE: [PATCH] tools/hv: fix parse_ip_val_buffer out-of-bounds write
From: Michael Kelley @ 2026-04-23 20:28 UTC
  To: unknownbbqrx, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <c9871f25-9d7e-423d-954b-4080d2484cd8@smtp-relay.sendinblue.com>

From: unknownbbqrx <dev@unknownbbqr.xyz> Sent: Thursday, April 23, 2026 11:07 AM
> 
> 
> parse_ip_val_buffer() validates the parsed token length against out_len,
> but several callers passed MAX_IP_ADDR_SIZE * 2 while the destination
> buffers are much smaller stack arrays (e.g. INET6_ADDRSTRLEN).
> 
> This can lead to out-of-bounds writes via strcpy() when a long token is
> parsed from host-provided IP/subnet strings.
> 
> Use size_t for out_len, switch to bounded copy with memcpy() + explicit
> NUL termination, and pass the actual destination buffer sizes at all
> call sites.
> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>

Linux kernel patches must be signed off by a real person's name,
not an unknown alias. In the kernel source code tree, see
Documentation/process/submitting-patches.rst and specifically
the section entitled "Sign your work - the Developer's Certificate
of Origin".  It specifies that the signoff must be done by "a
known identity (sorry, no anonymous contributions)".

Michael

> ---
>  tools/hv/hv_kvp_daemon.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index c02f8a341..ecf123bce 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1188,10 +1188,11 @@ static int is_ipv4(char *addr)
>  }
> 
>  static int parse_ip_val_buffer(char *in_buf, int *offset,
> -				char *out_buf, int out_len)
> +				char *out_buf, size_t out_len)
>  {
>  	char *x;
>  	char *start;
> +	size_t copy_len;
> 
>  	/*
>  	 * in_buf has sequence of characters that are separated by
> @@ -1214,8 +1215,10 @@ static int parse_ip_val_buffer(char *in_buf, int *offset,
>  		while (start[i] == ' ')
>  			i++;
> 
> -		if ((x - start) <= out_len) {
> -			strcpy(out_buf, (start + i));
> +		copy_len = x - (start + i);
> +		if (copy_len < out_len) {
> +			memcpy(out_buf, start + i, copy_len);
> +			out_buf[copy_len] = '\0';
>  			*offset += (x - start) + 1;
>  			return 1;
>  		}
> @@ -1249,7 +1252,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type)
>  	memset(addr, 0, sizeof(addr));
> 
>  	while (parse_ip_val_buffer(ip_string, &offset, addr,
> -					(MAX_IP_ADDR_SIZE * 2))) {
> +					sizeof(addr))) {
> 
>  		sub_str[0] = 0;
>  		if (is_ipv4(addr)) {
> @@ -1374,7 +1377,7 @@ static int process_dns_gateway_nm(FILE *f, char *ip_string, int type,
>  		memset(addr, 0, sizeof(addr));
> 
>  		if (!parse_ip_val_buffer(ip_string, &ip_offset, addr,
> -					 (MAX_IP_ADDR_SIZE * 2)))
> +					 sizeof(addr)))
>  			break;
> 
>  		ip_ver = ip_version_check(addr);
> @@ -1426,12 +1429,11 @@ static int process_ip_string_nm(FILE *f, char *ip_string, char *subnet,
>  	memset(subnet_addr, 0, sizeof(subnet_addr));
> 
>  	while (parse_ip_val_buffer(ip_string, &ip_offset, addr,
> -				   (MAX_IP_ADDR_SIZE * 2)) &&
> +				   sizeof(addr)) &&
>  				   parse_ip_val_buffer(subnet,
> -						       &subnet_offset,
> -						       subnet_addr,
> -						       (MAX_IP_ADDR_SIZE *
> -							2))) {
> +					       &subnet_offset,
> +					       subnet_addr,
> +					       sizeof(subnet_addr))) {
>  		ip_ver = ip_version_check(addr);
>  		if (ip_ver < 0)
>  			continue;
> 
> base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
> prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a
> prerequisite-patch-id: df28525061dd528875c7c75880b4684d80f4aa7d
> prerequisite-patch-id: 64c48c6f2222781631304d9d4d7d1c712c002610
> prerequisite-patch-id: 9be258692732026bf560ed9887adbd02a8887263
> --
> 2.53.0
> 
> 
> 



* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Andrew Lunn @ 2026-04-23 19:44 UTC
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <aepviNMszMBtiB/H@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Thu, Apr 23, 2026 at 12:14:16PM -0700, Dipayaan Roy wrote:
> On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > > The root cause is in mana_gd_init_vf_regs(), which computes:
> > > 
> > >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > > 
> > > without validating the offset read from hardware. If the register
> > > returns a garbage value that is neither within bar 0 bounds nor aligned
> > > to the 4-byte granularity, thus causing the alignment fault.
> > 
> > Is GDMA_REG_SHM_OFFSET special?
> Hi Andrew,
> GDMA_REG_SHM_OFFSET is not special. It was simply the only register
> read that had no validation at all. The other two registers
> (GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
> in place.

I must be missing something:

grep page_size *

gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
gdma_main.c:	void __iomem *addr = gc->db_page_base + gc->db_page_size * db_index;

So if GDMA_REG_DB_PAGE_SIZE returns garbage, it is at least masked,
but it is still a random number.

mana_gd_ring_doorbell() takes this random number, multiplies it by
db_index, adds gc->db_page_base, and then does:

writeq(e.as_uint64, addr);

So you write to a random address. 

I don't see any sanity checks here. Cannot you check that db_page_size
is at least one of the expected page sizes?

   Andrew


* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-23 19:14 UTC
  To: Andrew Lunn
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <edccaafd-73f3-421d-a48e-a6cb704d39e6@lunn.ch>

On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > The root cause is in mana_gd_init_vf_regs(), which computes:
> > 
> >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > 
> > without validating the offset read from hardware. If the register
> > returns a garbage value that is neither within bar 0 bounds nor aligned
> > to the 4-byte granularity, thus causing the alignment fault.
> 
> Is GDMA_REG_SHM_OFFSET special?
Hi Andrew,
GDMA_REG_SHM_OFFSET is not special. It was simply the only register
read that had no validation at all. The other two registers
(GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
in place. Also, shm_off becomes gc->shm_base (bar0_va + shm_off), and
gc->shm_base is dereferenced via readl() (ldr w1, [x20]) in
mana_smc_poll_register(), which is why it requires 4-byte alignment on
arm64 device memory. Otherwise, a misaligned shm_off propagates directly
into a misaligned shm_base, causing an alignment fault (FSC=0x21).
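
To make the intent concrete, here is a minimal userspace sketch of the
kind of check being discussed. It is not the code from the patch:
sketch_shm_off_valid() and its parameters are hypothetical, and it
assumes that bar0_va itself is at least 4-byte aligned, so that only the
offset needs checking.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept an offset only if a 4-byte access at that offset stays inside
 * the BAR and the offset is 4-byte aligned.
 */
bool sketch_shm_off_valid(uint64_t shm_off, uint64_t bar0_size)
{
	/* The BAR must be able to hold at least one 32-bit register. */
	if (bar0_size < sizeof(uint32_t))
		return false;

	/* Written this way to avoid overflow in shm_off + 4. */
	if (shm_off > bar0_size - sizeof(uint32_t))
		return false;

	/* readl() on device memory needs 4-byte alignment. */
	return (shm_off & (sizeof(uint32_t) - 1)) == 0;
}
```

A garbage offset fails one of the two tests, at which point the caller
could return -EPROTO, as is done for the doorbell registers.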
>
> What if GDMA_REG_DB_PAGE_SIZE or GDMA_REG_DB_PAGE_OFFSET have returned
> garbage? Are you going to die a horrible death as well?
Those two already have validation in the current code:

- GDMA_REG_DB_PAGE_SIZE is checked for < SZ_4K (returns -EPROTO)
- GDMA_REG_DB_PAGE_OFFSET is checked for >= bar0_size (returns -EPROTO)

The same checks exist for the PF equivalents (GDMA_PF_REG_DB_PAGE_SIZE
and GDMA_PF_REG_DB_PAGE_OFF) as well.
> 
> Isn't there a way you can poll the firmware to ask it if it is ready?
Unfortunately no, as there is no separate readiness register to
poll.

The existing recovery flow already waits MANA_SERVICE_PERIOD (10
seconds) after suspend before attempting resume. If the registers are
still invalid after that, the -EPROTO triggers a PCI remove/rescan,
which re-probes the device.
> 
> And what about the PF case. Can GDMA_PF_REG_SHM_OFF also be garbage?
Yes. This patch also adds bounds and alignment validation for the PF path:
both GDMA_SRIOV_REG_CFG_BASE_OFF and the SHM offset read via
(sriov_base_off + GDMA_PF_REG_SHM_OFF) are validated before use.
> 
>       Andrew

Regards
Dipayaan Roy


* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 18:16 UTC
  To: Wei Liu
  Cc: easwar.hariharan, Linux on Hyper-V List, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <20260423181440.GA1196957@liuwe-devbox-debian-v2.local>

On 4/23/2026 11:14 AM, Wei Liu wrote:
> On Thu, Apr 23, 2026 at 10:32:58AM -0700, Easwar Hariharan wrote:
>> On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
>>> On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
>>>> From: Wei Liu <wei.liu@kernel.org>
>>>>
>>>> That was missed when importing the header.
>>>>
>>>> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
>>>> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
>>>> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
>>>> Cc: stable@kernel.org
>>>> Signed-off-by: Wei Liu <wei.liu@kernel.org>
>>>> ---
>>>>  include/hyperv/hvhdk.h | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>>>> index 5e83d3714966..ff7ca9ee1bd4 100644
>>>> --- a/include/hyperv/hvhdk.h
>>>> +++ b/include/hyperv/hvhdk.h
>>>> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>>>>  
>>>>  		u64 registers[18];
>>>>  	};
>>>> +	__u8 reserved[8];
>>>>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>>>>  	union {
>>>>  		struct {
>>>
>>>
>>> This is not a uapi, so why not just use u8 instead of __u8?
>>> Or since it's 8 u8s, a u64?
>>>
>>> Thanks,
>>> Easwar (he/him)
>>
>> Hm, occurs to me that this would be used by VMMs, but then the registers
>> field just above used a u64 instead of a __u64....
> 
> I fat-fingered u8 to __u8.  User space code has scripts to massage the
> types as needed.
> 
> To remain consistent with the existing code, it should be u8.
> 
> I can change the type when I commit this.
> 
> Wei
Thanks, with that fixed:

Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>


* Re: [PATCH] mshv: add a missing padding field
From: Wei Liu @ 2026-04-23 18:14 UTC
  To: Easwar Hariharan
  Cc: wei.liu, Linux on Hyper-V List, Doru Blânzeanu, Magnus Kulke,
	stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
	Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <19a904f4-e26f-4951-85ac-aae537da89cb@linux.microsoft.com>

On Thu, Apr 23, 2026 at 10:32:58AM -0700, Easwar Hariharan wrote:
> On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
> > On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
> >> From: Wei Liu <wei.liu@kernel.org>
> >>
> >> That was missed when importing the header.
> >>
> >> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
> >> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
> >> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
> >> Cc: stable@kernel.org
> >> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> >> ---
> >>  include/hyperv/hvhdk.h | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> >> index 5e83d3714966..ff7ca9ee1bd4 100644
> >> --- a/include/hyperv/hvhdk.h
> >> +++ b/include/hyperv/hvhdk.h
> >> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
> >>  
> >>  		u64 registers[18];
> >>  	};
> >> +	__u8 reserved[8];
> >>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
> >>  	union {
> >>  		struct {
> > 
> > 
> > This is not a uapi, so why not just use u8 instead of __u8?
> > Or since it's 8 u8s, a u64?
> > 
> > Thanks,
> > Easwar (he/him)
> 
> Hm, occurs to me that this would be used by VMMs, but then the registers
> field just above used a u64 instead of a __u64....

I fat-fingered u8 to __u8.  User space code has scripts to massage the
types as needed.

To remain consistent with the existing code, it should be u8.

I can change the type when I commit this.

Wei

> 
> 


* Re: [PATCH net v2] hv_sock: Return -EIO for malformed/short packets
From: patchwork-bot+netdevbpf @ 2026-04-23 18:10 UTC
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, sgarzare, davem, edumazet, kuba,
	pabeni, horms, niuxuewei.nxw, linux-hyperv, virtualization,
	netdev, linux-kernel, stable
In-Reply-To: <20260423064811.1371749-1-decui@microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 22 Apr 2026 23:48:11 -0700 you wrote:
> Commit f63152958994 fixes a regression, however it fails to report an
> error for malformed/short packets -- normally we should never see such
> packets, but let's report an error for them just in case.
> 
> Fixes: f63152958994 ("hv_sock: Report EOF instead of -EIO for FIN")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> [...]

Here is the summary with links:
  - [net,v2] hv_sock: Return -EIO for malformed/short packets
    https://git.kernel.org/netdev/net/c/3d1f20727a63

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* [PATCH] tools/hv: fix parse_ip_val_buffer out-of-bounds write
From: unknownbbqrx @ 2026-04-23 18:06 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel, unknownbbqrx


parse_ip_val_buffer() validates the parsed token length against out_len,
but several callers passed MAX_IP_ADDR_SIZE * 2 while the destination
buffers are much smaller stack arrays (e.g. INET6_ADDRSTRLEN).

This can lead to out-of-bounds writes via strcpy() when a long token is
parsed from host-provided IP/subnet strings.

Use size_t for out_len, switch to bounded copy with memcpy() + explicit
NUL termination, and pass the actual destination buffer sizes at all
call sites.

Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>
---
 tools/hv/hv_kvp_daemon.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index c02f8a341..ecf123bce 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -1188,10 +1188,11 @@ static int is_ipv4(char *addr)
 }
 
 static int parse_ip_val_buffer(char *in_buf, int *offset,
-				char *out_buf, int out_len)
+				char *out_buf, size_t out_len)
 {
 	char *x;
 	char *start;
+	size_t copy_len;
 
 	/*
 	 * in_buf has sequence of characters that are separated by
@@ -1214,8 +1215,10 @@ static int parse_ip_val_buffer(char *in_buf, int *offset,
 		while (start[i] == ' ')
 			i++;
 
-		if ((x - start) <= out_len) {
-			strcpy(out_buf, (start + i));
+		copy_len = x - (start + i);
+		if (copy_len < out_len) {
+			memcpy(out_buf, start + i, copy_len);
+			out_buf[copy_len] = '\0';
 			*offset += (x - start) + 1;
 			return 1;
 		}
@@ -1249,7 +1252,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type)
 	memset(addr, 0, sizeof(addr));
 
 	while (parse_ip_val_buffer(ip_string, &offset, addr,
-					(MAX_IP_ADDR_SIZE * 2))) {
+					sizeof(addr))) {
 
 		sub_str[0] = 0;
 		if (is_ipv4(addr)) {
@@ -1374,7 +1377,7 @@ static int process_dns_gateway_nm(FILE *f, char *ip_string, int type,
 		memset(addr, 0, sizeof(addr));
 
 		if (!parse_ip_val_buffer(ip_string, &ip_offset, addr,
-					 (MAX_IP_ADDR_SIZE * 2)))
+					 sizeof(addr)))
 			break;
 
 		ip_ver = ip_version_check(addr);
@@ -1426,12 +1429,11 @@ static int process_ip_string_nm(FILE *f, char *ip_string, char *subnet,
 	memset(subnet_addr, 0, sizeof(subnet_addr));
 
 	while (parse_ip_val_buffer(ip_string, &ip_offset, addr,
-				   (MAX_IP_ADDR_SIZE * 2)) &&
+				   sizeof(addr)) &&
 				   parse_ip_val_buffer(subnet,
-						       &subnet_offset,
-						       subnet_addr,
-						       (MAX_IP_ADDR_SIZE *
-							2))) {
+					       &subnet_offset,
+					       subnet_addr,
+					       sizeof(subnet_addr))) {
 		ip_ver = ip_version_check(addr);
 		if (ip_ver < 0)
 			continue;

base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a
prerequisite-patch-id: df28525061dd528875c7c75880b4684d80f4aa7d
prerequisite-patch-id: 64c48c6f2222781631304d9d4d7d1c712c002610
prerequisite-patch-id: 9be258692732026bf560ed9887adbd02a8887263
-- 
2.53.0
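
The bounded-copy pattern described in the commit message can be sketched in isolation as below. This is only an illustration, not the daemon's code: copy_token_bounded() and its parameters are hypothetical names, and the token is assumed to be delimited already, so only the length check and the explicit NUL termination are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of hv_kvp_daemon.c): copy the token
 * [tok, tok + tok_len) into out_buf, whose capacity is out_len bytes,
 * and NUL-terminate it. Returns 1 on success, 0 if it does not fit.
 */
static int copy_token_bounded(const char *tok, size_t tok_len,
			      char *out_buf, size_t out_len)
{
	/* Room is needed for tok_len bytes plus the terminating NUL. */
	if (out_len == 0 || tok_len >= out_len)
		return 0;

	memcpy(out_buf, tok, tok_len);
	out_buf[tok_len] = '\0';
	return 1;
}
```

The caller passes sizeof() of the real destination array as out_len, so the check is tied to the buffer that is actually written rather than to an unrelated constant.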




^ permalink raw reply related

* RE: [PATCH] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-04-23 17:40 UTC (permalink / raw)
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, johansen@templeofstupid.com
  Cc: stable@vger.kernel.org
In-Reply-To: <20260416183529.838321-1-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Thursday, April 16, 2026 11:35 AM
> 
> If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the

This problem has wider scope than just kdump. Any kexec'ed kernel would see
the same problem, though kdump is probably the most common case. But the
discussion here, and the mention of kdump in the code comments, should be
adjusted accordingly. 

> framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.

You describe an MMIO "conflict" without giving the details. Is that
intentional to keep the commit message from being too long? It might be
helpful to future readers to say a little more about how PCI devices must not
use MMIO space that the hypervisor has assigned to the frame buffer.

> This is especially an issue if pci-hyperv is built-in and hyperv_drm is
> built as a module. Consequently, the kdump kernel fails to detect PCI
> devices via pci-hyperv, and may fail to mount the root file system,
> which may reside in a NVMe disk.

It might not just be pci-hyperv that conflicts. The recently submitted
dxgkrnl driver also does vmbus_allocate_mmio(), but I haven't looked
at the details of exactly what it is doing.

> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump kernel, fall
> back to the low MMIO base, which should be equal to the framebuffer
> MMIO base (Tested on x64 Windows Server 2016, and on x64 and ARM64 Windows
> Server 2025 and on Azure) [2]. In the first kernel, screen.lfb_base
> is not 0; if the user specifies a high resolution, it's not enough to
> only reserve 8MB: in this case, reserve half of the space below 4GB, but
> cap the reservation to 128MB, which is the required framebuffer size of
> the highest resolution 7680*4320 supported by Hyper-V.

As you noted in the detailed discussion in the other email thread [2],
there's a Gen1 VM case that this patch doesn't fix. For completeness,
perhaps that case should be called out in this commit message.

> 
> Add the cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) check, because a CoCo
> VM (i.e. Confidential VM) on Hyper-V doesn't have any framebuffer
> device, so there is no need to reserve any MMIO for it.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF).
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f0d0803d1e16..a0b34f9e426a 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -37,6 +37,7 @@
>  #include <linux/dma-map-ops.h>
>  #include <linux/pci.h>
>  #include <linux/export.h>
> +#include <linux/cc_platform.h>
>  #include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
>  #include "hyperv_vmbus.h"
> @@ -2327,8 +2328,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
>  		return AE_NO_MEMORY;
> 
>  	/* If this range overlaps the virtual TPM, truncate it. */
> -	if (end > VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> -		end = VTPM_BASE_ADDRESS;
> +	if (end >= VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> +		end = VTPM_BASE_ADDRESS - 1;
> 
>  	new_res->name = "hyperv mmio";
>  	new_res->flags = IORESOURCE_MEM;
> @@ -2395,13 +2396,36 @@ static void vmbus_mmio_remove(void)
>  static void __maybe_unused vmbus_reserve_fb(void)
>  {
>  	resource_size_t start = 0, size;
> +	resource_size_t low_mmio_base;
>  	struct pci_dev *pdev;
> 
> +	/* Hyper-V CoCo guests do not have a framebuffer device. */
> +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> +		return;

This test is testing feature "A" (mem encryption) in order to determine
the presence of feature "B" (no framebuffer), because current
configurations happen to always have "A" and "B" at the same time. But
the linkage between the features is tenuous, and if configurations should
change in the future, testing this way could be bogus. It works now, but I'm
leery of depending on the linkage between "A" and "B".

You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
if running in a CVM, and test that flag here. But I'd suggest just dropping
this optimization. CVMs are always Gen2 (and that's not going to change),
so they have plenty of low mmio space. And at the moment, CVMs don't
support PCI devices, so can't encounter a conflict (though conceivably
some new flavor of CVM in the future could support PCI devices).

> +
>  	if (efi_enabled(EFI_BOOT)) {
>  		/* Gen2 VM: get FB base from EFI framebuffer */
>  		if (IS_ENABLED(CONFIG_SYSFB)) {
>  			start = sysfb_primary_display.screen.lfb_base;
>  			size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
> +
> +			low_mmio_base = hyperv_mmio->start;
> +			if (!low_mmio_base || low_mmio_base >= SZ_4G ||
> +			    (start && start < low_mmio_base)) {
> +				pr_warn("Unexpected low mmio base 0x%pa\n", &low_mmio_base);
> +			} else {
> +				/*
> +				 * If the kdump kernel's lfb_base is 0,

As mentioned earlier, this case isn't just kdump kernels.

> +				 * fall back to the low mmio base.
> +				 */
> +				if (!start)
> +					start = low_mmio_base;
> +				/*
> +				 * Reserve half of the space below 4GB for high
> +				 * resolutions, but cap the reservation to 128MB.
> +				 */
> +				size = min((SZ_4G - start) / 2, SZ_128M);
> +			}
>  		}
>  	} else {
>  		/* Gen1 VM: get FB base from PCI */
> @@ -2433,6 +2457,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
>  	 */
>  	for (; !fb_mmio && (size >= 0x100000); size >>= 1)
>  		fb_mmio = __request_region(hyperv_mmio, start, size, fb_mmio_name, 0);

Just above this "for" loop, "start" is tested for 0. This patch eliminates the main
reason start might be 0. But I guess it's still possible that the legacy PCI device BAR
might return 0 for a Gen1 VM? Or you might get 0 if the pr_warn() about low
mmio base is triggered. But I'm thinking maybe a pr_warn() should be done if 
start is zero.

> +
> +	pr_info("hv_mmio=%pR,%pR fb=%pR\n", hyperv_mmio, hyperv_mmio->sibling, fb_mmio);

Outputting the above info is nice!

Michael
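
For readers following the size computation in the quoted hunk, here is a minimal standalone sketch of the Gen2 path, assuming the same fallback and cap as the patch. fb_reservation() and the SKETCH_* constants are hypothetical names. Returning 0 for an implausible base is a choice made for this sketch only (the patch warns and keeps the previously computed size), and the lfb_base >= 4GB guard is an extra check the patch does not have.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SZ_4G	(1ULL << 32)
#define SKETCH_SZ_128M	(128ULL << 20)

/*
 * Hypothetical helper (not in vmbus_drv.c): choose the start and size of
 * the Gen2 framebuffer reservation from lfb_base and the low MMIO base.
 * Returns the size in bytes, or 0 if the inputs look implausible.
 */
static uint64_t fb_reservation(uint64_t lfb_base, uint64_t low_mmio_base,
			       uint64_t *start)
{
	uint64_t half;

	/* Same sanity test as the patch, plus a guard against lfb_base >= 4GB. */
	if (!low_mmio_base || low_mmio_base >= SKETCH_SZ_4G ||
	    lfb_base >= SKETCH_SZ_4G ||
	    (lfb_base && lfb_base < low_mmio_base))
		return 0;

	/* lfb_base == 0 (e.g. after kexec): fall back to the low MMIO base. */
	*start = lfb_base ? lfb_base : low_mmio_base;

	/* Half of the space between start and 4GB, capped at 128MB. */
	half = (SKETCH_SZ_4G - *start) / 2;
	return half < SKETCH_SZ_128M ? half : SKETCH_SZ_128M;
}
```

With low_mmio_base = 0xc0000000 and lfb_base = 0, half of the remaining 1GB exceeds the cap, so 128MB is reserved at 0xc0000000. With both at 0xf8000000, only 64MB is reserved.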

^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-23 17:40 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Wednesday, April 15, 2026 8:31 AM
> 
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Wednesday, April 8, 2026 6:54 AM

[snip]

> 
> Another example is: for a Gen2 VM with the below commands:
>    Set-VM -LowMemoryMappedIoSpace 1GB \
>           -VMName decui-u2204-gen2-fb
>    // i.e. the default setting on Azure. Let's ignore CVMs here.

FWIW, I'm seeing that in Gen2 VMs in Azure, the low_mmio_size
is 3 GiB. I'm looking at a D16ds_v5, and a D16lds_v6. The v5 VM
is newly created, while the v6 has been around for a few months.
In a CVM, the low_mmio_size should be 1 GiB. This overall example
is still correct -- it's just the comment that I have doubts about. Or
maybe you are looking at a different VM size that has a different
default?

Some years back, I had gotten into a discussion with Azure about
this size because the swiotlb memory wants to be allocated below
the 4 GiB line, and reserving 3 GiB for low mmio limited the size
of the swiotlb. CVMs were changed to have only 1 GiB for low
mmio because they need a larger swiotlb.


>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> we have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     excess_fb_size = 4MB
>     low_mmio_base = 4GB - 128MB - 4MB * 2
>                   = 4GB - 136 MB = 0xf7800000
>     but 4GB - target_low_mmio_size = 4GB - 1GB, which is
>     smaller than low_mmio_base, so low_mmio_base and
>     fb_mmio_base are both set to 4GB - 1GB = 0xc0000000,
>     and low_mmio_size = 1GB.
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> ************************************
> 
> On an ARM64 lab host, I also tested Gen2 VMs (there is no Gen1 VM
> for ARM VMs):
> 
> By default:
>   low_mmio_base = 4GB - 512MB, i.e. 0xe0000000
>   low_mmio_size = 512MB
>   fb_mmio_base = low_mmio_base
>   The default framebuffer size is 3MB
>   (i.e. screen.lfb_size = 3MB) but hyperv_drm:
>   mmio_megabytes = 8 MB, which supports up to 1920 * 1080.
> 
> With the below commands:
>    Set-VM -LowMemoryMappedIoSpace 512MB \
>           -VMName decui-u2204-gen2-fb
>    // the command only accepts a value between 512MB and 3.5GB.
>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> I thought we would have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     excess_fb_size = 4MB
>     low_mmio_base = 4GB - 512MB - 4MB * 2
>                   = 4GB - 520MB
>     fb_mmio_base = low_mmio_base
>     low_mmio_size = 4GB - low_mmio_base = 520MB
> 
>     Since 4GB - target_low_mmio_size = 4GB - 512MB, which is
>     smaller than low_mmio_base, so low_mmio_base and
>     fb_mmio_base would be both set to 4GB - 520MB, and
>     low_mmio_size would be 520MB.
> 
>     However, the actual result is:
>     max_fb_size is indeed 68MB.
>     but fb_mmio_base = low_mmio_base = 4GB - 512MB, and
>     low_mmio_size = 512MB, i.e. the 'excess_fb_size' is not
>     considered on ARM64!
> 
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> With the below command:
>    Set-VM -LowMemoryMappedIoSpace 3GB \
>           -VMName decui-u2204-gen2-fb
>    // i.e. the default setting on Azure. Unlike x86-64, an ARM64
>    // VM on Azure has 3GB of mmio below 4GB.

See my previous comment on the same topic. I think arm64
and x86/x64 are the same.

>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> we have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     low_mmio_base = 4GB - 3GB = 1GB = 0x40000000
>     low_mmio_size = 3GB
>     fb_mmio_base = low_mmio_base = 1GB
> 
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> ************************************
> 
> To recap, I think the bottom line is:
> 
> a) For Gen2 VMs, we can safely reserve a mmio range starting at
>    sysfb_primary_display.screen.lfb_base with a size of
>    min(low_mmio_size/2, 128MB).
> 
>    If sysfb_primary_display.screen.lfb_base is 0, i.e. in the case
>    of kdump kernel, we use low_mmio_base instead.
>    This should fix the mmio conflict in the kdump kernel.
> 
> b) For Gen1 VMs, let's still only reserve a mmio range starting at
>    4GB - 128MB with a size of 64MB, because when we are in
>    vmbus_reserve_fb(), we still don't know the exact size of the
>    max_fb_size, and we don't want to reserve too much as we would
>    want to reserve some low mmio space for PCI devices with 32-bit
>    BARs (if any).
> 
>    If the user runs Set-VMVideo and needs a framebuffer size
>    bigger than 64MB (IMO this is not a typical scenario in
>    practice), we have to use high mmio for hyperv_drm in the first
>    kernel, and the kdump kernel still suffers from the mmio
>    conflict between hyperv_drm and hv_pci. We encourage Gen1 VM
>    users to upgrade to Gen2 VMs to resolve the issue. Anyway, the
>    mmio conflict is inevitable for Gen1 VMs, if the max required
>    framebuffer size is bigger than 108MB (Note:
>    VTPM_BASE_ADDRESS - (4GB - 128MB) = 109.25MB, and the required
>    framebuffer size is always rounded up to 2MB).

Question about Gen 1 VMs: If the Linux frame buffer driver moves
the frame buffer somewhere other than the default location, and
then the VM does a kexec/kdump, what does the legacy PCI graphic
device BAR report as the frame buffer location? Does it *always*
report 4G-128MB, or does it report the new location? I can run
an experiment to find out, but maybe you've already done so and
not reported that detail here.

Michael
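
The max_fb_size arithmetic quoted above can be checked with a small sketch. It assumes 4 bytes per pixel and rounding up to a 2MB multiple, as in the quoted formula round_up_to_2MB(width * height * 4). max_fb_size() and SKETCH_SZ_2M are hypothetical names for illustration, not host or kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SZ_2M	(2ULL << 20)

/*
 * Hypothetical helper: framebuffer bytes needed for a given resolution,
 * assuming 4 bytes per pixel, rounded up to the next multiple of 2MB.
 */
static uint64_t max_fb_size(uint32_t width, uint32_t height)
{
	uint64_t bytes = (uint64_t)width * height * 4;

	/* Round up to a 2MB boundary (2MB is a power of two). */
	return (bytes + SKETCH_SZ_2M - 1) & ~(SKETCH_SZ_2M - 1);
}
```

For 4834 x 3622 this gives 68MB, and for 7680 x 4320 it gives 128MB, matching the figures in the thread. 1920 x 1080 fits in 8MB.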

^ permalink raw reply

* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 17:32 UTC (permalink / raw)
  To: wei.liu
  Cc: easwar.hariharan, Linux on Hyper-V List, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <614f1e17-2dba-4529-b067-e1434b74cad8@linux.microsoft.com>

On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
> On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
>> From: Wei Liu <wei.liu@kernel.org>
>>
>> That was missed when importing the header.
>>
>> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
>> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
>> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
>> Cc: stable@kernel.org
>> Signed-off-by: Wei Liu <wei.liu@kernel.org>
>> ---
>>  include/hyperv/hvhdk.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 5e83d3714966..ff7ca9ee1bd4 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>>  
>>  		u64 registers[18];
>>  	};
>> +	__u8 reserved[8];
>>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>>  	union {
>>  		struct {
> 
> 
> This is not a uapi, so why not just use u8 instead of __u8?
> Or since it's 8 u8s, a u64?
> 
> Thanks,
> Easwar (he/him)

Hm, occurs to me that this would be used by VMMs, but then the registers
field just above used a u64 instead of a __u64....



^ permalink raw reply

* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 17:29 UTC (permalink / raw)
  To: wei.liu
  Cc: Linux on Hyper-V List, easwar.hariharan, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <20260423172625.1189669-2-wei.liu@kernel.org>

On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
> From: Wei Liu <wei.liu@kernel.org>
> 
> That was missed when importing the header.
> 
> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
> Cc: stable@kernel.org
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>  include/hyperv/hvhdk.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 5e83d3714966..ff7ca9ee1bd4 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>  
>  		u64 registers[18];
>  	};
> +	__u8 reserved[8];
>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>  	union {
>  		struct {


This is not a uapi, so why not just use u8 instead of __u8?
Or since it's 8 u8s, a u64?

Thanks,
Easwar (he/him)
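
As a layout illustration for the u8[8] versus u64 question, the sketch below compares two hypothetical structs. They are not the real struct hv_vp_register_page; they only reproduce the shape of an 18-entry u64 array followed by 8 bytes of padding and one further 64-bit field. The names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real hvhdk.h definitions. */
struct page_u8_pad {
	uint64_t registers[18];
	uint8_t  reserved[8];	/* padding expressed as eight bytes */
	uint64_t next;
};

struct page_u64_pad {
	uint64_t registers[18];
	uint64_t reserved;	/* padding expressed as one 64-bit word */
	uint64_t next;
};

/* Offset of the field that follows the padding, for each variant. */
static size_t next_off_u8_pad(void)
{
	return offsetof(struct page_u8_pad, next);
}

static size_t next_off_u64_pad(void)
{
	return offsetof(struct page_u64_pad, next);
}
```

Because the padding starts at an 8-byte-aligned offset (18 * 8 = 144), both variants place the following field at offset 152, so in this shape the choice is about style and consistency rather than layout.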

^ permalink raw reply

* [PATCH] mshv: add a missing padding field
From: wei.liu @ 2026-04-23 17:26 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: Wei Liu, Doru Blânzeanu, Magnus Kulke, stable,
	K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
	Nuno Das Neves, Roman Kisel, Michael Kelley, Easwar Hariharan,
	open list

From: Wei Liu <wei.liu@kernel.org>

That was missed when importing the header.

Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
Cc: stable@kernel.org
Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 include/hyperv/hvhdk.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 5e83d3714966..ff7ca9ee1bd4 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -79,6 +79,7 @@ struct hv_vp_register_page {
 
 		u64 registers[18];
 	};
+	__u8 reserved[8];
 	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
 	union {
 		struct {
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Andrew Lunn @ 2026-04-23 16:37 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <aepF3NwyANeklkfD@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

> The root cause is in mana_gd_init_vf_regs(), which computes:
> 
>   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> 
> without validating the offset read from hardware. If the register
> returns a garbage value that is not within BAR0 bounds or is not
> 4-byte aligned, the readl() on the resulting address causes the
> alignment fault.

Is GDMA_REG_SHM_OFFSET special?

What if GDMA_REG_DB_PAGE_SIZE or GDMA_REG_DB_PAGE_OFFSET have returned
garbage? Are you going to die a horrible death as well?

Isn't there a way you can poll the firmware to ask it if it is ready?

And what about the PF case. Can GDMA_PF_REG_SHM_OFF also be garbage?

      Andrew
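
A generic form of the check being discussed, applicable to any offset read from a BAR register, could look like the sketch below. bar_offset_valid() is a hypothetical helper, not a function in the MANA driver. Unlike the posted patch, it also takes the access length and is written so that the bounds comparison cannot wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if an access of 'len' bytes at
 * offset 'off' lies entirely inside a BAR of 'bar_size' bytes and
 * 'off' is 4-byte aligned.
 */
static bool bar_offset_valid(uint64_t off, uint64_t len, uint64_t bar_size)
{
	if (len == 0 || len > bar_size)
		return false;

	/* Written as a subtraction so that off + len cannot overflow. */
	if (off > bar_size - len)
		return false;

	return (off % sizeof(uint32_t)) == 0;
}
```

The same predicate could be applied to each offset register read during init, which is one way to address the question of whether only the SHM offset needs validating.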

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-23 16:33 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <aeoVC27mIzoKytqA@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Thu, 23 Apr 2026 05:48:11 -0700 Dipayaan Roy wrote:
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platform (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.

I wonder if it wouldn't be more expedient at this stage to just switch
to rx-buf-len rather than investigating in more detail. But we can wait
for more data if you prefer.

> Please let me know what David finds, and I can rework the patch
> accordingly.

Haven't heard back. I pinged him now.

^ permalink raw reply

* [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-23 16:16 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov

During Function Level Reset recovery, the MANA driver reads
hardware BAR0 registers that may temporarily contain garbage values.
The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
to compute gc->shm_base, which is later dereferenced via readl() in
mana_smc_poll_register(). If the hardware returns an unaligned or
out-of-range value, the driver must not blindly use it, as this would
propagate the hardware error into a kernel crash.

The following crash was observed on an arm64 Hyper-V guest running
kernel 6.17.0-3013-azure during VF reset recovery triggered by HWC
timeout.

[13291.785274] Unable to handle kernel paging request at virtual address ffff8000a200001b
[13291.785311] Mem abort info:
[13291.785332]   ESR = 0x0000000096000021
[13291.785343]   EC = 0x25: DABT (current EL), IL = 32 bits
[13291.785355]   SET = 0, FnV = 0
[13291.785363]   EA = 0, S1PTW = 0
[13291.785372]   FSC = 0x21: alignment fault
[13291.785382] Data abort info:
[13291.785391]   ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
[13291.785404]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[13291.785412]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[13291.785421] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000014df3a1000
[13291.785432] [ffff8000a200001b] pgd=1000000100438403, p4d=1000000100438403, pud=1000000100439403, pmd=0068000fc2000711
[13291.785703] Internal error: Oops: 0000000096000021 [#1]  SMP
[13291.830975] Modules linked in: tls qrtr mana_ib ib_uverbs ib_core xt_owner xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 8021q garp mrp stp llc binfmt_misc joydev serio_raw nls_iso8859_1 hid_generic aes_ce_blk aes_ce_cipher polyval_ce ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher hid_hyperv sm4 sm3_ce sha3_ce hv_netvsc hid vmgenid hyperv_keyboard hyperv_drm sch_fq_codel nvme_fabrics efi_pstore dm_multipath nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vmw_vmci vsock dmi_sysfs ip_tables x_tables autofs4
[13291.862630] CPU: 122 UID: 0 PID: 61796 Comm: kworker/122:2 Tainted: G        W           6.17.0-3013-azure #13-Ubuntu VOLUNTARY
[13291.869902] Tainted: [W]=WARN
[13291.871901] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 01/08/2026
[13291.878086] Workqueue: events mana_serv_func
[13291.880718] pstate: 62400005 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[13291.884835] pc : mana_smc_poll_register+0x48/0xb0
[13291.887902] lr : mana_smc_setup_hwc+0x70/0x1c0
[13291.890493] sp : ffff8000ab79bbb0
[13291.892364] x29: ffff8000ab79bbb0 x28: ffff00410c8b5900 x27: ffff00410d630680
[13291.896252] x26: ffff004171f9fd80 x25: 000000016ed55000 x24: 000000017f37e000
[13291.899990] x23: 0000000000000000 x22: 000000016ed55000 x21: 0000000000000000
[13291.904497] x20: ffff8000a200001b x19: 0000000000004e20 x18: ffff8000a6183050
[13291.908308] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000000a
[13291.912542] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000000
[13291.916298] x11: 0000000000000000 x10: 0000000000000001 x9 : ffffc45006af1bd8
[13291.920945] x8 : ffff000151129000 x7 : 0000000000000000 x6 : 0000000000000000
[13291.925293] x5 : 000000015f214000 x4 : 000000017217a000 x3 : 000000016ed50000
[13291.930436] x2 : 000000016ed55000 x1 : 0000000000000000 x0 : ffff8000a1ffffff
[13291.934342] Call trace:
[13291.935736]  mana_smc_poll_register+0x48/0xb0 (P)
[13291.938611]  mana_smc_setup_hwc+0x70/0x1c0
[13291.941113]  mana_hwc_create_channel+0x1a0/0x3a0
[13291.944283]  mana_gd_setup+0x16c/0x398
[13291.946584]  mana_gd_resume+0x24/0x70
[13291.948917]  mana_do_service+0x13c/0x1d0
[13291.951583]  mana_serv_func+0x34/0x68
[13291.953732]  process_one_work+0x168/0x3d0
[13291.956745]  worker_thread+0x2ac/0x480
[13291.959104]  kthread+0xf8/0x110
[13291.961026]  ret_from_fork+0x10/0x20
[13291.963560] Code: d2807d00 9417c551 71000673 54000220 (b9400281)
[13291.967299] ---[ end trace 0000000000000000 ]---

Disassembly of mana_smc_poll_register() around the crash site:

Disassembly of section .text:

00000000000047c8 <mana_smc_poll_register>:
    47c8: d503201f        nop
    47cc: d503201f        nop
    47d0: d503233f        paciasp
    47d4: f800865e        str     x30, [x18], #8
    47d8: a9bd7bfd        stp     x29, x30, [sp, #-48]!
    47dc: 910003fd        mov     x29, sp
    47e0: a90153f3        stp     x19, x20, [sp, #16]
    47e4: 91007014        add     x20, x0, #0x1c
    47e8: 5289c413        mov     w19, #0x4e20
    47ec: f90013f5        str     x21, [sp, #32]
    47f0: 12001c35        and     w21, w1, #0xff
    47f4: 14000008        b       4814 <mana_smc_poll_register+0x4c>
    47f8: 36f801e1        tbz     w1, #31, 4834 <mana_smc_poll_register+0x6c>
    47fc: 52800042        mov     w2, #0x2
    4800: d280fa01        mov     x1, #0x7d0
    4804: d2807d00        mov     x0, #0x3e8
    4808: 94000000        bl      0 <usleep_range_state>
    480c: 71000673        subs    w19, w19, #0x1
    4810: 54000200        b.eq    4850 <mana_smc_poll_register+0x88>
    4814: b9400281        ldr     w1, [x20] <-- **** CRASHED HERE *****
    4818: d50331bf        dmb     oshld
    481c: 2a0103e2        mov     w2, w1
    4820: ca020042        eor     x2, x2, x2
    4824: b5000002        cbnz    x2, 4824 <mana_smc_poll_register+0x5c>
    4828: 710002bf        cmp     w21, #0x0
    482c: 3a411820        ccmn    w1, #0x1, #0x0, ne
    4830: 54fffe41        b.ne    47f8 <mana_smc_poll_register+0x30>
    4834: f85f8e5e        ldr     x30, [x18, #-8]!
    4838: 52800000        mov     w0, #0x0
    483c: a94153f3        ldp     x19, x20, [sp, #16]
    4840: f94013f5        ldr     x21, [sp, #32]
    4844: f84307fd        ldr     x29, [sp], #48
    4848: d50323bf        autiasp
    484c: d65f03c0        ret
    4850: f85f8e5e        ldr     x30, [x18, #-8]!
    4854: 12800da0        mov     w0, #0xffffff92
    4858: a94153f3        ldp     x19, x20, [sp, #16]
    485c: f94013f5        ldr     x21, [sp, #32]
    4860: f84307fd        ldr     x29, [sp], #48
    4864: d50323bf        autiasp
    4868: d65f03c0        ret

From the crash signature x20 = ffff8000a200001b, this address
ends in 0x1b which is not 4-byte aligned, so the 'ldr w1, [x20]'
instruction (readl) triggers the arm64 alignment fault (FSC = 0x21).

The root cause is in mana_gd_init_vf_regs(), which computes:

  gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);

without validating the offset read from hardware. If the register
returns a garbage value that is not within BAR0 bounds or is not
4-byte aligned, the readl() on the resulting address causes the
alignment fault.

Harden the register validation in both mana_gd_init_vf_regs() and
mana_gd_init_pf_regs() by checking the SHM offset for bounds and
4-byte alignment before use. Return -EPROTO on invalid values, which
the existing recovery path in mana_serv_reset() already handles by
falling through to PCI device rescan, giving the hardware another
chance to present valid register values.

Fixes: 9bf66036d686 ("net: mana: Handle hardware recovery events when probing the device")
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 32 +++++++++++++++++--
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..75efbeebae0e 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -45,6 +45,7 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	void __iomem *sriov_base_va;
 	u64 sriov_base_off;
+	u64 sriov_shm_off;
 
 	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
 
@@ -73,10 +74,25 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
 	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
 
 	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
+	if (sriov_base_off >= gc->bar0_size ||
+	    !IS_ALIGNED(sriov_base_off, sizeof(u32))) {
+		dev_err(gc->dev,
+			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
+			sriov_base_off, (u64)gc->bar0_size);
+		return -EPROTO;
+	}
 
 	sriov_base_va = gc->bar0_va + sriov_base_off;
-	gc->shm_base = sriov_base_va +
-			mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
+
+	sriov_shm_off = mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
+	if (sriov_base_off + sriov_shm_off >= gc->bar0_size ||
+	    !IS_ALIGNED(sriov_shm_off, sizeof(u32))) {
+		dev_err(gc->dev,
+			"SRIOV SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
+			sriov_shm_off, (u64)gc->bar0_size);
+		return -EPROTO;
+	}
+	gc->shm_base = sriov_base_va + sriov_shm_off;
 
 	return 0;
 }
@@ -84,6 +100,7 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
 static int mana_gd_init_vf_regs(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
+	u64 shm_off;
 
 	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
 
@@ -111,7 +128,16 @@ static int mana_gd_init_vf_regs(struct pci_dev *pdev)
 	gc->db_page_base = gc->bar0_va + gc->db_page_off;
 	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
 
-	gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
+	shm_off = mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
+	if (shm_off >= gc->bar0_size ||
+	    !IS_ALIGNED(shm_off, sizeof(u32))) {
+		dev_err(gc->dev,
+			"SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
+			shm_off, (u64)gc->bar0_size);
+		return -EPROTO;
+	}
+
+	gc->shm_base = gc->bar0_va + shm_off;
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Marc Zyngier @ 2026-04-23 14:00 UTC (permalink / raw)
  To: Naman Jain
  Cc: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <20260423124206.2410879-8-namjain@linux.microsoft.com>

On Thu, 23 Apr 2026 13:41:57 +0100,
Naman Jain <namjain@linux.microsoft.com> wrote:
> 
> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
> driver on arm64. This function enables the transition between Virtual
> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
> 
> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> Reviewed-by: Roman Kisel <vdso@mailbox.org>
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/Makefile        |   1 +
>  arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>  arch/arm64/include/asm/mshyperv.h |  13 +++
>  arch/x86/include/asm/mshyperv.h   |   2 -
>  drivers/hv/mshv_vtl.h             |   3 +
>  include/asm-generic/mshyperv.h    |   2 +
>  6 files changed, 177 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/hyperv/hv_vtl.c
> 
> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
> index 87c31c001da9..9701a837a6e1 100644
> --- a/arch/arm64/hyperv/Makefile
> +++ b/arch/arm64/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y		:= hv_core.o mshyperv.o
> +obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
> new file mode 100644
> index 000000000000..59cbeb74e7b9
> --- /dev/null
> +++ b/arch/arm64/hyperv/hv_vtl.c
> @@ -0,0 +1,158 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2026, Microsoft, Inc.
> + *
> + * Authors:
> + *     Roman Kisel <romank@linux.microsoft.com>
> + *     Naman Jain <namjain@linux.microsoft.com>
> + */
> +
> +#include <asm/mshyperv.h>
> +#include <asm/neon.h>
> +#include <linux/export.h>
> +
> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
> +{
> +	struct user_fpsimd_state fpsimd_state;
> +	u64 base_ptr = (u64)vtl0->x;
> +
> +	/*
> +	 * Obtain the CPU FPSIMD registers for VTL context switch.
> +	 * This saves the current task's FP/NEON state and allows us to
> +	 * safely load VTL0's FP/NEON context for the hypercall.
> +	 */
> +	kernel_neon_begin(&fpsimd_state);
> +
> +	/*
> +	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
> +	 * We explicitly use the stack to save the base pointer, and use x16
> +	 * as our working register for accessing the context structure.
> +	 *
> +	 * Register Handling:
> +	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
> +	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
> +	 * - X19-X30: Saved/restored (part of VTL0's execution context)
> +	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
> +	 * - SP: Not in structure, hypervisor-managed per-VTL
> +	 *
> +	 * X29 (FP) and X30 (LR) are in the structure and must be saved/restored
> +	 * as part of VTL0's complete execution state.
> +	 */
> +	asm __volatile__ (
> +		/* Save base pointer to stack explicitly, then load into x16 */
> +		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
> +		"mov x16, %0\n\t"             /* Load base pointer into x16 */
> +		/* Volatile registers (Windows ARM64 ABI: x0-x17) */
> +		"ldp x0, x1, [x16]\n\t"
> +		"ldp x2, x3, [x16, #(2*8)]\n\t"
> +		"ldp x4, x5, [x16, #(4*8)]\n\t"
> +		"ldp x6, x7, [x16, #(6*8)]\n\t"
> +		"ldp x8, x9, [x16, #(8*8)]\n\t"
> +		"ldp x10, x11, [x16, #(10*8)]\n\t"
> +		"ldp x12, x13, [x16, #(12*8)]\n\t"
> +		"ldp x14, x15, [x16, #(14*8)]\n\t"
> +		/* x16 will be loaded last, after saving base pointer */
> +		"ldr x17, [x16, #(17*8)]\n\t"
> +		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */

Wut? Does it mean the kernel is not free to use x18?

> +		/* General-purpose registers: x19-x30 */
> +		"ldp x19, x20, [x16, #(19*8)]\n\t"
> +		"ldp x21, x22, [x16, #(21*8)]\n\t"
> +		"ldp x23, x24, [x16, #(23*8)]\n\t"
> +		"ldp x25, x26, [x16, #(25*8)]\n\t"
> +		"ldp x27, x28, [x16, #(27*8)]\n\t"
> +
> +		/* Frame pointer and link register */
> +		"ldp x29, x30, [x16, #(29*8)]\n\t"
> +
> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
> +		"ldp q0, q1, [x16, #(32*8)]\n\t"
> +		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
> +		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
> +		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
> +		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
> +		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
> +		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
> +		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
> +		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
> +		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
> +		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
> +		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
> +		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
> +		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
> +		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
> +		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
> +
> +		/* Now load x16 itself */
> +		"ldr x16, [x16, #(16*8)]\n\t"
> +
> +		/* Return to the lower VTL */
> +		"hvc #3\n\t"

No. Absolutely not. If you need to do context switching, do it in the
hypervisor. Entirely in the hypervisor. You don't even handle SVE, let
alone SME. How is that going to work?

And please use the SMCCC. Only that. Which mandates that the HVC
immediate is 0, 0 or zero.
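
For readers unfamiliar with the SMCCC: the operation is selected by a
32-bit function ID passed in w0, not by the HVC immediate. The sketch
below shows how such an ID is composed, following the field layout used
by ARM_SMCCC_CALL_VAL() in include/linux/arm-smccc.h. The function number
1 and the name vtl_return_func_id() are purely hypothetical; nothing in
this thread defines an SMCCC ID for the VTL return.

```c
#include <assert.h>
#include <stdint.h>

#define SMCCC_FAST_CALL		UINT32_C(1)	/* bit 31: fast call */
#define SMCCC_SMC_64		UINT32_C(1)	/* bit 30: 64-bit convention */
#define SMCCC_OWNER_VENDOR_HYP	UINT32_C(6)	/* bits 29:24: vendor hypervisor service */

/* Compose a function ID: type | convention | owner | function number. */
#define SMCCC_CALL_VAL(type, conv, owner, func)			\
	(((type) << 31) | ((conv) << 30) |			\
	 (((owner) & UINT32_C(0x3f)) << 24) |			\
	 ((func) & UINT32_C(0xffff)))

/* Hypothetical function number 1 in the vendor hypervisor range. */
static_assert(SMCCC_CALL_VAL(SMCCC_FAST_CALL, SMCCC_SMC_64,
			     SMCCC_OWNER_VENDOR_HYP, UINT32_C(1)) ==
	      UINT32_C(0xC6000001), "unexpected SMCCC function ID");

static inline uint32_t vtl_return_func_id(void)
{
	return SMCCC_CALL_VAL(SMCCC_FAST_CALL, SMCCC_SMC_64,
			      SMCCC_OWNER_VENDOR_HYP, UINT32_C(1));
}
```

In the kernel, a call with such an ID would normally go through the
existing helpers such as arm_smccc_1_1_hvc(), which issue the HVC with an
immediate of zero.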

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Mark Rutland @ 2026-04-23 13:56 UTC (permalink / raw)
  To: Naman Jain
  Cc: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley, Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi,
	Sascha Bischoff, mrigendrachaubey, linux-hyperv, linux-arm-kernel,
	linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <20260423124206.2410879-8-namjain@linux.microsoft.com>

On Thu, Apr 23, 2026 at 12:41:57PM +0000, Naman Jain wrote:
> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
> driver on arm64. This function enables the transition between Virtual
> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
> 
> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> Reviewed-by: Roman Kisel <vdso@mailbox.org>
> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/Makefile        |   1 +
>  arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>  arch/arm64/include/asm/mshyperv.h |  13 +++
>  arch/x86/include/asm/mshyperv.h   |   2 -
>  drivers/hv/mshv_vtl.h             |   3 +
>  include/asm-generic/mshyperv.h    |   2 +
>  6 files changed, 177 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/hyperv/hv_vtl.c
> 
> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
> index 87c31c001da9..9701a837a6e1 100644
> --- a/arch/arm64/hyperv/Makefile
> +++ b/arch/arm64/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y		:= hv_core.o mshyperv.o
> +obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
> new file mode 100644
> index 000000000000..59cbeb74e7b9
> --- /dev/null
> +++ b/arch/arm64/hyperv/hv_vtl.c
> @@ -0,0 +1,158 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2026, Microsoft, Inc.
> + *
> + * Authors:
> + *     Roman Kisel <romank@linux.microsoft.com>
> + *     Naman Jain <namjain@linux.microsoft.com>
> + */
> +
> +#include <asm/mshyperv.h>
> +#include <asm/neon.h>
> +#include <linux/export.h>
> +
> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
> +{
> +	struct user_fpsimd_state fpsimd_state;
> +	u64 base_ptr = (u64)vtl0->x;
> +
> +	/*
> +	 * Obtain the CPU FPSIMD registers for VTL context switch.
> +	 * This saves the current task's FP/NEON state and allows us to
> +	 * safely load VTL0's FP/NEON context for the hypercall.
> +	 */
> +	kernel_neon_begin(&fpsimd_state);
> +
> +	/*
> +	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
> +	 * We explicitly use the stack to save the base pointer, and use x16
> +	 * as our working register for accessing the context structure.
> +	 *
> +	 * Register Handling:
> +	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
> +	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
> +	 * - X19-X30: Saved/restored (part of VTL0's execution context)
> +	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
> +	 * - SP: Not in structure, hypervisor-managed per-VTL
> +	 *
> +	 * X29 (FP) and X30 (LR) are in the structure and must be saved/restored
> +	 * as part of VTL0's complete execution state.
> +	 */
> +	asm __volatile__ (
> +		/* Save base pointer to stack explicitly, then load into x16 */
> +		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
> +		"mov x16, %0\n\t"             /* Load base pointer into x16 */
> +		/* Volatile registers (Windows ARM64 ABI: x0-x17) */
> +		"ldp x0, x1, [x16]\n\t"
> +		"ldp x2, x3, [x16, #(2*8)]\n\t"
> +		"ldp x4, x5, [x16, #(4*8)]\n\t"
> +		"ldp x6, x7, [x16, #(6*8)]\n\t"
> +		"ldp x8, x9, [x16, #(8*8)]\n\t"
> +		"ldp x10, x11, [x16, #(10*8)]\n\t"
> +		"ldp x12, x13, [x16, #(12*8)]\n\t"
> +		"ldp x14, x15, [x16, #(14*8)]\n\t"
> +		/* x16 will be loaded last, after saving base pointer */
> +		"ldr x17, [x16, #(17*8)]\n\t"
> +		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
> +
> +		/* General-purpose registers: x19-x30 */
> +		"ldp x19, x20, [x16, #(19*8)]\n\t"
> +		"ldp x21, x22, [x16, #(21*8)]\n\t"
> +		"ldp x23, x24, [x16, #(23*8)]\n\t"
> +		"ldp x25, x26, [x16, #(25*8)]\n\t"
> +		"ldp x27, x28, [x16, #(27*8)]\n\t"
> +
> +		/* Frame pointer and link register */
> +		"ldp x29, x30, [x16, #(29*8)]\n\t"
> +
> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
> +		"ldp q0, q1, [x16, #(32*8)]\n\t"
> +		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
> +		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
> +		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
> +		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
> +		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
> +		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
> +		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
> +		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
> +		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
> +		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
> +		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
> +		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
> +		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
> +		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
> +		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
> +
> +		/* Now load x16 itself */
> +		"ldr x16, [x16, #(16*8)]\n\t"
> +
> +		/* Return to the lower VTL */
> +		"hvc #3\n\t"

NAK to this.

* This is a non-SMCCC hypercall, which we have NAK'd in general in the
  past for various reasons that I am not going to rehash here.

* It's not clear how this is going to be extended with necessary
  architecture state in future (e.g. SVE, SME). This is not
  future-proof, and I don't believe this is maintainable.

* This breaks general requirements for reliable stacktracing by
  clobbering state (e.g. x29) that we depend upon being valid AT ALL
  TIMES outside of entry code.

* IMO, if this needs to be saved/restored, that should happen in
  whatever you are calling.

Mark.

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-23 12:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260416083146.0bb94d2b@kernel.org>

On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> > I still see roughly a 5% overhead from the atomic refcount operation
> > itself, but on that platform there is no throughput drop when using
> > page fragments versus full-page mode.
> 
> That seems to contradict your claim that it's a problem with a specific
> platform.. Since we're in the merge window I asked David Wei to try to
> experiment with disabling page fragmentation on the ARM64 platforms we
> have at Meta. If it repros we should use the generic rx-buf-len
> ringparam because more NICs may want to implement this strategy.

Hi Jakub,

Thanks. I think I was not precise enough in my previous reply.

What I meant is that the atomic refcount cost itself does not appear to
be unique to the affected platform. I see a similar ~5% overhead on
another ARM64 platform (different vendor) as well. However, on that
platform there is no throughput delta between fragment mode and
full-page mode; both reach line rate.

On the affected platform, fragment mode shows an additional ~15%
throughput drop versus full-page mode. So the current data suggests that
the atomic overhead is common, but the throughput regression is not
explained by that overhead alone and likely depends on an additional
platform-specific factor.

Separately, the hardware team collected PCIe traces on the affected
platform and reported stalls in the fragment-mode case that are not seen
in full-page mode. They are still investigating the root cause, but
their current hypothesis is that this is related to that platform’s
PCIe/root-port microarchitecture rather than to page_pool refcounting
alone.

That said, I agree the right direction depends on whether this
reproduces on other ARM64 platforms. If David is able to reproduce the
same behavior, then using the generic rx-buf-len ringparam sounds like
the better direction.

Please let me know what David finds, and I can rework the patch
accordingly.


Regards
Dipayaan Roy

^ permalink raw reply

* [PATCH v2 15/15] Drivers: hv: Add ARM64 support for MSHV_VTL in Kconfig
From: Naman Jain @ 2026-04-23 12:42 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, Naman Jain, linux-hyperv, linux-arm-kernel,
	linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <20260423124206.2410879-1-namjain@linux.microsoft.com>

Enable ARM64 support in MSHV_VTL Kconfig now that all the necessary
support is present.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Roman Kisel <vdso@mailbox.org>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 115821cc535c..0bec3bc81a1a 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -87,7 +87,7 @@ config MSHV_ROOT
 
 config MSHV_VTL
 	tristate "Microsoft Hyper-V VTL driver"
-	depends on X86_64 && HYPERV_VTL_MODE
+	depends on (X86_64 || ARM64) && HYPERV_VTL_MODE
 	depends on HYPERV_VMBUS
 	# Mapping VTL0 memory to a userspace process in VTL2 is supported in OpenHCL.
 	# VTL2 for OpenHCL makes use of Huge Pages to improve performance on VMs,
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 14/15] Drivers: hv: Add 4K page dependency in MSHV_VTL
From: Naman Jain @ 2026-04-23 12:42 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, Naman Jain, linux-hyperv, linux-arm-kernel,
	linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <20260423124206.2410879-1-namjain@linux.microsoft.com>

Add a dependency on 4K page size to the MSHV_VTL Kconfig entry to
match page-size assumptions that may be present in the code.
x86 supports only the 4K page size anyway, and larger page sizes
have not been validated on arm64. Remove this dependency once
larger page sizes are supported.
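
As a rough illustration of the kind of assumption involved: the
hypervisor interface works in 4K pages, so with a larger kernel page size
every page frame number would need scaling before being passed to the
hypervisor. The sketch below is a standalone example; HV_PAGE_SHIFT and
both helper names are made up for illustration and are not taken from the
mshv_vtl code.

```c
#include <assert.h>
#include <stdint.h>

#define HV_PAGE_SHIFT	12	/* hypervisor interface granule: 4 KiB */

static_assert((UINT64_C(1) << HV_PAGE_SHIFT) == 4096,
	      "hypervisor page size assumed to be 4 KiB");

/* Number of 4 KiB hypervisor pages covered by one kernel page. */
static inline uint64_t hv_pages_per_kernel_page(unsigned int kernel_page_shift)
{
	return UINT64_C(1) << (kernel_page_shift - HV_PAGE_SHIFT);
}

/* First hypervisor PFN backing a given kernel PFN. */
static inline uint64_t kernel_pfn_to_hv_pfn(uint64_t kernel_pfn,
					    unsigned int kernel_page_shift)
{
	return kernel_pfn << (kernel_page_shift - HV_PAGE_SHIFT);
}
```

With 4K kernel pages both helpers are the identity, which is why the
dependency lets the code skip this conversion. With 64K pages, one kernel
page would span 16 hypervisor pages.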

Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/Kconfig | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 7937ac0cbd0f..115821cc535c 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -96,6 +96,11 @@ config MSHV_VTL
 	# MTRRs are controlled by VTL0, and are not specific to individual VTLs.
 	# Therefore, do not attempt to access or modify MTRRs here.
 	depends on !MTRR
+	# The hypervisor interface operates on 4k pages. Enforcing it here
+	# simplifies many assumptions in the mshv_vtl code.
+	# VTL0 VMs can still support higher page size in ARM64 and is not limited
+	# by this setting.
+	depends on PAGE_SIZE_4KB
 	select CPUMASK_OFFSTACK
 	select VIRT_XFER_TO_GUEST_WORK
 	default n
-- 
2.43.0


^ permalink raw reply related

