Linux-HyperV List

* [PATCH v2 2/2] RDMA/mana_ib: initialize err for empty send WR lists
From: Ruoyu Wang @ 2026-06-18  4:17 UTC
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Cheng Xu, Kai Shen, Long Li, Konstantin Taranov, linux-rdma,
	linux-hyperv, linux-kernel, Ruoyu Wang
In-Reply-To: <20260618041752.481193-1-ruoyuw560@gmail.com>

mana_ib_post_send() returns err after walking the send work request list.
If the caller passes an empty list, the loop is skipped and err is not
assigned.

Initialize err to 0 so an empty send work request list returns success
instead of stack data.

Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
v2:
- Split the erdma and mana_ib changes into separate patches.
- Add a driver-specific Fixes tag.

 drivers/infiniband/hw/mana/wr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mana/wr.c b/drivers/infiniband/hw/mana/wr.c
index 1813567d3b16c..36a1d506f08f6 100644
--- a/drivers/infiniband/hw/mana/wr.c
+++ b/drivers/infiniband/hw/mana/wr.c
@@ -144,7 +144,7 @@ static int mana_ib_post_send_ud(struct mana_ib_qp *qp, const struct ib_ud_wr *wr
 int mana_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 		      const struct ib_send_wr **bad_wr)
 {
-	int err;
+	int err = 0;
 	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
 
 	for (; wr; wr = wr->next) {
-- 
2.51.0

* Re: [PATCH v2 1/2] RDMA/erdma: initialize ret for empty receive WR lists
From: sashiko-bot @ 2026-06-18  4:30 UTC
  To: Ruoyu Wang; +Cc: linux-hyperv
In-Reply-To: <20260618041752.481193-1-ruoyuw560@gmail.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [High] The receive work request queue (RQ) can silently overflow because `erdma_post_recv` and `erdma_post_recv_one` lack a bounds check against the queue capacity, and `rq_ci` is not properly tracked during completion polling.
- [High] In `erdma_post_send`, when the send queue is full, `*bad_send_wr` is incorrectly assigned the head of the list (`send_wr`) instead of the failing work request (`wr`).
- [High] Missing `dma_wmb()` before updating the doorbell records (`sq_dbrec` and `rq_dbrec`) in host memory.
--

commit 7fd711bd13dff4553304f89e7b55436f427b3b9d
Author: Ruoyu Wang <ruoyuw560@gmail.com>

RDMA/erdma: initialize ret for empty receive WR lists

erdma_post_recv() returns ret after walking the receive work request list.
If the caller passes an empty list, the loop is skipped and ret is not
assigned. Initialize ret to 0 so an empty receive work request list returns
success instead of stack data.

> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
> index 25f6c49aec779..e002343832f74 100644
> --- a/drivers/infiniband/hw/erdma/erdma_qp.c
> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
> @@ -734,7 +734,7 @@ int erdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
>  	const struct ib_recv_wr *wr = recv_wr;
>  	struct erdma_qp *qp = to_eqp(ibqp);
>  	unsigned long flags;
> -	int ret;
> +	int ret = 0;
>  
>  	spin_lock_irqsave(&qp->lock, flags);
>  

[Severity: High]
This is a pre-existing issue, but does the receive work request queue (RQ)
have a risk of silent overflow?

While erdma_post_send() tracks send queue capacity, erdma_post_recv()
unconditionally posts receive work requests and advances rq_pi without a
bounds check against the queue capacity:

drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv() {
    ...
	while (wr) {
		ret = erdma_post_recv_one(qp, wr);
    ...
}

drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv_one() {
    ...
	qp->kern_qp.rq_pi++;
    ...
}

Additionally, receive queue completions don't seem to update kern_qp->rq_ci:

drivers/infiniband/hw/erdma/erdma_cq.c:erdma_poll_one_cqe() {
    ...
	} else {
		id_table = kern_qp->rwr_tbl;
		depth = qp->attrs.rq_size;
	}
    ...
}

If a kernel consumer posts more receives than the queue capacity, could the
ring buffer wrap around and overwrite active, unpolled work requests, leading
to data corruption and invalid WR IDs during polling?
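
A minimal userspace sketch of the kind of occupancy check the question above
points at, mirroring the u16 index arithmetic that erdma_post_send() already
uses. The names demo_rq, demo_rq_post() and demo_rq_complete() are
hypothetical stand-ins, not erdma code, and the sketch assumes the consumer
index is advanced once per polled receive completion:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the receive-ring state; not the erdma layout. */
struct demo_rq {
	uint16_t pi;    /* producer index, advanced when a receive is posted */
	uint16_t ci;    /* consumer index, advanced when a completion is polled */
	uint16_t depth; /* ring capacity in entries */
};

/* Post one receive, refusing when every slot is still outstanding. */
static int demo_rq_post(struct demo_rq *rq)
{
	/* The u16 cast keeps the difference correct across index wrap. */
	if ((uint16_t)(rq->pi - rq->ci) >= rq->depth)
		return -ENOMEM;

	rq->pi++;
	return 0;
}

/* Retire one receive; a poll path would call this per receive CQE. */
static void demo_rq_complete(struct demo_rq *rq)
{
	rq->ci++;
}
```

The check only works if the completion path keeps the consumer index current,
which is why the two observations in this comment belong together.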

[Severity: High]
This is a pre-existing issue, but is the failing work request assigned
correctly when the send queue is full?

In erdma_post_send(), if the queue becomes full after some work requests have
already been successfully posted, *bad_send_wr is assigned the head of the
list instead of the failing request:

drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_send() {
    ...
		if ((u16)(sq_pi - qp->kern_qp.sq_ci) >= qp->attrs.sq_size) {
			ret = -ENOMEM;
			*bad_send_wr = send_wr;
			break;
		}
    ...
}

Does this signal to the caller that the entire batch failed from the start,
prompting a retry of the entire list and potentially leading to duplicate
execution?
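
A small sketch of the expected bad_wr contract, assuming a simplified work
request list. demo_wr and demo_post_list() are hypothetical names, and
free_slots stands in for the real queue-capacity test. The loop cursor, not
the list head, is reported on failure, and the return value starts at 0 so an
empty list returns success:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down work request node. */
struct demo_wr {
	struct demo_wr *next;
	int id;
};

/*
 * Walk the list and consume one free slot per request. On failure,
 * *bad_wr points at the first request that was NOT posted, so the
 * caller can resume from there without re-posting earlier entries.
 */
static int demo_post_list(const struct demo_wr *head, unsigned int free_slots,
			  const struct demo_wr **bad_wr)
{
	const struct demo_wr *wr;
	int ret = 0; /* an empty list is a successful no-op */

	for (wr = head; wr; wr = wr->next) {
		if (free_slots == 0) {
			ret = -ENOMEM;
			*bad_wr = wr; /* the failing request, not head */
			break;
		}
		free_slots--;
	}

	return ret;
}
```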

[Severity: High]
This is a pre-existing issue, but should there be a memory barrier before
updating the doorbell records in host memory?

The driver updates doorbell records after constructing work queue elements
(WQEs/RQEs):

drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv_one() {
    ...
	*(u64 *)qp->kern_qp.rq_dbrec = *(u64 *)rqe;
    ...
}

drivers/infiniband/hw/erdma/erdma_qp.c:kick_sq_db() {
    ...
	*(u64 *)qp->kern_qp.sq_dbrec = db_data;
    ...
}

While the subsequent hardware doorbell ring via writeq() includes a barrier
before the MMIO operation, could the CPU reorder the memory stores of the WQE
with the memory store of the dbrec?

If the device polls the updated dbrec before the WQE stores are visible,
might it fetch a stale or partially written WQE via DMA?
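
As a userspace analogy only, the ordering in question can be sketched with
C11 atomics: descriptor stores first, then a release fence, then the
doorbell-record store. In the kernel the corresponding primitive would be
dma_wmb(); a C11 fence orders stores for other CPU threads and says nothing
about a DMA device, so this illustrates the pattern rather than the fix.
demo_queue and demo_publish() are hypothetical names:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue: a descriptor plus the record a consumer polls. */
struct demo_queue {
	uint64_t wqe[4];        /* descriptor payload, plain stores */
	_Atomic uint64_t dbrec; /* doorbell record, published last */
};

/* Write the descriptor, then publish the doorbell record. */
static void demo_publish(struct demo_queue *q, const uint64_t wqe[4],
			 uint64_t db)
{
	for (int i = 0; i < 4; i++)
		q->wqe[i] = wqe[i];

	/* Order the descriptor stores before the doorbell-record store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&q->dbrec, db, memory_order_relaxed);
}
```

A consumer that observes the new dbrec value with acquire ordering is then
guaranteed to see the completed descriptor.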

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260618041752.481193-1-ruoyuw560@gmail.com?part=1

* RE: [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Michael Kelley @ 2026-06-18 17:45 UTC
  To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
	thuth@redhat.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Michael Kelley
In-Reply-To: <20260609181030.2378391-2-kameroncarr@linux.microsoft.com>

From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
> 
> Add struct rsi_host_call to rsi_smc.h, which represents the host call
> data structure used by the Realm Management Monitor (RMM) for the
> RSI_HOST_CALL interface. The structure contains a 16-bit immediate field
> and 31 general-purpose register values, aligned to 256 bytes as required
> by the CCA RMM specification.
> 
> Add rsi_host_call() static inline wrapper in rsi_cmds.h that invokes
> SMC_RSI_HOST_CALL with the physical address of the host call structure.
> This will be used by Hyper-V guest code to route hypercalls through the
> RSI interface when running inside an Arm CCA Realm.
> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  arch/arm64/include/asm/rsi_cmds.h | 9 +++++++++
>  arch/arm64/include/asm/rsi_smc.h  | 6 ++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
> index 2c8763876dfb7..83b4b1f598454 100644
> --- a/arch/arm64/include/asm/rsi_cmds.h
> +++ b/arch/arm64/include/asm/rsi_cmds.h
> @@ -159,4 +159,13 @@ static inline unsigned long rsi_attestation_token_continue(phys_addr_t granule,
>  	return res.a0;
>  }
> 
> +static inline long rsi_host_call(phys_addr_t host_call_struct)
> +{
> +	struct arm_smccc_res res;
> +
> +	arm_smccc_smc(SMC_RSI_HOST_CALL, host_call_struct, 0, 0, 0, 0, 0, 0,
> +		      &res);
> +	return res.a0;
> +}

For consistent grouping, it seems like this inline function should
be placed after rsi_set_addr_range_state() since it follows the
same pattern. It's a bit different from the token functions.

> +
>  #endif /* __ASM_RSI_CMDS_H */
> diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
> index e19253f96c940..ffea93340ed7f 100644
> --- a/arch/arm64/include/asm/rsi_smc.h
> +++ b/arch/arm64/include/asm/rsi_smc.h
> @@ -142,6 +142,12 @@ struct realm_config {
>  	 */
>  } __aligned(0x1000);
> 
> +struct rsi_host_call {
> +	u16 immediate;

I don't see the "immediate" used anywhere in this patch set.
Is it always zero for the Hyper-V use cases?  Just curious ...

> +	u64 gprs[31];
> +} __aligned(256);
> +static_assert(sizeof(struct rsi_host_call) == 256);

This struct defines an ABI with the RMM layer, so I'd suggest
adding explicit padding of 6 bytes after the immediate so there's
no implicit dependency on the compiler adding the padding.
Sashiko had the same comment ....
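
A sketch of what the explicit padding could look like, written as standalone
C11 so the layout can be checked outside the kernel. demo_host_call and the
field name reserved are hypothetical; the sizes come from the struct quoted
above (2 + 6 + 31 * 8 = 256 bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the quoted struct, padding spelled out. */
struct demo_host_call {
	uint16_t immediate;
	uint8_t  reserved[6]; /* explicit padding up to the first register */
	uint64_t gprs[31];
} __attribute__((aligned(256)));

/* Pin the layout so a compiler or field change cannot shift it silently. */
static_assert(offsetof(struct demo_host_call, gprs) == 8,
	      "gprs must start at byte offset 8");
static_assert(sizeof(struct demo_host_call) == 256,
	      "host call structure must be exactly 256 bytes");
```

With the padding explicit, the offset assertion documents the ABI directly
instead of relying on the natural alignment of u64.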

Michael

> +
>  #endif /* __ASSEMBLER__ */
> 
>  /*
> --
> 2.45.4
> 


* RE: [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Michael Kelley @ 2026-06-18 17:45 UTC
  To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
	thuth@redhat.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Michael Kelley
In-Reply-To: <20260609181030.2378391-3-kameroncarr@linux.microsoft.com>

From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
> 
> Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and
> use rsi_host_call() to query the hypervisor vendor UUID when inside a
> Realm. The realm path is factored into a helper,
> arm_smccc_realm_get_hypervisor_uuid(), that owns a file-static
> rsi_host_call buffer (uuid_hc) serialized by a spinlock.
> 
> The RSI-specific includes, file-static state and helper are guarded
> with CONFIG_ARM64 because <asm/rsi.h> does not exist on 32-bit ARM.
> 
> For non-Realm environments, the existing arm_smccc_1_1_invoke() path
> is unchanged.
> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  drivers/firmware/smccc/smccc.c | 41 +++++++++++++++++++++++++++++++++-
>  1 file changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
> index bdee057db2fd3..6b465e65472b0 100644
> --- a/drivers/firmware/smccc/smccc.c
> +++ b/drivers/firmware/smccc/smccc.c
> @@ -12,6 +12,12 @@
>  #include <linux/platform_device.h>
>  #include <asm/archrandom.h>
> 
> +#ifdef CONFIG_ARM64
> +#include <linux/cleanup.h>
> +#include <linux/spinlock.h>
> +#include <asm/rsi.h>
> +#endif
> +
>  static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
>  static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
> 
> @@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
>  }
>  EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
> 
> +#ifdef CONFIG_ARM64
> +static struct rsi_host_call uuid_hc;
> +static DEFINE_SPINLOCK(uuid_hc_lock);

So evidently Sashiko is wrong in saying that struct rsi_host_call must be
in decrypted memory?

> 
> +/*
> + * Helper function to get the hypervisor UUID via an RsiHostCall.
> + */
> +static bool arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
> +{
> +	guard(spinlock_irqsave)(&uuid_hc_lock);
> +
> +	memset(&uuid_hc, 0, sizeof(uuid_hc));
> +	uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
> +
> +	if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS)
> +		return false;

Rather than having this function return a boolean to signal failure,
couldn't it just set res->a0 to SMCCC_RET_NOT_SUPPORTED like
arm_smccc_1_1_invoke()? Then arm_smccc_hypervisor_has_uuid()
could process both paths exactly the same way.
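
A rough sketch of that shape, with the RSI call and the SMCCC conduit
replaced by stubs so it stands alone. All demo_* names are hypothetical,
DEMO_RET_NOT_SUPPORTED stands in for SMCCC_RET_NOT_SUPPORTED, and
host_call_ok models whether rsi_host_call() returned RSI_SUCCESS:

```c
#include <stdbool.h>

/* Stand-in for SMCCC_RET_NOT_SUPPORTED in this sketch. */
#define DEMO_RET_NOT_SUPPORTED ((unsigned long)-1)

/* Hypothetical mirror of the four result registers used here. */
struct demo_res {
	unsigned long a0, a1, a2, a3;
};

/* Realm path: report failure through a0 instead of a bool return. */
static void demo_realm_query(struct demo_res *res, bool host_call_ok)
{
	if (!host_call_ok) {
		res->a0 = DEMO_RET_NOT_SUPPORTED;
		return;
	}
	res->a0 = 0x1; /* placeholder UUID words */
	res->a1 = 0x2;
	res->a2 = 0x3;
	res->a3 = 0x4;
}

/* Non-Realm path: stub for the existing conduit call. */
static void demo_conduit_query(struct demo_res *res)
{
	res->a0 = 0x1;
	res->a1 = 0x2;
	res->a2 = 0x3;
	res->a3 = 0x4;
}

/* Both paths now share one failure check; UUID comparison is elided. */
static bool demo_has_uuid(bool in_realm, bool host_call_ok)
{
	struct demo_res res = { 0 };

	if (in_realm)
		demo_realm_query(&res, host_call_ok);
	else
		demo_conduit_query(&res);

	if (res.a0 == DEMO_RET_NOT_SUPPORTED)
		return false;

	return true;
}
```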

> +
> +	res->a0 = uuid_hc.gprs[0];
> +	res->a1 = uuid_hc.gprs[1];
> +	res->a2 = uuid_hc.gprs[2];
> +	res->a3 = uuid_hc.gprs[3];
> +	return true;
> +}
> +#endif
> +
>  bool arm_smccc_hypervisor_has_uuid(const uuid_t *hyp_uuid)
>  {
>  	struct arm_smccc_res res = {};
>  	uuid_t uuid;
> 
> -	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
> +#ifdef CONFIG_ARM64
> +	if (is_realm_world()) {
> +		if (!arm_smccc_realm_get_hypervisor_uuid(&res))
> +			return false;
> +	} else
> +#endif
> +
> +		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID,
> +				     &res);
> +
>  	if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
>  		return false;
> 
> --
> 2.45.4
> 


* RE: [RFC PATCH 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Michael Kelley @ 2026-06-18 17:46 UTC
  To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
	thuth@redhat.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Michael Kelley
In-Reply-To: <20260609181030.2378391-4-kameroncarr@linux.microsoft.com>

From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
> 
> Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
> route them through the RSI_HOST_CALL interface, which takes the IPA of a
> per-CPU rsi_host_call structure as its argument.
> 
> Add hyperv_pcpu_hostcall_struct as a per-CPU pointer to that buffer and
> allocate it for the boot CPU during hyperv_init() and for each secondary
> CPU in hv_cpu_init(). The allocation is gated on is_realm_world() so
> non-Realm arm64 Hyper-V guests pay no memory cost.

I wonder if there's a simpler approach here. What about calculating the
total size of struct rsi_host_call needed for all CPUs, then doing a single
dynamic allocation to effectively create an array of entries? Each CPU
would just index into the array with its processor ID. You could still have
a per-cpu pointer that points to the correct array entry to avoid the need
to get the processor ID, but I wonder if even that is worth the trouble. Since
struct rsi_host_call size is a power of 2, the indexing is just a simple shift.

The hyperv_pcpu_input_page is allocated the way it is because it's much
bigger. But 16 struct rsi_host_call entries fit into a single 4 KiB page, so
there's no danger of hitting a memory allocation limit at boot time. Even
with 8192 CPUs the allocation is only 2 MiB.
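
A userspace sketch of the single-allocation idea, using aligned_alloc() in
place of a kernel allocator. demo_host_call, demo_alloc_all() and
demo_cpu_entry() are hypothetical names; the only properties taken from the
thread are the 256-byte size and alignment of each entry:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the 256-byte host call structure. */
struct demo_host_call {
	uint16_t immediate;
	uint8_t  reserved[6];
	uint64_t gprs[31];
} __attribute__((aligned(256)));

/* One zeroed allocation covering every possible CPU (nr_cpus > 0). */
static struct demo_host_call *demo_alloc_all(unsigned int nr_cpus)
{
	size_t bytes = (size_t)nr_cpus * sizeof(struct demo_host_call);
	struct demo_host_call *base = aligned_alloc(256, bytes);

	if (base)
		memset(base, 0, bytes);
	return base;
}

/* Per-CPU lookup is plain array indexing, i.e. base + (cpu << 8) bytes. */
static struct demo_host_call *demo_cpu_entry(struct demo_host_call *base,
					     unsigned int cpu)
{
	return &base[cpu];
}
```

Freeing also collapses to a single call, so the error paths in hyperv_init()
would not need a per-CPU loop.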

Michael

> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/mshyperv.c      | 78 ++++++++++++++++++++++++++++++-
>  arch/arm64/include/asm/mshyperv.h |  3 ++
>  2 files changed, 79 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 4fdc26ade1d74..08fec82691683 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -15,10 +15,16 @@
>  #include <linux/errno.h>
>  #include <linux/version.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/slab.h>
> +#include <linux/percpu.h>
>  #include <asm/mshyperv.h>
> +#include <asm/rsi.h>
> 
>  static bool hyperv_initialized;
> 
> +void * __percpu *hyperv_pcpu_hostcall_struct;
> +EXPORT_SYMBOL_GPL(hyperv_pcpu_hostcall_struct);
> +
>  int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>  {
>  	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
> @@ -60,6 +66,46 @@ static bool __init hyperv_detect_via_acpi(void)
> 
>  #endif
> 
> +static void hv_hostcall_free(void)
> +{
> +	int cpu;
> +
> +	if (!hyperv_pcpu_hostcall_struct)
> +		return;
> +
> +	for_each_possible_cpu(cpu)
> +		kfree(*per_cpu_ptr(hyperv_pcpu_hostcall_struct, cpu));
> +	free_percpu(hyperv_pcpu_hostcall_struct);
> +	hyperv_pcpu_hostcall_struct = NULL;
> +}
> +
> +static int hv_cpu_init(unsigned int cpu)
> +{
> +	void **hostcall_struct;
> +	gfp_t flags;
> +	void *mem;
> +
> +	if (hyperv_pcpu_hostcall_struct) {
> +		/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
> +		flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
> +
> +		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		/*
> +		 * The hostcall_struct memory is not freed when the CPU
> +		 * goes offline. If a previously offlined CPU is brought
> +		 * back online, the memory is reused here.
> +		 */
> +		if (!*hostcall_struct) {
> +			mem = kzalloc_obj(struct rsi_host_call, flags);
> +			if (!mem)
> +				return -ENOMEM;
> +			*hostcall_struct = mem;
> +		}
> +	}
> +
> +	return hv_common_cpu_init(cpu);
> +}
> +
>  static bool __init hyperv_detect_via_smccc(void)
>  {
>  	uuid_t hyperv_uuid = UUID_INIT(
> @@ -73,6 +119,8 @@ static bool __init hyperv_detect_via_smccc(void)
>  static int __init hyperv_init(void)
>  {
>  	struct hv_get_vp_registers_output	result;
> +	void **hostcall_struct;
> +	void *mem;
>  	u64	guest_id;
>  	int	ret;
> 
> @@ -85,6 +133,27 @@ static int __init hyperv_init(void)
>  	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
>  		return 0;
> 
> +	/*
> +	 * The RSI host-call buffer is only ever used when
> +	 * is_realm_world() is true. Skip the per-CPU allocation on
> +	 * non-Realm guests.
> +	 */
> +	if (is_realm_world()) {
> +		hyperv_pcpu_hostcall_struct = alloc_percpu(void *);
> +		if (!hyperv_pcpu_hostcall_struct)
> +			return -ENOMEM;
> +
> +		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		if (!*hostcall_struct) {
> +			mem = kzalloc_obj(struct rsi_host_call);
> +			if (!mem) {
> +				ret = -ENOMEM;
> +				goto free_hostcall_mem;
> +			}
> +			*hostcall_struct = mem;
> +		}
> +	}
> +
>  	/* Setup the guest ID */
>  	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
>  	hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
> @@ -106,12 +175,13 @@ static int __init hyperv_init(void)
> 
>  	ret = hv_common_init();
>  	if (ret)
> -		return ret;
> +		goto free_hostcall_mem;
> 
>  	ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "arm64/hyperv_init:online",
> -				hv_common_cpu_init, hv_common_cpu_die);
> +				hv_cpu_init, hv_common_cpu_die);
>  	if (ret < 0) {
>  		hv_common_free();
> +		hv_hostcall_free();
>  		return ret;
>  	}
> 
> @@ -125,6 +195,10 @@ static int __init hyperv_init(void)
> 
>  	hyperv_initialized = true;
>  	return 0;
> +
> +free_hostcall_mem:
> +	hv_hostcall_free();
> +	return ret;
>  }
> 
>  early_initcall(hyperv_init);
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab66..65a00bd14c6cb 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -63,4 +63,7 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
> 
>  #include <asm-generic/mshyperv.h>
> 
> +/* Per-CPU RSI host call structure for CCA Realms */
> +extern void *__percpu *hyperv_pcpu_hostcall_struct;
> +
>  #endif
> --
> 2.45.4
> 


* RE: [RFC PATCH 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: Michael Kelley @ 2026-06-18 17:46 UTC
  To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
	thuth@redhat.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Michael Kelley
In-Reply-To: <20260609181030.2378391-6-kameroncarr@linux.microsoft.com>

From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
> 
> Modify the five hypercall wrapper functions to check is_realm_world()
> and use the per-CPU rsi_host_call structure when inside a Realm.
> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/hv_core.c | 175 +++++++++++++++++++++++++++++-------
>  1 file changed, 141 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> index e33a9e3c366a1..1759998ef2667 100644
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
> @@ -16,6 +16,7 @@
>  #include <asm-generic/bug.h>
>  #include <hyperv/hvhdk.h>
>  #include <asm/mshyperv.h>
> +#include <asm/rsi.h>
> 
>  /*
>   * hv_do_hypercall- Invoke the specified hypercall
> @@ -25,12 +26,32 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	struct arm_smccc_res	res;
>  	u64			input_address;
>  	u64			output_address;
> +	struct rsi_host_call *hostcall;
> +	unsigned long flags;
> +	u64 ret;
> 
>  	input_address = input ? virt_to_phys(input) : 0;
>  	output_address = output ? virt_to_phys(output) : 0;
> 
> -	arm_smccc_1_1_hvc(HV_FUNC_ID, control,
> -			  input_address, output_address, &res);
> +	if (is_realm_world()) {
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));
> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = control;
> +		hostcall->gprs[2] = input_address;
> +		hostcall->gprs[3] = output_address;
> +
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> +			ret = hostcall->gprs[0];
> +		else
> +			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		local_irq_restore(flags);
> +		return ret;

This code sequence for handling the realm case is almost exactly
duplicated for the three hypercall variants. The only difference is
how gprs[2] and gprs[3] are populated. So I think the code
sequence could go into a helper routine with the appropriate
values for gprs[2] and gprs[3] passed in. 
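
One possible shape for such a helper, sketched in standalone C. Everything
named demo_* is hypothetical, the constants are placeholders rather than the
real HV_FUNC_ID or HV_STATUS_INVALID_HYPERCALL_INPUT values, and the
transport is passed in as a function pointer so the sketch can run without
an RMM. The IRQ save/restore and per-CPU buffer lookup from the patch are
deliberately left out:

```c
#include <stdint.h>
#include <string.h>

#define DEMO_FUNC_ID              0x1234u /* placeholder function ID */
#define DEMO_STATUS_INVALID_INPUT 0xffffu /* placeholder failure status */

/* Hypothetical mirror of the host call buffer. */
struct demo_host_call {
	uint16_t immediate;
	uint8_t  reserved[6];
	uint64_t gprs[31];
};

/* Transport hook: returns 0 on success, like RSI_SUCCESS in the patch. */
typedef int (*demo_host_call_fn)(struct demo_host_call *hc);

/* Shared body for the three variants; only arg2 and arg3 differ. */
static uint64_t demo_realm_hypercall(struct demo_host_call *hc,
				     demo_host_call_fn call, uint64_t control,
				     uint64_t arg2, uint64_t arg3)
{
	memset(hc, 0, sizeof(*hc));
	hc->gprs[0] = DEMO_FUNC_ID;
	hc->gprs[1] = control;
	hc->gprs[2] = arg2;
	hc->gprs[3] = arg3;

	if (call(hc) != 0)
		return DEMO_STATUS_INVALID_INPUT;

	return hc->gprs[0];
}

/* Fake transports for exercising the helper. */
static int demo_fake_ok(struct demo_host_call *hc)
{
	hc->gprs[0] = hc->gprs[2] + hc->gprs[3];
	return 0;
}

static int demo_fake_fail(struct demo_host_call *hc)
{
	(void)hc;
	return 1;
}
```

The 8-byte fast variant would pass 0 for arg3, which matches what the memset
in the quoted code already leaves in gprs[3].
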

> +	}
> +
> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
> +			  output_address, &res);
>  	return res.a0;
>  }
>  EXPORT_SYMBOL_GPL(hv_do_hypercall);
> @@ -45,9 +66,28 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>  {
>  	struct arm_smccc_res	res;
>  	u64			control;
> +	struct rsi_host_call *hostcall;
> +	unsigned long flags;
> +	u64 ret;
> 
>  	control = (u64)code | HV_HYPERCALL_FAST_BIT;
> 
> +	if (is_realm_world()) {
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));
> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = control;
> +		hostcall->gprs[2] = input;
> +
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> +			ret = hostcall->gprs[0];
> +		else
> +			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		local_irq_restore(flags);
> +		return ret;
> +	}
> +
>  	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input, &res);
>  	return res.a0;
>  }
> @@ -62,9 +102,29 @@ u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>  {
>  	struct arm_smccc_res	res;
>  	u64			control;
> +	struct rsi_host_call *hostcall;
> +	unsigned long flags;
> +	u64 ret;
> 
>  	control = (u64)code | HV_HYPERCALL_FAST_BIT;
> 
> +	if (is_realm_world()) {
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));
> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = control;
> +		hostcall->gprs[2] = input1;
> +		hostcall->gprs[3] = input2;
> +
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> +			ret = hostcall->gprs[0];
> +		else
> +			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		local_irq_restore(flags);
> +		return ret;
> +	}
> +
>  	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
>  	return res.a0;
>  }
> @@ -76,24 +136,44 @@ EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
>  void hv_set_vpreg(u32 msr, u64 value)
>  {
>  	struct arm_smccc_res res;
> +	struct rsi_host_call *hostcall;
> +	unsigned long flags;
> +	u64 status;
> +
> +	if (is_realm_world()) {
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));
> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = HVCALL_SET_VP_REGISTERS |
> +				    HV_HYPERCALL_FAST_BIT |
> +				    HV_HYPERCALL_REP_COMP_1;
> +		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
> +		hostcall->gprs[3] = HV_VP_INDEX_SELF;
> +		hostcall->gprs[4] = msr;
> +		hostcall->gprs[6] = value;
> 
> -	arm_smccc_1_1_hvc(HV_FUNC_ID,
> -		HVCALL_SET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> -			HV_HYPERCALL_REP_COMP_1,
> -		HV_PARTITION_ID_SELF,
> -		HV_VP_INDEX_SELF,
> -		msr,
> -		0,
> -		value,
> -		0,
> -		&res);
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> +			status = hostcall->gprs[0];
> +		else
> +			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		local_irq_restore(flags);
> +	} else {
> +		arm_smccc_1_1_hvc(HV_FUNC_ID,
> +				  HVCALL_SET_VP_REGISTERS |
> +					  HV_HYPERCALL_FAST_BIT |
> +					  HV_HYPERCALL_REP_COMP_1,
> +				  HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF, msr,
> +				  0, value, 0, &res);
> +		status = res.a0;
> +	}
> 
>  	/*
> -	 * Something is fundamentally broken in the hypervisor if
> -	 * setting a VP register fails. There's really no way to
> -	 * continue as a guest VM, so panic.
> +	 * Something is fundamentally broken in the hypervisor (or, in a
> +	 * Realm, the RMM denied the host call) if setting a VP register
> +	 * fails. There's really no way to continue as a guest VM, so panic.
>  	 */
> -	BUG_ON(!hv_result_success(res.a0));
> +	BUG_ON(!hv_result_success(status));
>  }
>  EXPORT_SYMBOL_GPL(hv_set_vpreg);
> 
> @@ -108,29 +188,56 @@ void hv_get_vpreg_128(u32 msr, struct hv_get_vp_registers_output *result)
>  {
>  	struct arm_smccc_1_2_regs args;
>  	struct arm_smccc_1_2_regs res;
> +	struct rsi_host_call *hostcall;
> +	u64 status;
> 
> -	args.a0 = HV_FUNC_ID;
> -	args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> -			HV_HYPERCALL_REP_COMP_1;
> -	args.a2 = HV_PARTITION_ID_SELF;
> -	args.a3 = HV_VP_INDEX_SELF;
> -	args.a4 = msr;
> +	if (is_realm_world()) {
> +		unsigned long flags;
> 
> -	/*
> -	 * Use the SMCCC 1.2 interface because the results are in registers
> -	 * beyond X0-X3.
> -	 */
> -	arm_smccc_1_2_hvc(&args, &res);
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));
> +
> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = HVCALL_GET_VP_REGISTERS |
> +				    HV_HYPERCALL_FAST_BIT |
> +				    HV_HYPERCALL_REP_COMP_1;
> +		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
> +		hostcall->gprs[3] = HV_VP_INDEX_SELF;
> +		hostcall->gprs[4] = msr;
> +
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS) {
> +			status = hostcall->gprs[0];
> +			result->as64.low = hostcall->gprs[6];
> +			result->as64.high = hostcall->gprs[7];
> +		} else {
> +			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		}
> +		local_irq_restore(flags);
> +	} else {
> +		args.a0 = HV_FUNC_ID;
> +		args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> +			  HV_HYPERCALL_REP_COMP_1;
> +		args.a2 = HV_PARTITION_ID_SELF;
> +		args.a3 = HV_VP_INDEX_SELF;
> +		args.a4 = msr;
> +
> +		/*
> +		 * Use the SMCCC 1.2 interface because the results are in
> +		 * registers beyond X0-X3.
> +		 */
> +		arm_smccc_1_2_hvc(&args, &res);
> +		status = res.a0;
> +		result->as64.low = res.a6;
> +		result->as64.high = res.a7;
> +	}
> 
>  	/*
> -	 * Something is fundamentally broken in the hypervisor if
> -	 * getting a VP register fails. There's really no way to
> -	 * continue as a guest VM, so panic.
> +	 * Something is fundamentally broken in the hypervisor (or, in a
> +	 * Realm, the RMM denied the host call) if getting a VP register
> +	 * fails. There's really no way to continue as a guest VM, so panic.
>  	 */
> -	BUG_ON(!hv_result_success(res.a0));
> -
> -	result->as64.low = res.a6;
> -	result->as64.high = res.a7;
> +	BUG_ON(!hv_result_success(status));
>  }
>  EXPORT_SYMBOL_GPL(hv_get_vpreg_128);
> 
> --
> 2.45.4
> 


* RE: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Haiyang Zhang @ 2026-06-18 18:38 UTC
  To: Dexuan Cui, KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
	Long Li, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	Konstantin Taranov, horms@kernel.org, ernis@linux.microsoft.com,
	dipayanroy@linux.microsoft.com, kees@kernel.org,
	jacob.e.keller@intel.com, ssengar@linux.microsoft.com,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
  Cc: stable@vger.kernel.org
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Wednesday, June 17, 2026 11:50 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Konstantin Taranov
> <kotaranov@microsoft.com>; horms@kernel.org; ernis@linux.microsoft.com;
> dipayanroy@linux.microsoft.com; kees@kernel.org; jacob.e.keller@intel.com;
> ssengar@linux.microsoft.com; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org
> Cc: stable@vger.kernel.org
> Subject: [PATCH net] net: mana: Sync page pool RX frags for CPU
> 
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
> 
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
> 
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
> 
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.
> 
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers
> instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
>  include/net/mana/mana.h                       |  8 +++
>  2 files changed, 57 insertions(+), 12 deletions(-)

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>



* [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Erni Sri Satya Vennela @ 2026-06-19  5:53 UTC
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, ssengar, jacob.e.keller,
	ernis, horms, gargaditya, kees, linux-hyperv, netdev,
	linux-kernel, bpf

Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().

Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.

MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.

Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.
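
For illustration only, the three outcomes described above (0 falls back,
small non-zero values are rejected, everything else is accepted) can be
modelled in a few lines of userspace C. The helper names below are
hypothetical and are not functions in the mana driver. The constants
mirror the kernel's values, and -1 stands in for -EPROTO.

```c
#include <stdint.h>

/* Values mirror the kernel's definitions in <linux/if_ether.h>. */
#define ETH_HLEN      14u    /* Ethernet header length */
#define ETH_MIN_MTU   68u    /* minimum IPv4 MTU */
#define ETH_FRAME_LEN 1514u  /* maximum frame length without FCS */

/*
 * Hypothetical helper (not driver code): validate the adapter_mtu
 * reported by the PF and store the value to use in *out.
 * Returns 0 on success, -1 if the value must be rejected.
 */
static int sketch_pick_adapter_mtu(uint32_t reported, uint32_t *out)
{
	if (reported == 0) {
		/* PF did not advertise a value: standard MTU only. */
		*out = ETH_FRAME_LEN;
		return 0;
	}
	if (reported < ETH_MIN_MTU + ETH_HLEN)
		return -1; /* small non-zero values stay rejected */

	*out = reported;
	return 0;
}

/* max_mtu as derived from adapter_mtu: an unsigned subtraction. */
static uint32_t sketch_max_mtu(uint32_t adapter_mtu)
{
	return adapter_mtu - ETH_HLEN;
}
```

With these definitions, a reported value of 0 yields an adapter MTU of
1514 and a max_mtu of 1500, while feeding an unvalidated value such as
10 into sketch_max_mtu() wraps to a value close to UINT32_MAX, which is
the case the range check is there to prevent.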

Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_bpf.c |  3 ++-
 drivers/net/ethernet/microsoft/mana/mana_en.c  | 16 ++++++++++++++--
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index b5e9bb184a1d..53308e139cbe 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
 		bpf_prog_put(old_prog);
 
 	if (prog)
-		ndev->max_mtu = MANA_XDP_MTU_MAX;
+		ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
+				      gc->adapter_mtu - ETH_HLEN);
 	else
 		ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 87862b0434c7..7438ea6b3f26 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1233,12 +1233,24 @@ int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
 	*max_num_vports = resp.max_num_vports;
 
 	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
-		if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+		if (resp.adapter_mtu == 0) {
+			/*
+			 * Some older PF firmware versions report an
+			 * adapter_mtu of 0. MANA hardware always supports the
+			 * standard Ethernet MTU, so fall back to ETH_FRAME_LEN.
+			 * Jumbo frames will not be available in this case.
+			 */
+			dev_info(dev,
+				 "PF reported adapter_mtu of 0, falling back to %u (jumbo frames disabled)\n",
+				 ETH_FRAME_LEN);
+			gc->adapter_mtu = ETH_FRAME_LEN;
+		} else if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
 			dev_err(dev, "Adapter MTU too small: %u\n",
 				resp.adapter_mtu);
 			return -EPROTO;
+		} else {
+			gc->adapter_mtu = resp.adapter_mtu;
 		}
-		gc->adapter_mtu = resp.adapter_mtu;
 	} else {
 		gc->adapter_mtu = ETH_FRAME_LEN;
 	}
-- 
2.34.1


^ permalink raw reply related

* [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-19  7:33 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In the mana driver, the number of IRQs allocated is capped at
min(num_cpu + 1, queue count). When the IRQ count is greater than the
vCPU count, we want to use all the vCPUs, irrespective of their
NUMA/core bindings.

This matters especially in environments with so few vCPUs that the
softIRQ handling overhead of two IRQs on the same vCPU is much higher
than it would be if the IRQs were spread across sibling vCPUs.

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU, while some vCPUs have none.

In such cases, throughput drops significantly when many parallel TCP
connections are tested.

We also studied the effect of setting the affinity and hint to NULL in
these cases. If other (non-MANA) IRQs are already allocated on the VM
when the MANA IRQs are allocated, this again leads to clustering of the
MANA queue IRQs. These results are shown as case 3 in the following data.

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Case 3: with affinity set to NULL
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC			0
IRQ1:	mana_q1			2
IRQ2:	mana_q2			3
IRQ3:	mana_q3			2
IRQ4:	mana_q4			3

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch	aff NULL
20480		15.65		7.73		5.25
10240		15.63		8.93		5.77
8192		15.64		9.69		7.16
6144		15.64		13.16		9.33
4096		15.69		15.75		13.50
2048		15.69		15.83		13.61
1024		15.71		15.28		13.60
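
As a rough userspace model of the Case 2 layout above, the sketch below
pins one queue IRQ to each online CPU in order and leaves any surplus
IRQs at their default affinity. All names are hypothetical. It only
mirrors the assignment order and makes no kernel calls.

```c
#include <stddef.h>

/*
 * Hypothetical userspace model (not driver code): online_cpus[] lists
 * the IDs of the online CPUs, and irq_cpu[i] receives the CPU chosen
 * for queue IRQ i, or -1 if that IRQ keeps its default affinity.
 * Returns the number of IRQs that were pinned.
 */
static size_t sketch_spread_linear(const int *online_cpus, size_t ncpus,
				   int *irq_cpu, size_t nirqs)
{
	size_t i, pinned = 0;

	for (i = 0; i < nirqs; i++)
		irq_cpu[i] = -1; /* default: affinity left untouched */

	/* One IRQ per online CPU, in CPU order, until either runs out. */
	for (i = 0; i < ncpus && i < nirqs; i++) {
		irq_cpu[i] = online_cpus[i];
		pinned++;
	}
	return pinned;
}
```

With four online CPUs and four queue IRQs this gives the 0, 1, 2, 3
placement shown in Case 2. If fewer CPUs are online than there are
queue IRQs, the trailing entries stay at -1.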

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---
Changes in v4
 * Add mana prefix on irq_affinity_*() in mana driver
 * Corrected grammar, comment for mana_irq_setup_linear()
 * added new line as per guidelines
 * added case 3 in commit message for when affinity is NULL
---
Changes in v3
 * Optimize the comments in mana_gd_setup_dyn_irqs()
 * add more details in the dev_dbg for extra IRQs
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
 1 file changed, 64 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..e8b7ffb47eb9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
  * do the same thing.
  */
 
-static int irq_setup(unsigned int *irqs, unsigned int len, int node,
-		     bool skip_first_cpu)
+static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
+				     int node, bool skip_first_cpu)
 {
 	const struct cpumask *next, *prev = cpu_none_mask;
 	cpumask_var_t cpus __free(free_cpumask_var);
@@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* must be called with cpus_read_lock() held */
+static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, err, i, msi;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 		return -ENOMEM;
 
 	/*
+	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
+	 * nvec is only Queue IRQ (HWC already setup).
 	 * While processing the next pci irq vector, we start with index 1,
 	 * as IRQ vector at index 0 is already processed for HWC.
 	 * However, the population of irqs array starts with index 0, to be
-	 * further used in irq_setup()
+	 * further used in mana_irq_setup_numa_aware()
 	 */
 	for (i = 1; i <= nvec; i++) {
 		msi = i;
@@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	}
 
 	/*
-	 * When calling irq_setup() for dynamically added IRQs, if number of
-	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
-	 * first CPU sibling group since they are already affinitized to HWC IRQ
+	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
+	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
+	 * skip the first CPU sibling group since they are already affinitized to
+	 * HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
+						true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, our
+		 * queue IRQs should be equal to num of online vCPUs.
+		 * We try to make sure queue IRQs spread across all vCPUs.
+		 * In such a case NUMA or CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, queue IRQs will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		int first_unassigned = num_online_cpus();
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		if (nvec > first_unassigned) {
+			char buf[32];
+
+			if (first_unassigned == nvec - 1)
+				snprintf(buf, sizeof(buf), "%d",
+					 first_unassigned);
+			else
+				snprintf(buf, sizeof(buf), "%d-%d",
+					 first_unassigned, nvec - 1);
+
+			dev_dbg(&pdev->dev,
+				"MANA IRQ indices #%s will retain the default CPU affinity\n",
+				buf);
+		}
+
+		mana_irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();
@@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 		nvec -= 1;
 	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, false);
+	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Simon Horman @ 2026-06-19  9:05 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, andrew+netdev, davem, edumazet,
	kuba, pabeni, kotaranov, ernis, dipayanroy, kees, jacob.e.keller,
	ssengar, linux-hyperv, netdev, linux-kernel, linux-rdma, stable
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>

On Wed, Jun 17, 2026 at 08:50:29PM -0700, Dexuan Cui wrote:
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
> 
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
> 
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
> 
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.

I think this last part warrants being split out into a separate patch.

> 
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

...

^ permalink raw reply

* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-19 13:55 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>

On Fri, Jun 19, 2026 at 12:33:35AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> We also studied the results of setting the affinity and hint to
> NULL in these cases, and observed that, with this logic if there are
> pre existing IRQs allocated on the VM(apart from MANA), during MANA
> IRQs allocation, it leads to clustering of the MANA queue IRQs again.
> These results can be seen through case 3 in the following data.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Case 3: with affinity set to NULL
> =======================================================
> 4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC			0
> IRQ1:	mana_q1			2
> IRQ2:	mana_q2			3
> IRQ3:	mana_q3			2
> IRQ4:	mana_q4			3
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch	aff NULL
> 20480		15.65		7.73		5.25
> 10240		15.63		8.93		5.77
> 8192		15.64		9.69		7.16
> 6144		15.64		13.16		9.33
> 4096		15.69		15.75		13.50
> 2048		15.69		15.83		13.61
> 1024		15.71		15.28		13.60
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>

Reviewed-by: Yury Norov <ynorov@nvidia.com>

> ---
> Changes in v4
>  * Add mana prefix on irq_affinity_*() in mana driver
>  * Corrected grammar, comment for mana_irq_setup_linear()
>  * added new line as per guidelines
>  * added case 3 in commit message for when affinity is NULL
> ---
> Changes in v3
>  * Optimize the comments in mana_gd_setup_dyn_irqs()
>  * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
>  * Removed the unused skip_first_cpu variable
>  * fixed exit condition in irq_setup_linear() with len == 0
>  * changed return type of irq_setup_linear() as it will always be 0
>  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
>  * added appropriate comments to indicate expected behaviour when
>    IRQs are more than or equal to num_online_cpus()
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
>  1 file changed, 64 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index a0fdd052d7f1..e8b7ffb47eb9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	} else {
>  		/* If dynamic allocation is enabled we have already allocated
>  		 * hwc msi
> +		 * Also, we make sure in this case the following is always true
> +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
>  		 */
>  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
>  	}
> @@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>   * do the same thing.
>   */
>  
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> -		     bool skip_first_cpu)
> +static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
> +				     int node, bool skip_first_cpu)
>  {
>  	const struct cpumask *next, *prev = cpu_none_mask;
>  	cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +/* must be called with cpus_read_lock() held */
> +static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	bool skip_first_cpu = false;
>  	int *irqs, err, i, msi;
>  
>  	irqs = kmalloc_objs(int, nvec);
> @@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  		return -ENOMEM;
>  
>  	/*
> +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> +	 * nvec is only Queue IRQ (HWC already setup).
>  	 * While processing the next pci irq vector, we start with index 1,
>  	 * as IRQ vector at index 0 is already processed for HWC.
>  	 * However, the population of irqs array starts with index 0, to be
> -	 * further used in irq_setup()
> +	 * further used in mana_irq_setup_numa_aware()
>  	 */
>  	for (i = 1; i <= nvec; i++) {
>  		msi = i;
> @@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	}
>  
>  	/*
> -	 * When calling irq_setup() for dynamically added IRQs, if number of
> -	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
> -	 * first CPU sibling group since they are already affinitized to HWC IRQ
> +	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
> +	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
> +	 * skip the first CPU sibling group since they are already affinitized to
> +	 * HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
> +						true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;
> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, our
> +		 * queue IRQs should be equal to num of online vCPUs.
> +		 * We try to make sure queue IRQs spread across all vCPUs.
> +		 * In such a case NUMA or CPU core affinity does not matter.
> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, queue IRQs will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		int first_unassigned = num_online_cpus();
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> -	if (err) {
> -		cpus_read_unlock();
> -		goto free_irq;
> +		if (nvec > first_unassigned) {
> +			char buf[32];
> +
> +			if (first_unassigned == nvec - 1)
> +				snprintf(buf, sizeof(buf), "%d",
> +					 first_unassigned);
> +			else
> +				snprintf(buf, sizeof(buf), "%d-%d",
> +					 first_unassigned, nvec - 1);
> +
> +			dev_dbg(&pdev->dev,
> +				"MANA IRQ indices #%s will retain the default CPU affinity\n",
> +				buf);
> +		}
> +
> +		mana_irq_setup_linear(irqs, nvec);
>  	}
>  
>  	cpus_read_unlock();
> @@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
>  		nvec -= 1;
>  	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, false);
> +	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> -- 
> 2.34.1

^ permalink raw reply

* RE: [EXTERNAL] [PATCH v2 2/2] RDMA/mana_ib: initialize err for empty send WR lists
From: Long Li @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Ruoyu Wang, Jason Gunthorpe, Leon Romanovsky
  Cc: Cheng Xu, Kai Shen, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260618041752.481193-2-ruoyuw560@gmail.com>

> mana_ib_post_send() returns err after walking the send work request list.
> If the caller passes an empty list, the loop is skipped and err is not assigned.
> 
> Initialize err to 0 so an empty send work request list returns success instead of
> stack data.
> 
> Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
> Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
> v2:
> - Split the erdma and mana_ib changes into separate patches.
> - Add a driver-specific Fixes tag.
> 
>  drivers/infiniband/hw/mana/wr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/mana/wr.c b/drivers/infiniband/hw/mana/wr.c
> index 1813567d3b16c..36a1d506f08f6 100644
> --- a/drivers/infiniband/hw/mana/wr.c
> +++ b/drivers/infiniband/hw/mana/wr.c
> @@ -144,7 +144,7 @@ static int mana_ib_post_send_ud(struct mana_ib_qp *qp, const struct ib_ud_wr *wr
>  int mana_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
>  		      const struct ib_send_wr **bad_wr)
>  {
> -	int err;
> +	int err = 0;
>  	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
>  
>  	for (; wr; wr = wr->next) {
> --
> 2.51.0

^ permalink raw reply

* Re: [PATCH rdma-next v3] RDMA/mana_ib: Clamp adapter capabilities at the ib_device_attr boundary
From: Erni Sri Satya Vennela @ 2026-06-19 19:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: longli, kotaranov, Jason Gunthorpe, linux-rdma, linux-hyperv,
	linux-kernel
In-Reply-To: <20260611111745.GM327369@unreal>

On Thu, Jun 11, 2026 at 02:17:45PM +0300, Leon Romanovsky wrote:
> On Mon, May 25, 2026 at 12:01:01PM -0700, Erni Sri Satya Vennela wrote:
> > mana_ib stores its adapter capabilities internally as u32 in
> > struct mana_ib_adapter_caps. The IB core, however, exposes the
> > corresponding device attributes through struct ib_device_attr, where
> > fields such as max_qp, max_qp_wr, max_send_sge, max_recv_sge,
> > max_sge_rd, max_cq, max_cqe, max_mr, max_pd, max_qp_rd_atom,
> > max_res_rd_atom and max_qp_init_rd_atom are signed int.
> > 
> > mana_ib_query_device() is the only place that copies the cached u32
> > caps into these int fields. If a cap exceeds INT_MAX, the implicit
> > u32-to-int narrowing yields a negative value. Clamp each cap to
> > INT_MAX at this boundary so the values handed to the IB core are always
> > non-negative.
> > 
> > While here, fix a related overflow in the computation of
> > max_res_rd_atom. It is derived as max_qp_rd_atom * max_qp, both of
> > which are int after the assignment above; the multiplication can
> > overflow an int even with the new clamps in place. Widen to s64
> > before multiplying and clamp the result to INT_MAX.
> > 
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > ---
> > Changes in v3:
> > * Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
> >   caps cache does not need to be clamped.
> > * Move all clamping exclusively to mana_ib_query_device(), which is the
> >   only place the cached u32 values are narrowed into the signed int
> >   fields of struct ib_device_attr.
> > * Reframe commit message: this is a u32-to-int type boundary fix, not a
> >   CVM/untrusted-hardware hardening patch.
> 
> You should align all types to u32 and avoid hiding the issue behind  
> min_t().
> 
> Thanks

Yes Leon, I'm currently at v7 of this patch, and I plan to avoid
min_t() entirely in the next version.

- Vennela

^ permalink raw reply

* Re: [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: sashiko-bot @ 2026-06-20  5:54 UTC (permalink / raw)
  To: Erni Sri Satya Vennela; +Cc: linux-hyperv, bpf
In-Reply-To: <20260619055348.467224-1-ernis@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Preexisting error handling bug in XDP set path leaves apc->bpf_prog pointing to a freed BPF program, leading to a Use-After-Free/Double Free.
--

commit 425f6bf5274f2172e13cdc649ea2288d8dc1ad72
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0

This commit fixes a regression for older PF firmware versions by allowing
an adapter_mtu of 0 to fall back to the standard ETH_FRAME_LEN. It also
updates MTU bounds checks for XDP limits.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> index b5e9bb184a1dd..53308e139cbe9 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> @@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
>  		bpf_prog_put(old_prog);
>  
>  	if (prog)
> -		ndev->max_mtu = MANA_XDP_MTU_MAX;
> +		ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
> +				      gc->adapter_mtu - ETH_HLEN);

[Severity: High]
This isn't a bug introduced by this patch, but while reviewing the changes
to mana_xdp_set(), I noticed a pre-existing issue on an error path earlier
in this function.

Earlier in mana_xdp_set(), the program is assigned:

    apc->bpf_prog = prog;

If the port is up, it then allocates rx buffers:

    err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
    if (err) {
        ...
        return err;
    }

If mana_pre_alloc_rxbufs() fails, it returns directly rather than jumping
to the err_dealloc_rxbuffs label. This bypasses the cleanup step that
restores apc->bpf_prog = old_prog.

Since the networking core frees the bpf program on error, does this leave
apc->bpf_prog pointing to a freed program?

If mana_xdp_set() is called again later, could fetching old_prog lead to a
use-after-free or double-free when calling bpf_prog_put(old_prog)?
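
To make the question above concrete, here is a minimal userspace sketch
of the pattern being asked about, written with a restore-on-error path.
The types, the fail_alloc knob and the err_restore label are
hypothetical stand-ins, not the real mana or BPF definitions, and this
is not a proposed fix for the driver.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the real mana or BPF types. */
struct sketch_prog { int id; };
struct sketch_port {
	struct sketch_prog *bpf_prog;
	int port_is_up;
	int fail_alloc; /* test knob: force the allocation step to fail */
};

static int sketch_pre_alloc_rxbufs(struct sketch_port *apc)
{
	return apc->fail_alloc ? -1 : 0;
}

/*
 * Install prog on the port. On failure the previous program pointer is
 * restored, so the port never keeps a pointer to a program that the
 * caller is about to release.
 */
static int sketch_xdp_set(struct sketch_port *apc, struct sketch_prog *prog)
{
	struct sketch_prog *old_prog = apc->bpf_prog;
	int err;

	apc->bpf_prog = prog;

	if (apc->port_is_up) {
		err = sketch_pre_alloc_rxbufs(apc);
		if (err)
			goto err_restore; /* not a bare "return err" */
	}
	return 0;

err_restore:
	apc->bpf_prog = old_prog;
	return err;
}
```

In this model a failed allocation leaves bpf_prog pointing at the old
program. A bare return at that point would leave it pointing at prog,
which is the situation the question describes.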

>  	else
>  		ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260619055348.467224-1-ernis@linux.microsoft.com?part=1

^ permalink raw reply

* [GIT PULL] Hyper-V patches for v7.2
From: Wei Liu @ 2026-06-22  6:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wei Liu, Linux on Hyper-V List, Linux Kernel List, kys, haiyangz,
	decui, longli

Hi Linus,

The following changes since commit e7ae89a0c97ce2b68b0983cd01eda67cf373517d:

  Linux 7.1-rc5 (2026-05-24 13:48:06 -0700)

are available in the Git repository at:

  ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-next-signed-20260621

for you to fetch changes up to a4ffc59238be84dd1c26bf1c001543e832674fc6:

  mshv: add bounds check on vp_index in mshv_intercept_isr() (2026-06-07 23:22:46 -0700)

----------------------------------------------------------------
hyperv-next for v7.2-rc1
 - Use wakeup mailbox to boot APs in Hyper-V VTL2 TDX guests (Yunhong Jiang,
   Ricardo Neri)
 - Move the Hyper-V IOMMU to its own subdirectory (Mukesh Rathor)
 - Cosmetic changes to mshv and balloon driver (Junrui Luo, Markus
   Elfring)
----------------------------------------------------------------
Junrui Luo (1):
      mshv: add bounds check on vp_index in mshv_intercept_isr()

Markus Elfring (1):
      hv_balloon: Simplify data output in hv_balloon_debug_show()

Mukesh R (2):
      iommu/hyperv: Create hyperv subdirectory under drivers/iommu
      x86/hyperv: Cosmetic changes in irqdomain.c for readability

Ricardo Neri (6):
      x86/topology: Add missing struct declaration and attribute dependency
      x86/acpi: Add functions to setup and access the wakeup mailbox
      dt-bindings: reserved-memory: Wakeup Mailbox for Intel processors
      x86/dt: Parse the Wakeup Mailbox for Intel processors
      x86/acpi: Add a helper to get the address of the wakeup mailbox
      x86/hyperv/vtl: Use the wakeup mailbox to boot secondary CPUs

Yunhong Jiang (4):
      x86/hyperv/vtl: Set real_mode_header in hv_vtl_init_platform()
      x86/realmode: Make the location of the trampoline configurable
      x86/hyperv/vtl: Setup the 64-bit trampoline for TDX guests
      x86/hyperv/vtl: Mark the wakeup mailbox page as private

 .../reserved-memory/intel,wakeup-mailbox.yaml      |  49 +++++
 MAINTAINERS                                        |   2 +-
 arch/x86/hyperv/hv_vtl.c                           |  38 +++-
 arch/x86/hyperv/irqdomain.c                        | 198 +++++++++++----------
 arch/x86/include/asm/acpi.h                        |  16 ++
 arch/x86/include/asm/topology.h                    |   3 +
 arch/x86/include/asm/x86_init.h                    |   3 +
 arch/x86/kernel/acpi/madt_wakeup.c                 |  16 ++
 arch/x86/kernel/devicetree.c                       |  47 +++++
 arch/x86/kernel/x86_init.c                         |   3 +
 arch/x86/realmode/init.c                           |   7 +-
 drivers/hv/hv_balloon.c                            |   4 +-
 drivers/hv/mshv_synic.c                            |   5 +
 drivers/iommu/Kconfig                              |   9 -
 drivers/iommu/Makefile                             |   2 +-
 drivers/iommu/hyperv/Makefile                      |   2 +
 .../{hyperv-iommu.c => hyperv/hv-irq-remap-x86.c}  |   6 +-
 drivers/iommu/irq_remapping.c                      |   2 +-
 18 files changed, 290 insertions(+), 122 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/reserved-memory/intel,wakeup-mailbox.yaml
 create mode 100644 drivers/iommu/hyperv/Makefile
 rename drivers/iommu/{hyperv-iommu.c => hyperv/hv-irq-remap-x86.c} (99%)

^ permalink raw reply

* Re: [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Erni Sri Satya Vennela @ 2026-06-22  9:22 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, bpf
In-Reply-To: <20260620055406.C564C1F000E9@smtp.kernel.org>

On Sat, Jun 20, 2026 at 05:54:06AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> 
> Pre-existing issues:
> - [High] Preexisting error handling bug in XDP set path leaves apc->bpf_prog pointing to a freed BPF program, leading to a Use-After-Free/Double Free.
> --
> 
> commit 425f6bf5274f2172e13cdc649ea2288d8dc1ad72
> Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> 
> net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
> 
> This commit fixes a regression for older PF firmware versions by allowing
> an adapter_mtu of 0 to fall back to the standard ETH_FRAME_LEN. It also
> updates MTU bounds checks for XDP limits.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> > index b5e9bb184a1dd..53308e139cbe9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
> > @@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
> >  		bpf_prog_put(old_prog);
> >  
> >  	if (prog)
> > -		ndev->max_mtu = MANA_XDP_MTU_MAX;
> > +		ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
> > +				      gc->adapter_mtu - ETH_HLEN);
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but while reviewing the changes
> to mana_xdp_set(), I noticed a pre-existing issue on an error path earlier
> in this function.
> 
> Earlier in mana_xdp_set(), the program is assigned:
> 
>     apc->bpf_prog = prog;
> 
> If the port is up, it then allocates rx buffers:
> 
>     err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
>     if (err) {
>         ...
>         return err;
>     }
> 
> If mana_pre_alloc_rxbufs() fails, it returns directly rather than jumping
> to the err_dealloc_rxbuffs label. This bypasses the cleanup step that
> restores apc->bpf_prog = old_prog.
> 
> Since the networking core frees the bpf program on error, does this leave
> apc->bpf_prog pointing to a freed program?
> 
> If mana_xdp_set() is called again later, could fetching old_prog lead to a
> use-after-free or double-free when calling bpf_prog_put(old_prog)?
> 
> >  	else
> >  		ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
> >
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260619055348.467224-1-ernis@linux.microsoft.com?part=1

I'll take this up in a different patch.

Thank you,
Vennela
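
For readers following the review above, the error-path pattern it asks
about can be modelled in userspace. This is only a sketch under stated
assumptions, not the driver's code: struct model_port, model_pre_alloc()
and model_xdp_set() are hypothetical stand-ins for apc, 
mana_pre_alloc_rxbufs() and mana_xdp_set(). It shows the saved pointer
being restored before the error is returned, so the caller never sees a
pointer to a program it is about to free.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-port context (apc in the driver). */
struct model_port {
	const void *prog;	/* models apc->bpf_prog */
	int port_is_up;
};

/* Hypothetical stand-in for mana_pre_alloc_rxbufs(); fails on request. */
static int model_pre_alloc(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/*
 * Install a new program. If the buffer pre-allocation fails, jump to a
 * label that puts the old program back instead of returning directly.
 */
int model_xdp_set(struct model_port *p, const void *prog, int alloc_fails)
{
	const void *old_prog;
	int err;

	assert(p != NULL);

	old_prog = p->prog;
	p->prog = prog;

	if (p->port_is_up) {
		err = model_pre_alloc(alloc_fails);
		if (err)
			goto err_restore;
	}

	return 0;

err_restore:
	p->prog = old_prog;	/* undo the assignment made above */
	return err;
}
```

With a direct "return err" in place of the goto, p->prog would still hold
the new pointer after a failed call, which is the state the review is
asking about.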

^ permalink raw reply

* Re: [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Simon Horman @ 2026-06-22 10:41 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, ssengar, jacob.e.keller,
	gargaditya, kees, linux-hyperv, netdev, linux-kernel, bpf
In-Reply-To: <20260619055348.467224-1-ernis@linux.microsoft.com>

On Thu, Jun 18, 2026 at 10:53:38PM -0700, Erni Sri Satya Vennela wrote:
> Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
> MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
> ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
> mana_probe().
> 
> Some older PF firmware versions still in the field report
> adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
> hardening check in place, the MANA VF driver now fails to load on
> those hosts, breaking networking entirely for guests.
> 
> MANA hardware always supports the standard Ethernet MTU. Treat a
> reported adapter_mtu of 0 as "the PF did not advertise a value" and
> fall back to ETH_FRAME_LEN, the same value used for the pre-V2
> message version path. Only jumbo frames remain unavailable until
> the PF reports a valid MTU.
> 
> Other small-but-nonzero bogus values are still rejected, preserving
> the original protection against the unsigned-subtraction wrap that
> would otherwise let ndev->max_mtu underflow to a huge value.
> 
> Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

Reviewed-by: Simon Horman <horms@kernel.org>

FTR, I agree with your assessment that the issue flagged in the
AI-generated review of this patch on sashiko.dev can be
treated as a follow-up [1].

And I don't think the low priority issue flagged in the AI-generated
review on https://netdev-ai.bots.linux.dev/sashiko/ should impede progress
of this patch.

[1] https://lore.kernel.org/bpf/ajj+5mhswcqhI2z7@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
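
The fallback and the wrap protection described in the quoted commit
message can be illustrated with a small userspace model. It is a sketch
under assumptions rather than the driver's code: model_validate_adapter_mtu()
and model_max_mtu() are hypothetical names, and the XDP limit is passed in
as a parameter because the real value of MANA_XDP_MTU_MAX is not stated in
this thread. The ETH_* values are the usual Ethernet constants.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN      14u	/* Ethernet header length */
#define ETH_MIN_MTU   68u	/* smallest MTU the stack accepts */
#define ETH_FRAME_LEN 1514u	/* 1500 bytes of payload plus the header */

/*
 * Model of the check: a reported value of 0 means "not advertised" and
 * falls back to ETH_FRAME_LEN; any other too-small value is rejected.
 */
int model_validate_adapter_mtu(uint32_t reported, uint32_t *adapter_mtu)
{
	assert(adapter_mtu != NULL);

	if (reported == 0) {
		*adapter_mtu = ETH_FRAME_LEN;
		return 0;
	}

	if (reported < ETH_MIN_MTU + ETH_HLEN)
		return -EPROTO;

	*adapter_mtu = reported;
	return 0;
}

/*
 * Model of the max_mtu computation: the hardware limit, clamped to the
 * XDP limit when a program is attached. xdp_mtu_max is an assumed input.
 */
uint32_t model_max_mtu(uint32_t adapter_mtu, int xdp_attached,
		       uint32_t xdp_mtu_max)
{
	uint32_t hw_max = adapter_mtu - ETH_HLEN;

	if (xdp_attached && xdp_mtu_max < hw_max)
		return xdp_mtu_max;

	return hw_max;
}
```

The rejection of small nonzero values matters because the subtraction is
unsigned: a reported value of 10 would make 10 - ETH_HLEN wrap to
4294967292 in 32-bit arithmetic instead of going negative.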


^ permalink raw reply

* Re: [GIT PULL] Hyper-V patches for v7.2
From: pr-tracker-bot @ 2026-06-22 16:36 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux on Hyper-V List, Linux Kernel List,
	kys, haiyangz, decui, longli
In-Reply-To: <20260622064549.GA2852659@liuwe-devbox-debian-v2.local>

The pull request you sent on Sun, 21 Jun 2026 23:45:49 -0700:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-next-signed-20260621

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6e869de3a1b9ef9f096223e0e7f30c727de4f6bc

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Jakub Kicinski @ 2026-06-23  1:22 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, Yury Norov, linux-hyperv, linux-kernel,
	netdev, Paul Rosswurm, Shradha Gupta, Saurabh Singh Sengar,
	stable
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>

On Fri, 19 Jun 2026 00:33:35 -0700 Shradha Gupta wrote:
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org

If you want this to be a fix -- could you please rewrite the commit
message? What matters most is the comparison between the behavior before
the bad commit, with the bad commit, and then with this fix applied.
Perhaps the three cases you list are that, but it's not immediately obvious.
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-23  9:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, Yury Norov, linux-hyperv, linux-kernel,
	netdev, Paul Rosswurm, Shradha Gupta, Saurabh Singh Sengar,
	stable
In-Reply-To: <20260622182248.5bfc49ce@kernel.org>

On Mon, Jun 22, 2026 at 06:22:48PM -0700, Jakub Kicinski wrote:
> On Fri, 19 Jun 2026 00:33:35 -0700 Shradha Gupta wrote:
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> 
> If you want this to be a fix -- could you please rewrite the commit
> message? What matters most is the comparison between the behavior before
> the bad commit, with the bad commit, and then with this fix applied.
> Perhaps the three cases you list are that, but it's not immediately obvious.
> -- 
> pw-bot: cr

Sure, Jakub. Let me send out a newer version with this change in the
commit message.

regards,
Shradha

^ permalink raw reply

* [PATCH v5 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-24  7:21 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable, Yury Norov

Before the commit 755391121038 ("net: mana: Allocate MSI-X vectors
dynamically"), all the MANA IRQs were assigned statically and together
during early driver load.

After this commit, IRQ allocation for MANA is done in two phases: the
HWC IRQ is allocated first, and the queue IRQs are added dynamically at
a later point. By that time, the IRQ weights on the vCPUs can have become
imbalanced. If the IRQ count is greater than the vCPU count, the
topology-aware IRQ distribution logic in MANA can then place multiple
MANA IRQs on the same vCPU while sibling vCPUs get none (case 1).

On SMP-enabled, low-vCPU systems this becomes a bigger problem, because
the softIRQ handling overhead of two IRQs on the same vCPU is much higher
than if they were spread across sibling vCPUs.

In such cases, throughput drops significantly when many parallel TCP
connections are tested.

Fix the affinity assignment logic for the case where IRQs are added
dynamically and the IRQ count is greater than the vCPU count, by
spreading the queue IRQs across all vCPUs irrespective of their NUMA/core
bindings (case 2).

Setting the affinity and hint to NULL was also studied. With that
approach, if IRQs other than MANA's are already allocated on the VM when
the MANA IRQs are allocated, the MANA queue IRQs cluster again (case 3).


=======================================================
Case 1: without this patch
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Case 3: with affinity set to NULL
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		2
IRQ2:	mana_q2		3
IRQ3:	mana_q3		2
IRQ4:	mana_q4		3

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch	aff NULL
20480		15.65		7.73		5.25
10240		15.63		8.93		5.77
8192		15.64		9.69		7.16
6144		15.64		13.16		9.33
4096		15.69		15.75		13.50
2048		15.69		15.83		13.61
1024		15.71		15.28		13.60

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
---
Changes in v5
 * modify commit message to align with fix patch format
---
Changes in v4
 * Add mana prefix on irq_affinity_*() in mana driver
 * Corrected grammar, comment for mana_irq_setup_linear()
 * added new line as per guidelines
 * added case 3 in commit message for when affinity is NULL
---
Changes in v3
 * Optimize the comments in mana_gd_setup_dyn_irqs()
 * add more details in the dev_dbg for extra IRQs
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
 1 file changed, 64 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..e8b7ffb47eb9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
  * do the same thing.
  */
 
-static int irq_setup(unsigned int *irqs, unsigned int len, int node,
-		     bool skip_first_cpu)
+static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
+				     int node, bool skip_first_cpu)
 {
 	const struct cpumask *next, *prev = cpu_none_mask;
 	cpumask_var_t cpus __free(free_cpumask_var);
@@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* must be called with cpus_read_lock() held */
+static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, err, i, msi;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 		return -ENOMEM;
 
 	/*
+	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
+	 * nvec is only Queue IRQ (HWC already setup).
 	 * While processing the next pci irq vector, we start with index 1,
 	 * as IRQ vector at index 0 is already processed for HWC.
 	 * However, the population of irqs array starts with index 0, to be
-	 * further used in irq_setup()
+	 * further used in mana_irq_setup_numa_aware()
 	 */
 	for (i = 1; i <= nvec; i++) {
 		msi = i;
@@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	}
 
 	/*
-	 * When calling irq_setup() for dynamically added IRQs, if number of
-	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
-	 * first CPU sibling group since they are already affinitized to HWC IRQ
+	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
+	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
+	 * skip the first CPU sibling group since they are already affinitized to
+	 * HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
+						true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, our
+		 * queue IRQs should be equal to num of online vCPUs.
+		 * We try to make sure queue IRQs spread across all vCPUs.
+		 * In such a case NUMA or CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, queue IRQs will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		int first_unassigned = num_online_cpus();
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		if (nvec > first_unassigned) {
+			char buf[32];
+
+			if (first_unassigned == nvec - 1)
+				snprintf(buf, sizeof(buf), "%d",
+					 first_unassigned);
+			else
+				snprintf(buf, sizeof(buf), "%d-%d",
+					 first_unassigned, nvec - 1);
+
+			dev_dbg(&pdev->dev,
+				"MANA IRQ indices #%s will retain the default CPU affinity\n",
+				buf);
+		}
+
+		mana_irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();
@@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 		nvec -= 1;
 	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, false);
+	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
-- 
2.34.1
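
The linear spreading that mana_irq_setup_linear() performs in the patch
above can be sketched as a userspace model. The names here
(model_irq_setup_linear(), MODEL_NO_AFFINITY, the online[] array) are
hypothetical and only stand in for for_each_online_cpu() and
irq_set_affinity_and_hint(); this is not the driver's code. It shows one
IRQ being pinned to each online CPU in order, with any IRQs beyond the
online CPU count left at their default affinity.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NO_AFFINITY (-1)	/* IRQ keeps its default affinity */

/*
 * online[cpu] is nonzero when that CPU is online. affinity[i] receives
 * the CPU chosen for IRQ i, or MODEL_NO_AFFINITY when the online CPUs
 * run out first. Returns the number of IRQs that were pinned.
 */
size_t model_irq_setup_linear(const unsigned char *online, size_t ncpus,
			      int *affinity, size_t nirqs)
{
	size_t assigned = 0;
	size_t cpu, i;

	assert(online != NULL && affinity != NULL);

	for (i = 0; i < nirqs; i++)
		affinity[i] = MODEL_NO_AFFINITY;

	for (cpu = 0; cpu < ncpus && assigned < nirqs; cpu++) {
		if (!online[cpu])
			continue;	/* skip offline CPUs, as the iterator would */

		affinity[assigned++] = (int)cpu;
	}

	return assigned;
}
```

With four online CPUs and four queue IRQs the model yields CPUs 0, 1, 2
and 3, matching the case 2 table. If one CPU has gone offline, the last
IRQ stays unpinned, which is the situation the dev_dbg() message in the
patch reports.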


^ permalink raw reply related

* Re: [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Harry Wentland @ 2026-06-24 16:06 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, sunpeng.li, siqueira,
	alexander.deucher, rodrigo.vivi, joonas.lahtinen, tursulin,
	javierm, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-2-tzimmermann@suse.de>



On 2026-06-10 11:18, Thomas Zimmermann wrote:
> The mode-setting pipeline can disable damage clipping for a commit
> by setting ignore_damage_clips in struct drm_plane_state. The commit
> will then do a full display update.
> 
> Test the flag in DCN code and do a full update if it has been set.
> 
> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
> to ignore damage clips") introduced ignore_damage_clips to selectively
> ignore damage clipping in certain framebuffer changes. This driver does
> not do that, but DRM's damage iterator will soon rely on the flag.
> Therefore supporting it here as well makes sense for consistency.
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
> Cc: Javier Martinez Canillas <javierm@redhat.com>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Zack Rusin <zackr@vmware.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: <stable@vger.kernel.org> # v6.8+

While I haven't looked thoroughly at the rest of the series, this
patch for amdgpu_dm looks fine.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>

Harry

> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 0e20194e6662..4cbb27f65a0b 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -6614,8 +6614,8 @@ static void fill_dc_dirty_rects(struct drm_plane *plane,
>  {
>  	struct dm_crtc_state *dm_crtc_state = to_dm_crtc_state(crtc_state);
>  	struct rect *dirty_rects = flip_addrs->dirty_rects;
> -	u32 num_clips;
> -	struct drm_mode_rect *clips;
> +	u32 num_clips = 0;
> +	struct drm_mode_rect *clips = NULL;
>  	bool bb_changed;
>  	bool fb_changed;
>  	u32 i = 0;
> @@ -6631,8 +6631,10 @@ static void fill_dc_dirty_rects(struct drm_plane *plane,
>  	if (new_plane_state->rotation != DRM_MODE_ROTATE_0)
>  		goto ffu;
>  
> -	num_clips = drm_plane_get_damage_clips_count(new_plane_state);
> -	clips = drm_plane_get_damage_clips(new_plane_state);
> +	if (!new_plane_state->ignore_damage_clips) {
> +		num_clips = drm_plane_get_damage_clips_count(new_plane_state);
> +		clips = drm_plane_get_damage_clips(new_plane_state);
> +	}
>  
>  	if (num_clips && (!amdgpu_damage_clips || (amdgpu_damage_clips < 0 &&
>  						   is_psr_su)))


^ permalink raw reply

* [PATCH] hyperv: mshv: zero VTL hypercall output page
From: Yousef Alhouseen @ 2026-06-24 17:21 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv, linux-kernel, Yousef Alhouseen

mshv_vtl_hvcall_call() copies output_size bytes from a freshly allocated
hypercall output page back to userspace. The page is currently allocated
without __GFP_ZERO, so any bytes not written by the hypervisor are copied
from stale page contents.

Allocate the output page zeroed before issuing the hypercall. Also check
both bounce-page allocations before using them so memory pressure cannot
turn the copy paths into NULL pointer dereferences.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 drivers/hv/mshv_vtl_main.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 0d3d41619..0365d207c 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -1147,7 +1147,11 @@ static int mshv_vtl_hvcall_call(struct mshv_vtl_hvcall_fd *fd,
 	 * TODO: Take care of this when CVM support is added.
 	 */
 	in = (void *)__get_free_page(GFP_KERNEL);
-	out = (void *)__get_free_page(GFP_KERNEL);
+	out = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	if (!in || !out) {
+		ret = -ENOMEM;
+		goto free_pages;
+	}

 	if (copy_from_user(in, (void __user *)hvcall.input_ptr, hvcall.input_size)) {
 		ret = -EFAULT;
@@ -1162,8 +1166,10 @@ static int mshv_vtl_hvcall_call(struct mshv_vtl_hvcall_fd *fd,
 	}
 	ret = put_user(hvcall.status, &hvcall_user->status);
 free_pages:
-	free_page((unsigned long)in);
-	free_page((unsigned long)out);
+	if (in)
+		free_page((unsigned long)in);
+	if (out)
+		free_page((unsigned long)out);

 	return ret;
 }
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH] hyperv: mshv: zero VTL hypercall output page
From: sashiko-bot @ 2026-06-24 17:32 UTC (permalink / raw)
  To: Yousef Alhouseen; +Cc: linux-hyperv
In-Reply-To: <20260624172157.2790-1-alhouseenyousef@gmail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] This problem wasn't introduced by this patch, but the patch only partially fixes the information leak. While it zeroes the `out` page, the input bounce page `in` is still allocated without the `__GFP_ZERO` flag. If a user provides an `hvcall.input_size` that is smaller than the input size expected by the hypercall (specified by `hvcall.control`), `copy_from_user()` will only partially initialize the `in` page. The remainder of the page will contain uninitialized, stale kernel memory. This full page is then passed to the hypervisor. If the hypervisor processes these uninitialized bytes and reflects them into the `out` page or alters its observable behavior based on them, sensitive kernel data is leaked to userspace.
--

commit d8767fa45f0b927a8630a3fee3fba738b9ae7851
Author: Yousef Alhouseen <alhouseenyousef@gmail.com>

hyperv: mshv: zero VTL hypercall output page
    
This patch zeroes the hypercall output page to avoid copying stale memory
to userspace and adds NULL pointer checks for the page allocations.

> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
> index 0d3d4161974f8..0365d207cb110 100644
> --- a/drivers/hv/mshv_vtl_main.c
> +++ b/drivers/hv/mshv_vtl_main.c
> @@ -1147,7 +1147,11 @@ static int mshv_vtl_hvcall_call(struct mshv_vtl_hvcall_fd *fd,
>  	 * TODO: Take care of this when CVM support is added.
>  	 */
>  	in = (void *)__get_free_page(GFP_KERNEL);

[Severity: High]
This isn't a bug introduced by this patch, but should the "in" page also
be allocated with __GFP_ZERO to prevent leaking stale kernel memory?

If userspace provides an hvcall.input_size that is smaller than what the
hypercall expects, copy_from_user() will only partially initialize the
"in" page. The remainder of the page will contain uninitialized kernel
memory which is then passed to the hypervisor in hv_do_hypercall(). If the
hypervisor reflects this input or alters behavior based on it, could this
result in an information leak?

> -	out = (void *)__get_free_page(GFP_KERNEL);
> +	out = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!in || !out) {
> +		ret = -ENOMEM;
> +		goto free_pages;
> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624172157.2790-1-alhouseenyousef@gmail.com?part=1

^ permalink raw reply

* [PATCH] hyperv: mshv: zero VTL hypercall input page
From: Yousef Alhouseen @ 2026-06-24 17:57 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv, linux-kernel, Yousef Alhouseen

mshv_vtl_hvcall_call() copies only hvcall.input_size bytes of user input
into the input bounce page and then passes the page to hv_do_hypercall().
For short inputs, stale bytes remain in the rest of the page and can be
consumed by the hypervisor.

Allocate the input page zeroed, matching the output page.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 drivers/hv/mshv_vtl_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 0365d207c..f2633148c 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -1146,7 +1146,7 @@ static int mshv_vtl_hvcall_call(struct mshv_vtl_hvcall_fd *fd,
 	 *
 	 * TODO: Take care of this when CVM support is added.
 	 */
-	in = (void *)__get_free_page(GFP_KERNEL);
+	in = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	out = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	if (!in || !out) {
 		ret = -ENOMEM;
-- 
2.54.0
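
Taken together, the two mshv patches above leave both bounce pages zeroed
and checked before use. A userspace model of that shape is sketched below.
It rests on assumptions: model_hvcall() and MODEL_PAGE_SIZE are
hypothetical, calloc() stands in for __get_free_page(GFP_KERNEL |
__GFP_ZERO), and the "hypercall" is a made-up operation that echoes eight
bytes of the input page into the output page.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Copy input_size bytes into a zeroed input page, run the stand-in
 * "hypercall", then copy output_size bytes back to the caller.
 */
int model_hvcall(const void *input, size_t input_size,
		 void *output, size_t output_size)
{
	unsigned char *in, *out;
	int ret = 0;

	if (input_size > MODEL_PAGE_SIZE || output_size > MODEL_PAGE_SIZE)
		return -EINVAL;

	/* Both pages start out zeroed, like __GFP_ZERO allocations. */
	in = calloc(1, MODEL_PAGE_SIZE);
	out = calloc(1, MODEL_PAGE_SIZE);
	if (!in || !out) {
		ret = -ENOMEM;
		goto free_pages;
	}

	memcpy(in, input, input_size);

	/* Stand-in hypercall: reads past input_size when the input is short. */
	memcpy(out, in, 8);

	memcpy(output, out, output_size);

free_pages:
	free(in);	/* free(NULL) is a no-op, so no NULL checks are needed */
	free(out);

	return ret;
}
```

Because the input page is zeroed, a four-byte input makes the stand-in
hypercall see four bytes of data followed by zeros rather than whatever
the allocator last held, and the bytes of the output page it never wrote
are returned as zeros.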

^ permalink raw reply related
